teilab.datasets module

This module handles the datasets used in the lecture. The table below describes the meaning of the values in each column of the data. To look up a class or method, skip ahead to the API reference that follows the tables.

Reference: Agilent Feature Extraction 12.0 Reference Guide

FEATURES

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=True)
>>> df_data = datasets.read_data(no=0)
>>> df_data.columns
    Index(['FEATURES', 'FeatureNum', 'Row', 'Col', 'accessions', ... ], dtype='object')

| G2505C | G2600D | Features | Types | Description |
|---|---|---|---|---|
| 1 | 1 | FeatureNum | int | Feature number. |
| 2 | 2 | Row | int | Feature location: row. |
| 3 | 3 | Col | int | Feature location: column. |
| 4 |  | accessions | str | Gene accession numbers. |
| 5 |  | chr_coord | str | Chromosome coordinates of the feature. |
| 6 | 4 | SubTypeMask | int | Numeric code defining the subtype of any control feature. |
| 7 |  | SubTypeName | str | Name of the subtype of any control feature. |
| 8 |  | Start | int | Indicates the place in the transcript where the probe sequence starts. |
| 9 |  | Sequence | str | The sequence of bases printed on the array. |
| 10 |  | ProbeUID | int | Unique integer for each unique probe in a design. |
| 11 | 5 | ControlType | int | Feature control type. |
| 12 | 6 | ProbeName | str | An Agilent-assigned identifier for the probe synthesized on the microarray. |
| 13 |  | GeneName | str | This is an identifier for the gene for which the probe provides expression information. The target sequence identified by the systematic name is normally a representative or consensus sequence for the gene. |
| 14 | 7 | SystematicName | str | This is an identifier for the target sequence that the probe was designed to hybridize with. Where possible, a public database identifier is used (e.g., TAIR locus identifier for Arabidopsis). SystematicName is reported ONLY if Gene name and Systematic name are different. |
| 15 |  | Description | str | Description of gene. |
| 16 | 8 | PositionX | float | Found coordinates (X) of the feature centroid in microns. |
| 17 | 9 | PositionY | float | Found coordinates (Y) of the feature centroid in microns. |
| 18 |  | gSurrogateUsed | float | A non-zero value is the g surrogate value that was used. Zero means that no surrogate value was used. |
| 19 |  | gIsFound | bool | A boolean used to flag found features. The flag is applied independently in each channel (1 = IsFound, 0 = IsNotFound). A feature is considered Found if two conditions are true: 1) the difference between the feature signal and the local background signal is more than 1.5 times the local background noise and 2) the spot diameter is at least 0.30 times the nominal spot diameter. |
| 20 | 10 | gProcessedSignal | float | The signal left after all the Feature Extraction processing steps have been completed. In the case of one color, ProcessedSignal contains the Multiplicatively Detrended BackgroundSubtracted Signal if the detrending is selected and helps. If the detrending does not help, this column will contain the BackgroundSubtractedSignal. |
| 21 | 11 | gProcessedSigError | float | The universal or propagated error left after all the processing steps of Feature Extraction have been completed. In the case of one color, ProcessedSignalError has had the Error Model applied and will contain at least the larger of the universal (UEM) error or the propagated error. If multiplicative detrending is performed, ProcessedSignalError contains the error propagated from detrending. This is done by dividing the error by the normalized MultDetrendSignal. |
| 22 |  | gNumPixOLHi | int | Number of outlier pixels per feature with intensity > upper threshold set via the pixel outlier rejection method. The number is computed independently in each channel. These pixels are omitted from all subsequent calculations. |
| 23 |  | gNumPixOLLo | int | Number of outlier pixels per feature with intensity < lower threshold set via the pixel outlier rejection method. The number is computed independently in each channel. These pixels are omitted from all subsequent calculations. NOTE: The pixel outlier method is the ONLY step that removes data in Feature Extraction. |
| 24 |  | gNumPix | int | Total number of pixels used to compute feature statistics, i.e. the total number of inlier pixels per spot; same in both channels. |
| 25 |  | gMeanSignal | float | Raw mean signal of feature from inlier pixels in green and/or red channel. |
| 26 | 12 | gMedianSignal | float | Raw median signal of feature from inlier pixels in green and/or red channel. |
| 27 |  | gPixSDev | float | Standard deviation of all inlier pixels per feature; this is computed independently in each channel. |
| 28 |  | gPixNormIQR | float | The normalized inter-quartile range of all of the inlier pixels per feature. The range is computed independently in each channel. |
| 29 |  | gBGNumPix | int | Total number of pixels used to compute local BG statistics per spot, i.e. the total number of BG inlier pixels; same in both channels. |
| 30 | 25 | gBGMeanSignal | float | Mean local background signal (local to corresponding feature) computed per channel (inlier pixels). |
| 31 | 13 | gBGMedianSignal | float | Median local background signal (local to corresponding feature) computed per channel (inlier pixels). |
| 32 | 14 | gBGPixSDev | float | Standard deviation of all inlier pixels per local BG of each feature, computed independently in each channel. |
| 33 |  | gBGPixNormIQR | float | The normalized inter-quartile range of all of the inlier pixels per local BG of each feature. The range is computed independently in each channel. |
| 34 |  | gNumSatPix | int | Total number of saturated pixels per feature, computed per channel. |
| 35 | 15 | gIsSaturated | bool | Boolean flag indicating if a feature is saturated or not. A feature is saturated if 50% of the pixels in the feature are above the saturation threshold (1 = Saturated, 0 = Not saturated). |
| 36 | 16 | gIsFeatNonUnifOL | bool | Boolean flag indicating if a feature is a NonUniformity Outlier or not. A feature is non-uniform if the pixel noise of the feature exceeds a threshold established for a “uniform” feature. g(r)IsFeatNonUnifOL = 1 indicates Feature is a non-uniformity outlier in g(r). |
| 37 | 17 | gIsBGNonUnifOL | bool | The same concept as above but for background. g(r)IsBGNonUnifOL = 1 indicates Local background is a non-uniformity outlier in g(r). |
| 38 | 18 | gIsFeatPopnOL | bool | Boolean flag indicating if a feature is a Population Outlier or not. Probes with replicate features on a microarray are examined using population statistics. A feature is a population outlier if its signal is less than a lower threshold or exceeds an upper threshold determined using a multiplier (1.42) times the interquartile range (i.e., IQR) of the population. g(r)IsFeatPopnOL = 1 indicates Feature is a population outlier in g(r). |
| 39 | 19 | gIsBGPopnOL | bool | The same concept as above but for background. g(r)IsBGPopnOL = 1 indicates local background is a population outlier in g(r). |
| 40 | 20 | IsManualFlag | bool | Boolean to flag features for downstream filtering in third party gene expression software. |
| 41 | 21 | gBGSubSignal | float | Background-subtracted signal. To display the values used to calculate this variable using different background signals and settings of spatial detrend and global background adjust, see Table 34 on page 254. g(r)BGSubSignal = g(r)MeanSignal - g(r)BGUsed. |
| 42 |  | gBGSubSigError | float | Propagated standard error as computed on net g(r) background-subtracted signal. For one color, the error model is applied to the background-subtracted signal. This will contain the larger of the universal (UEM) error or the propagated error. |
| 43 | 22 | gIsPosAndSignif | bool | Boolean flag, established via a 2-sided t-test, indicating if the mean signal of a feature is greater than the corresponding background (selected by user) and if this difference is significant. g(r)IsPosAndSignif = 1 indicates Feature is positive and significant above background. |
| 44 |  | gPValFeatEqBG | float | pValue from t-test of significance between g(r)Mean signal and g(r) background (selected by user). |
| 45 |  | gNumBGUsed | int | Number of local background regions or features used to calculate the background used for background subtraction on this feature. |
| 46 | 23 | gIsWellAboveBG | bool | Boolean flag indicating if a feature is well above background or not. The feature must pass g(r)IsPosAndSignif and, additionally, g(r)BGSubSignal must be greater than 2.6*g(r)BG_SD. You can change the multiplier 2.6. |
| 47 |  | gBGUsed | float | Background used to subtract from the MeanSignal; variable also used in t-test. To display the values used to calculate this variable using different background signals and settings of spatial detrend and global background adjust, see Table 34 on page 254. g(r)BGSubSignal = g(r)MeanSignal - g(r)BGUsed. |
| 48 |  | gBGSDUsed | float | Standard deviation of background used in g(r) channel; variable also used in t-test and surrogate algorithms. To display the values used to calculate this variable using different background signals and settings of spatial detrend and global background adjust, see Table 34 on page 254. |
| 49 |  | ErrorModel | bool | Indicates the error model that you chose for Feature Extraction or that the software uses if you have chosen the “Most Conservative” option. 0 = Propagated model chosen by you or by software. 1 = Universal error model chosen by you or by software. |
| 50 |  | gSpatialDetrendIsInFilteredSet | bool | Set to true for a given feature if it is part of the filtered set used to detrend the background. This feature is considered part of the locally weighted lowest x% of features as defined by the DetrendLowPassPercentage. 1 = Feature in filtered set. 0 = Feature not in filtered set. |
| 51 |  | gSpatialDetrendSurfaceValue | float | Value of the smoothed surface calculated by the Spatial detrend algorithm. |
| 52 | 24 | SpotExtentX | float | Diameter of the spot (X-axis). |
| 53 |  | SpotExtentY | float | Diameter of the spot (Y-axis). |
| 54 |  | gNetSignal | float | MeanSignal minus DarkOffset. |
| 55 |  | gMultDetrendSignal | float | A surface is fitted through the log of the background-subtracted signal to look for multiplicative gradients. A normalized version of that surface interpolated at each point of the microarray is stored in MultDetrendSignal. The surface is normalized by dividing each point by the overall average of the surface. That average is stored in MultDetrendSurfaceAverage as a statistic. 1-color only. |
| 56 |  | gProcessedBackground | float | Indicates the Background signal that was selected to be used (Mean or Median). |
| 57 |  | gProcessedBkngError | float | Indicates the Background error that was selected to be used (PixSD or NormIQR). |
| 58 |  | IsUsedBGAdjust | bool | A Boolean used to flag features used for computation of global BG offset. 1 = Feature used. 0 = Feature not used. |
| 59 |  | gInterpolatedNegCtrlSub | float | Value at the polynomial fit of the negative controls. |
| 60 |  | gIsInNegCtrlRange | bool | Set to true for a given feature if its signal intensity is in the negative control range. |
| 61 |  | gIsUsedInMD | bool | Indicates whether this feature was included in the set used to generate the multiplicative detrend surface. |
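The gIsWellAboveBG definition in the table can be written out in pandas. The following is only a minimal sketch on made-up values. It assumes that "BG_SD" in the description refers to the gBGSDUsed column, and is_well_above_bg is a hypothetical helper that is not part of teilab.

```python
import pandas as pd


def is_well_above_bg(df: pd.DataFrame, multiplier: float = 2.6) -> pd.Series:
    """Recompute a well-above-background flag from its textual description.

    Assumption: "BG_SD" in the description corresponds to ``gBGSDUsed``.
    """
    # Condition 1: the feature passed the positive-and-significant test.
    passes_ttest = df["gIsPosAndSignif"].astype(bool)
    # Condition 2: background-subtracted signal exceeds multiplier * background SD.
    above = df["gBGSubSignal"] > multiplier * df["gBGSDUsed"]
    return passes_ttest & above


# Made-up values for three features (not real measurements).
df = pd.DataFrame({
    "gIsPosAndSignif": [1, 1, 0],
    "gBGSubSignal": [100.0, 10.0, 100.0],
    "gBGSDUsed": [5.0, 5.0, 5.0],
})
flags = is_well_above_bg(df)
print(flags.tolist())
```

Only the first feature satisfies both conditions: the second falls below 2.6 * 5.0 = 13.0, and the third fails the significance flag.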

Background Subtraction

The feature background-subtracted signal, BGSubSignal, is calculated by subtracting a value called the BGUsed from the feature mean signal.

\[\text{ BGSubSignal } = \text{ MeanSignal } - \text{ BGUsed }\]

where BGSubSignal and BGUsed depend on the type of background method and the settings for spatial detrend and global background adjust. See the following table.

| Background Subtraction Method | Background Subtraction Variable | SpDe OFF / GBA OFF | SpDe ON / GBA OFF | SpDe OFF / GBA ON | SpDe ON / GBA ON |
|---|---|---|---|---|---|
| No background subtract | BGUsed = | BGMeanSignal | SDSV | BGAdjust | SDSV + BGAdjust |
| No background subtract | BGSDUsed = | BGPixSDev | BGPixSDev | BGPixSDev | BGPixSDev |
| No background subtract | BGSubSignal = | MeanSignal | MeanSignal - BGUsed | MeanSignal - BGUsed | MeanSignal - BGUsed |
| Local Background | BGUsed = | BGMeanSignal | BGMeanSignal + SDSV | BGMeanSignal + BGAdjust | BGMeanSignal + SDSV + BGAdjust |
| Local Background | BGSDUsed = | BGPixSDev | BGPixSDev | BGPixSDev | BGPixSDev |
| Local Background | BGSubSignal = | MeanSignal - BGUsed | MeanSignal - BGUsed | MeanSignal - BGUsed | MeanSignal - BGUsed |
| Global Background method | BGUsed = | GBGIA | GBGIA + SDSV | GBGIA + BGAdjust | GBGIA + SDSV + BGAdjust |
| Global Background method | BGSDUsed = | GBGISD | GBGISD | GBGISD | GBGISD |
| Global Background method | BGSubSignal = | MeanSignal - BGUsed | MeanSignal - BGUsed | MeanSignal - BGUsed | MeanSignal - BGUsed |

  • SpDe: Spatial Detrend

  • SDSV: SpatialDetrendSurfaceValue

  • GBA: Global Bkgnd Adjust

  • GBGIA: GlobalBGInlierAve

  • GBGISD: GlobalBGInlierSDev
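The background subtraction table can be restated as a small function. This is a sketch under stated assumptions, not Feature Extraction's actual implementation: the method labels "none", "local" and "global" and the helper names bg_used and bg_sub_signal are hypothetical, and all input values are made up.

```python
def bg_used(method, spde=False, gba=False,
            bg_mean=0.0, gbgia=0.0, sdsv=0.0, bg_adjust=0.0):
    """Return BGUsed for one feature, following the table row by row.

    method: "none", "local" or "global" (hypothetical labels for the three methods).
    spde / gba: whether Spatial Detrend / Global Background Adjust is ON.
    """
    if method == "none":
        if not spde and not gba:
            # The table lists BGMeanSignal here, although nothing is subtracted.
            return bg_mean
        base = 0.0
    elif method == "local":
        base = bg_mean
    elif method == "global":
        base = gbgia
    else:
        raise ValueError(f"unknown method: {method!r}")
    return base + (sdsv if spde else 0.0) + (bg_adjust if gba else 0.0)


def bg_sub_signal(mean_signal, method, spde=False, gba=False, **levels):
    """Return BGSubSignal = MeanSignal - BGUsed, with the one exception in the table."""
    if method == "none" and not spde and not gba:
        return mean_signal
    return mean_signal - bg_used(method, spde, gba, **levels)


# "No background subtract" with SpDe ON / GBA OFF: only SDSV is subtracted.
print(bg_sub_signal(120.0, "none", spde=True, sdsv=20.0))
# "Local Background" with SpDe ON / GBA ON.
print(bg_sub_signal(120.0, "local", spde=True, gba=True,
                    bg_mean=30.0, sdsv=20.0, bg_adjust=5.0))
```

The first call corresponds to the configuration reported for the lecture data below (BGUsed equal to the spatial detrend surface value).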

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> df_meta = datasets.read_meta(no=-1)
>>> df_meta["BGSubtractor_BGSubMethod"].values[0]
    7

A BGSubMethod of 7 corresponds to the “No Background Subtraction” method (see Table 17 on page 129). In addition, for this sample:

  • Global Background Adjustment is turned Off.

  • Spatial Detrending is turned On.

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> df_data = datasets.read_data(no=-1)
>>> all(df_data["gBGUsed"] == df_data["gSpatialDetrendSurfaceValue"])
    True
>>> all(abs((df_data["gBGSDUsed"]-df_data["gBGPixSDev"])/df_data["gBGSDUsed"])<1e-3)
    True
>>> all(abs((df_data["gBGSubSignal"]-(df_data["gMeanSignal"]-df_data["gBGUsed"]))/df_data["gBGSubSignal"]) < 1.4e-1)
    True
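The relative-error checks above divide by the measured column, so a zero in that column produces NaN and the comparison then evaluates to False. As a possible alternative, numpy.allclose combines a relative and an absolute tolerance. The sketch below uses made-up values, and the tolerances are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; the second feature has a background-subtracted signal of exactly 0.
df = pd.DataFrame({
    "gMeanSignal": [120.0, 50.0, 30.0],
    "gBGUsed": [20.0, 50.0, 10.0],
    "gBGSubSignal": [100.0, 0.0, 20.001],
})
expected = df["gMeanSignal"] - df["gBGUsed"]

# Manual relative error: 0/0 gives NaN, and NaN < tolerance is False.
manual = all(abs((df["gBGSubSignal"] - expected) / df["gBGSubSignal"]) < 1e-3)

# allclose tests |a - b| <= atol + rtol * |b| element-wise, so zeros are handled.
ok = np.allclose(df["gBGSubSignal"], expected, rtol=1e-3, atol=1e-6)

print(manual, ok)
```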
class teilab.datasets.Samples(sample_list_path: str)

Bases: object

Utility Sample Class for this lecture.

_df

Sample information described in the text file at sample_list_path.

Type

pd.DataFrame

SampleNumber

Index numbers for each sample.

Type

NDArray[Any,str]

FileName

File names for each sample.

Type

NDArray[Any,str]

Condition

Experimental conditions for each sample.

Type

NDArray[Any,str]

_group_names

i-th group’s filename prefix.

Type

NDArray[Any,str]

_group_numbers

Number of the group that the j-th sample belongs to.

Type

NDArray[Any,np.uint8]

Examples

>>> from teilab.datasets import Samples
>>> from teilab.utils._path import SAMPLE_LIST_PATH
>>> samples = Samples(sample_list_path=SAMPLE_LIST_PATH)
>>> print(sorted(samples.__dict__.keys()))
['Condition', 'FileName', 'SampleNumber', '_df', '_group_names', '_group_numbers']
grouping() -> None

Group the samples based on their filenames.
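One way such filename-based grouping could work is sketched below. It assumes that the group name is the filename with its last three underscore-separated fields removed, which matches the file names shown in this documentation, but it is not necessarily how grouping() is implemented. group_by_prefix is a hypothetical helper.

```python
from typing import List, Tuple


def group_by_prefix(filenames: List[str], n_suffix_fields: int = 3) -> Tuple[List[str], List[int]]:
    """Derive group names and per-sample group numbers from filename prefixes.

    Assumption: the last ``n_suffix_fields`` underscore-separated fields
    (for example "Jun14_1_1.txt") are sample-specific and are dropped.
    """
    group_names: List[str] = []
    group_numbers: List[int] = []
    for name in filenames:
        prefix = "_".join(name.split("_")[:-n_suffix_fields])
        if prefix not in group_names:
            group_names.append(prefix)  # first time this prefix is seen
        group_numbers.append(group_names.index(prefix))
    return group_names, group_numbers


filenames = [
    "SG19378659_257236339458_S001_GE1_1200_Jun14_1_1.txt",
    "SG19378659_257236339458_S001_GE1_1200_Jun14_1_2.txt",
    "US91503671_253949442637_S01_GE1_105_Dec08_1_4.txt",
    "US91503671_253949442637_S01_GE1_105_Dec08_2_2.txt",
]
names, numbers = group_by_prefix(filenames)
print(names)
print(numbers)
```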

property groups
show_groups(tablefmt: str = 'simple') -> None

Show groups neatly.

Parameters

tablefmt (str, optional) – Table format. Choose from ["plain", "simple", "grid", "pipe", "orgtbl", "rst", "mediawiki", "latex", "latex_raw", "latex_booktabs", "latex_longtable", "tsv"]. Defaults to "simple".

Examples

>>> from teilab.datasets import Samples
>>> from teilab.utils._path import SAMPLE_LIST_PATH
>>> samples = Samples(sample_list_path=SAMPLE_LIST_PATH)
>>> samples.show_groups()
  idx    gn  GroupName                              FileName
-----  ----  -------------------------------------  ---------------------------------------------------
    0     0  SG19378659_257236339458_S001_GE1_1200  SG19378659_257236339458_S001_GE1_1200_Jun14_1_1.txt
    1     0  SG19378659_257236339458_S001_GE1_1200  SG19378659_257236339458_S001_GE1_1200_Jun14_1_2.txt
    :     :                   :                                            :
   11     1  US91503671_253949442637_S01_GE1_105    US91503671_253949442637_S01_GE1_105_Dec08_1_4.txt
   12     1  US91503671_253949442637_S01_GE1_105    US91503671_253949442637_S01_GE1_105_Dec08_2_2.txt
get_group_numbers(group_no: Optional[int] = None, group_name: Optional[str] = None) -> List[int]

Get the list of sample indices that belong to the specified group.

Parameters
  • group_no (Optional[int], optional) – Number of the target group. Defaults to None.

  • group_name (Optional[str], optional) – Name of the target group. Defaults to None.

Returns

Indices of the samples in the specified group.

Return type

List[int]

Examples

>>> from teilab.datasets import Samples
>>> from teilab.utils._path import SAMPLE_LIST_PATH
>>> samples = Samples(sample_list_path=SAMPLE_LIST_PATH)
>>> samples._group_names
['SG19378659_257236339458_S001_GE1_1200', 'US91503671_253949442637_S01_GE1_105']
>>> samples._group_numbers
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
>>> samples.get_group_numbers(0)
[0, 1, 2, 3, 4]
>>> samples.get_group_numbers(group_name="US91503671_253949442637_S01_GE1_105")
[5, 6, 7, 8, 9, 10, 11, 12]
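The lookup shown above can be reproduced with NumPy on the same group arrays. This is a sketch of an equivalent computation, not the method's actual source; sample_indices is a hypothetical stand-in for get_group_numbers.

```python
import numpy as np

# Group arrays copied from the example above.
group_names = np.array([
    "SG19378659_257236339458_S001_GE1_1200",
    "US91503671_253949442637_S01_GE1_105",
])
group_numbers = np.array([0] * 5 + [1] * 8, dtype=np.uint8)


def sample_indices(group_no=None, group_name=None):
    """Return the indices of the samples in a group given by number or by name."""
    if group_no is None:
        if group_name is None:
            raise ValueError("Specify either group_no or group_name.")
        matches = np.flatnonzero(group_names == group_name)
        if len(matches) == 0:
            raise KeyError(group_name)
        group_no = int(matches[0])
    # flatnonzero returns the positions where the condition holds.
    return np.flatnonzero(group_numbers == group_no).tolist()


print(sample_indices(0))
print(sample_indices(group_name="US91503671_253949442637_S01_GE1_105"))
```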
class teilab.datasets.TeiLabDataSets(verbose: bool = True)

Bases: object

Utility Datasets Class for this lecture.

Parameters

verbose (bool, optional) – Whether to print verbose output. Defaults to True.

verbose

Whether to print verbose output. Defaults to True.

Type

bool

print

Print function.

Type

callable

sample

Dataset samples.

Type

Samples

root

Root directory for datasets (DATA_DIR).

Type

Path

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
There are not enough datasets. Use ``.get_data`` to prepare all the required datasets.
>>> datasets.satisfied
False
>>> datasets.get_data(password="PASSWORD1")
>>> datasets.get_data(password="PASSWORD2")
>>> datasets.satisfied
True
TARGET_GeneName: str = 'VIM'

TARGET_GeneName (str) GeneName of the target RNA (vimentin)

TARGET_SystematicName: str = 'NM_003380'
ANNO_COLNAMES: List[str] = ['FeatureNum', 'ProbeName', 'SystematicName']

ANNO_COLNAMES (List[str]) Column names for annotation.

TARGET_COLNAME: str = 'gProcessedSignal'

TARGET_COLNAME (str) Column name for expression data.
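These class attributes are presumably meant to be combined when looking up the target gene in a feature table. The sketch below shows one such use on a made-up DataFrame. The constants are copied from above, while the probe names, the second gene and all signal values are placeholders.

```python
import pandas as pd

# Constants copied from the class attributes above.
TARGET_GeneName = "VIM"
ANNO_COLNAMES = ["FeatureNum", "ProbeName", "SystematicName"]
TARGET_COLNAME = "gProcessedSignal"

# Made-up feature table; probe names and signal values are placeholders.
df = pd.DataFrame({
    "FeatureNum": [1, 2, 3],
    "ProbeName": ["probe_a", "probe_b", "probe_c"],
    "GeneName": ["VIM", "GENE_X", "VIM"],
    "SystematicName": ["NM_003380", "SYS_X", "NM_003380"],
    "gProcessedSignal": [1500.0, 20.0, 1480.0],
})

# Keep only the target gene's probes, with annotation and expression columns.
df_target = df.loc[df["GeneName"] == TARGET_GeneName, ANNO_COLNAMES + [TARGET_COLNAME]]
print(df_target)
```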

init()

Initialization

get_data(password: str) -> str

Download the data required for this lecture.

Parameters

password (str) – Password (required because some of the data are unpublished).

Returns

The path to the downloaded file.

Return type

str

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets()
>>> path = datasets.get_data(password="PASSWORD")
Try to get data from <SECRET_URL>
This is our unpublished data, so please treat it confidential.
[Download] URL: <SECRET_URL>
* Content-Encoding : None
* Content-Length   : 45.9 [MB]
* Content-Type     : application/zip
* Save Destination : PATH/TO/PASSWORD.zip
===== Progress =====
<SECRET FILENAME>   100.0%[####################] 45.3[s] 1.0[MB/s]  eta -0.0[s]
Save data at PATH/TO/PASSWORD.zip
[Unzip] Show file contents:
    * <SECRET_FILE_1>
    * <SECRET_FILE_2>
    * :
    * <SECRET_FILE_N>
>>> path
'PATH/TO/PASSWORD.zip'

Below is the code for the GAS (Google Apps Script) API server.

// Script properties hold the name of the sheet that maps passwords to data.
const P = PropertiesService.getScriptProperties();
const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName(P.getProperty("sheetname"));
// Columns A to C from row 2 down: password, data URL, message.
const values = sheet.getRange("A2:C").getValues();

// Build a lookup table: password -> [dataURL, message].
const Password2dataURL = {};
for (let i = 0; i < values.length; i++) {
  Password2dataURL[values[i][0]] = values[i].slice(1);
}

// Handle POST requests and reply with a JSON payload.
function doPost(e) {
  const password = e.parameter.password;
  const response = {
    message  : "Invalid Password",
    dataURL  : "",
    password : password
  };

  // Check own keys only, so inherited names such as "toString" never match.
  if (Object.prototype.hasOwnProperty.call(Password2dataURL, password)) {
    const data = Password2dataURL[password];
    response.dataURL = data[0];
    response.message = data[1];
  }

  const output = ContentService.createTextOutput();
  output.setMimeType(ContentService.MimeType.JSON);
  output.setContent(JSON.stringify(response));
  return output;
}

You can also request the data from the command line, for example with curl.

$ curl -L <GAS_WEBAPP_URL> \
       -d password=<PASSWORD> \
       -H "Content-Type: application/x-www-form-urlencoded"
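The same request could also be issued from Python with the standard library. The sketch below is an assumption about how a client might talk to the server above, not the code teilab uses. The URL is a placeholder, build_request, parse_response and fetch_data_url are hypothetical helpers, and the network call in fetch_data_url is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(webapp_url: str, password: str) -> Request:
    """Build a form-encoded POST request like the curl command above."""
    body = urlencode({"password": password}).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(webapp_url, data=body, headers=headers, method="POST")


def parse_response(text: str) -> str:
    """Extract dataURL from the server's JSON reply, or raise with its message."""
    payload = json.loads(text)
    if not payload.get("dataURL"):
        raise ValueError(payload.get("message", "No data URL returned"))
    return payload["dataURL"]


def fetch_data_url(webapp_url: str, password: str) -> str:
    """Send the request and return the data URL (performs a real network call)."""
    with urlopen(build_request(webapp_url, password)) as resp:
        return parse_response(resp.read().decode("utf-8"))


# Offline demonstration with a placeholder URL and a hand-written reply.
req = build_request("https://example.invalid/exec", "PASSWORD")
print(req.get_method(), req.data)
reply = '{"message": "ok", "dataURL": "https://example.invalid/data.zip", "password": "PASSWORD"}'
print(parse_response(reply))
```

The reply fields (message, dataURL, password) mirror the response object built in doPost. An unknown password comes back with an empty dataURL, which parse_response turns into an exception.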
get_filePaths() -> nptyping.types._ndarray.NDArray[Any, nptyping.types._object.Object]

Get the list of paths to the files used in the lecture.

Returns

The list of paths to the dataset files.

Return type

NDArray[Path]

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets()
>>> filelists = datasets.get_filePaths()
>>> len(filelists)
13
>>> filelists[0].name
'US91503671_253949442637_S01_GE1_105_Dec08_1_1.txt'
property filePaths: nptyping.types._ndarray.NDArray[Any, nptyping.types._object.Object]

The list of paths to the dataset files.

property satisfied: bool

Whether all the necessary data have been obtained.

read(no: Union[int, str, List[int]], sep: Optional[str] = '\t', header: Union[int, List[int]] = 'infer', nrows: Optional[int] = None, usecols: Optional[Union[List[str], callable]] = None, **kwargs) -> Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]]

Read sample(s) data as pd.DataFrame

Parameters
  • no (Union[int,str,List[int]]) – Target sample number(s) or "all" .

  • sep (Optional[str], optional) – Delimiter to use. Defaults to "\t"

  • header (Union[int,List[int]], optional) – Row number(s) to use as the column names, and the start of the data. Defaults to "infer".

  • nrows (Optional[int], optional) – Number of rows of file to read. Useful for reading pieces of large files. Defaults to None.

  • usecols (Optional[Union[List[str],callable]], optional) – Return a subset of the columns. Defaults to None.

  • **kwargs (dict) – Other keyword arguments for pd.read_csv .

Raises

TypeError – When argument no is an instance of unexpected type or is an unexpected value.

Returns

DataFrame of the specified sample(s).

Return type

Union[pd.DataFrame, List[pd.DataFrame]]

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read(no="all", header=9)
>>> len(dfs)
13
>>> type(dfs[0])
pandas.core.frame.DataFrame
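Since the extra keyword arguments are passed on to pd.read_csv, the effect of header and nrows can be tried on a small in-memory file. The layout below is an assumption made for illustration: three tab-separated blocks (FEPARAMS, STATS, FEATURES) separated by "*" lines, with the FEATURES header on line 9 as in the example above. The protocol name, statistic names and values are placeholders.

```python
import io

import pandas as pd

# Synthetic file; the block layout is an assumption, and the values are placeholders.
text = "\n".join([
    "TYPE\ttext\tinteger",
    "FEPARAMS\tProtocol_Name\tBGSubtractor_BGSubMethod",
    "DATA\tplaceholder_protocol\t7",
    "*",
    "TYPE\tfloat\tfloat",
    "STATS\tplaceholder_stat_a\tplaceholder_stat_b",
    "DATA\t1.0\t2.0",
    "*",
    "TYPE\tinteger\tfloat",
    "FEATURES\tFeatureNum\tgProcessedSignal",
    "DATA\t1\t12.5",
    "DATA\t2\t8.0",
])

# header=1 with nrows=1: line 1 is the header and only the next row is read.
df_meta = pd.read_csv(io.StringIO(text), sep="\t", header=1, nrows=1)
# header=9: line 9 is the header and every following line is a data row.
df_data = pd.read_csv(io.StringIO(text), sep="\t", header=9)

print(df_meta["BGSubtractor_BGSubMethod"].values[0])
print(df_data.columns.tolist(), len(df_data))
```

Whether read_meta and read_summary use exactly these header values is not stated in this documentation. The sketch only shows how a single file can yield differently shaped DataFrames depending on header and nrows.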
read_data(no: Union[int, str, List[int]], **kwargs) -> Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]]

Read sample(s) ‘expression’ data as pd.DataFrame

Parameters

no (Union[int,str,List[int]]) – Target sample number(s) or "all" .

Returns

DataFrame of the specified sample(s) ‘expression’ data.

Return type

Union[pd.DataFrame, List[pd.DataFrame]]

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read_data(no=[0,1,2])
>>> len(dfs)
3
>>> len(dfs[0])
62976
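A list of per-sample DataFrames like the one returned above is often combined into a single feature-by-sample matrix. The following is a minimal sketch on made-up data. The column labels sample_0 and sample_1 are hypothetical, and it assumes that FeatureNum identifies the same feature in every sample.

```python
import pandas as pd

# Two made-up "expression" DataFrames standing in for the output of read_data.
dfs = [
    pd.DataFrame({"FeatureNum": [1, 2, 3], "gProcessedSignal": [10.0, 20.0, 30.0]}),
    pd.DataFrame({"FeatureNum": [1, 2, 3], "gProcessedSignal": [11.0, 19.0, 33.0]}),
]

# One column per sample, aligned on FeatureNum.
matrix = pd.concat(
    [
        df.set_index("FeatureNum")["gProcessedSignal"].rename(f"sample_{i}")
        for i, df in enumerate(dfs)
    ],
    axis=1,
)
print(matrix)
```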
read_meta(no: Union[int, str, List[int]], **kwargs) -> Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]]

Read sample(s) ‘meta’ data as pd.DataFrame

Parameters

no (Union[int,str,List[int]]) – Target sample number(s) or "all" .

Returns

DataFrame of the specified sample(s) ‘meta’ data.

Return type

Union[pd.DataFrame, List[pd.DataFrame]]

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read_meta(no="all")
>>> len(dfs)
13
>>> len(dfs[0])
1
read_summary(no: Union[int, str, List[int]], **kwargs) -> Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]]

Read sample(s) ‘summary’ data as pd.DataFrame

Parameters

no (Union[int,str,List[int]]) – Target sample number(s) or "all" .

Returns

DataFrame of the specified sample(s) ‘summary’ data.

Return type

Union[pd.DataFrame, List[pd.DataFrame]]

Examples

>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read_summary(no=[0,3,8,9])
>>> len(dfs)
4
>>> len(dfs[0])
1
static reliable_filter(df: pandas.core.frame.DataFrame, name: Optional[str] = None) -> pandas.core.frame.DataFrame

Create a DataFrame that indicates whether each row of data is reliable.

Parameters
  • df (pd.DataFrame) – Input dataframe.

  • name (Optional[str], optional) – The column name. Defaults to None.

Returns

Filter DataFrame that indicates whether each row of data is reliable.

Return type

pd.DataFrame

Examples

>>> import pandas as pd
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> df_sg = datasets.read_data(0)
>>> len(df_sg), datasets.reliable_filter(df_sg).sum().values[0]
(62976, 30385)
>>> df_us = datasets.read_data(-1)
>>> len(df_us), datasets.reliable_filter(df_us).sum().values[0]
(62976, 23434)
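The criteria that reliable_filter applies are not listed in this documentation. Purely as an illustration of how the flag columns described earlier could be combined into such a filter, the sketch below uses hypothetical criteria: non-control features (assuming ControlType 0 marks them), positive and significant, well above background, not saturated, and not an outlier. reliable_filter_sketch is not the library's implementation.

```python
from typing import Optional

import pandas as pd


def reliable_filter_sketch(df: pd.DataFrame, name: Optional[str] = None) -> pd.DataFrame:
    """Return a one-column boolean DataFrame; the criteria here are hypothetical."""
    reliable = (
        (df["ControlType"] == 0)                  # assumed to mark non-control features
        & df["gIsPosAndSignif"].astype(bool)      # signal significantly above background
        & df["gIsWellAboveBG"].astype(bool)       # and well above background
        & ~df["gIsSaturated"].astype(bool)        # not saturated
        & ~df["gIsFeatNonUnifOL"].astype(bool)    # not a non-uniformity outlier
        & ~df["gIsFeatPopnOL"].astype(bool)       # not a population outlier
    )
    return reliable.to_frame(name="reliable" if name is None else name)


# Made-up flags for four features.
df = pd.DataFrame({
    "ControlType": [0, 0, 1, 0],
    "gIsPosAndSignif": [1, 1, 1, 0],
    "gIsWellAboveBG": [1, 1, 1, 0],
    "gIsSaturated": [0, 1, 0, 0],
    "gIsFeatNonUnifOL": [0, 0, 0, 0],
    "gIsFeatPopnOL": [0, 0, 0, 0],
})
mask = reliable_filter_sketch(df)
print(len(df), mask.sum().values[0])
```

Returning a one-column DataFrame keeps the `.sum().values[0]` idiom from the example above usable for counting reliable rows.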