teilab.datasets module¶
This module handles the datasets used in the lecture. The table below describes the meaning of the values in each column of the data used in the lecture. If you want to skip the table and refer to each class or method directly, click HERE.
Reference: Agilent Feature Extraction 12.0 Reference Guide
FEATURES¶
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=True)
>>> df_data = datasets.read_data(no=0)
>>> df_data.columns
Index(['FEATURES', 'FeatureNum', 'Row', 'Col', 'accessions', ... ], dtype='object')
G2505C | G2600D | Features | Types | Description
---|---|---|---|---
1 | 1 | FeatureNum | int | Feature number.
2 | 2 | Row | int | Feature location: row.
3 | 3 | Col | int | Feature location: column.
4 | | accessions | str | Gene accession numbers.
5 | | | str | Chromosome coordinates of the feature.
6 | 4 | | int | Numeric code defining the subtype of any control feature.
7 | | | int | Name of the subtype of any control feature.
8 | | | int | Indicates the place in the transcript where the probe sequence starts.
9 | | | str | The sequence of bases printed on the array.
10 | | | int | Unique integer for each unique probe in a design.
11 | 5 | | int | Feature control type.
12 | 6 | ProbeName | str | An Agilent-assigned identifier for the probe synthesized on the microarray.
13 | | GeneName | str | An identifier for the gene for which the probe provides expression information. The target sequence identified by the systematic name is normally a representative or consensus sequence for the gene.
14 | 7 | SystematicName | str | An identifier for the target sequence that the probe was designed to hybridize with. Where possible, a public database identifier is used (e.g., TAIR locus identifier for Arabidopsis).
15 | | | str | Description of gene.
16 | 8 | | float | Found coordinates (X) of the feature centroid in microns.
17 | 9 | | float | Found coordinates (Y) of the feature centroid in microns.
18 | | | float | 
19 | | | bool | A boolean used to flag found features. The flag is applied independently in each channel.
20 | 10 | gProcessedSignal | float | The signal left after all the Feature Extraction processing steps have been completed.
21 | 11 | | float | The universal or propagated error left after all the processing steps of Feature Extraction have been completed.
22 | | | int | Number of outlier pixels per feature with intensity > upper threshold set via the pixel outlier rejection method. The number is computed independently in each channel. These pixels are omitted from all subsequent calculations.
23 | | | int | Number of outlier pixels per feature with intensity < lower threshold set via the pixel outlier rejection method. The number is computed independently in each channel. These pixels are omitted from all subsequent calculations. NOTE: The pixel outlier method is the ONLY step that removes data in Feature Extraction.
24 | | | int | Total number of pixels used to compute feature statistics; i.e., total number of inlier pixels per spot; same in both channels.
25 | | gMeanSignal | float | Raw mean signal of feature from inlier pixels in green and/or red channel.
26 | 12 | | float | Raw median signal of feature from inlier pixels in green and/or red channel.
27 | | | float | Standard deviation of all inlier pixels per feature; this is computed independently in each channel.
28 | | | float | The normalized inter-quartile range of all of the inlier pixels per feature. The range is computed independently in each channel.
29 | | | int | Total number of pixels used to compute local BG statistics per spot; i.e., total number of BG inlier pixels; same in both channels.
30 | 25 | | float | Mean local background signal (local to corresponding feature) computed per channel (inlier pixels).
31 | 13 | | float | Median local background signal (local to corresponding feature) computed per channel (inlier pixels).
32 | 14 | gBGPixSDev | float | Standard deviation of all inlier pixels per local BG of each feature, computed independently in each channel.
33 | | | float | The normalized inter-quartile range of all of the inlier pixels per local BG of each feature. The range is computed independently in each channel.
34 | | | int | Total number of saturated pixels per feature, computed per channel.
35 | 15 | | bool | Boolean flag indicating if a feature is saturated or not. A feature is saturated if 50% of the pixels in a feature are above the saturation threshold.
36 | 16 | | bool | Boolean flag indicating if a feature is a NonUniformity Outlier or not. A feature is non-uniform if the pixel noise of the feature exceeds a threshold established for a “uniform” feature.
37 | 17 | | bool | The same concept as above but for background.
38 | 18 | | bool | Boolean flag indicating if a feature is a Population Outlier or not. Probes with replicate features on a microarray are examined using population statistics. A feature is a population outlier if its signal is less than a lower threshold or exceeds an upper threshold determined using a multiplier (1.42) times the interquartile range (i.e., IQR) of the population.
39 | 19 | | bool | The same concept as above but for background.
40 | 20 | | bool | Boolean to flag features for downstream filtering in third-party gene expression software.
41 | 21 | gBGSubSignal | float | Background-subtracted signal. To display the values used to calculate this variable using different background signals and settings of spatial detrend and global background adjust, see Table 34 on page 254.
42 | | | float | Propagated standard error as computed on the net background-subtracted signal. For one color, the error model is applied to the background-subtracted signal. This will contain the larger of the universal (UEM) error or the propagated error.
43 | 22 | | bool | Boolean flag, established via a 2-sided t-test, indicating if the mean signal of a feature is greater than the corresponding background (selected by user) and if this difference is significant.
44 | | | float | pValue from t-test of significance between g(r)Mean signal and g(r) background (selected by user).
45 | | | int | Number of local background regions or features used to calculate the background used for background subtraction on this feature.
46 | 23 | | bool | Boolean flag indicating if a feature is WellAbove Background or not.
47 | | gBGUsed | float | Background used to subtract from the MeanSignal; variable also used in t-test. To display the values used to calculate this variable using different background signals and settings of spatial detrend and global background adjust, see Table 34 on page 254.
48 | | gBGSDUsed | float | Standard deviation of background used in g(r) channel; variable also used in t-test and surrogate algorithms. To display the values used to calculate this variable using different background signals and settings of spatial detrend and global background adjust, see Table 34 on page 254.
49 | | | bool | Indicates the error model that you chose for Feature Extraction or that the software uses if you have chosen the “Most Conservative” option.
50 | | | bool | Set to true for a given feature if it is part of the filtered set used to detrend the background. This feature is considered part of the locally weighted lowest x% of features as defined by the DetrendLowPassPercentage.
51 | | gSpatialDetrendSurfaceValue | float | Value of the smoothed surface calculated by the Spatial detrend algorithm.
52 | 24 | | float | Diameter of the spot (X-axis).
53 | | | float | Diameter of the spot (Y-axis).
54 | | | float | 
55 | | | float | A surface is fitted through the log of the background-subtracted signal to look for multiplicative gradients. A normalized version of that surface, interpolated at each point of the microarray, is stored in this column.
56 | | | float | Indicates the Background signal that was selected to be used (Mean or Median).
57 | | | float | Indicates the Background error that was selected to be used (PixSD or NormIQR).
58 | | | bool | A Boolean used to flag features used for computation of global BG offset.
59 | | | float | Value at the polynomial fit of the negative controls.
60 | | | bool | Set to true for a given feature if its signal intensity is in the negative control range.
61 | | | bool | Indicates whether this feature was included in the set used to generate the multiplicative detrend surface.
Background Subtraction¶
The feature background-subtracted signal, BGSubSignal, is calculated by subtracting a value called the BGUsed from the feature mean signal:

BGSubSignal = MeanSignal - BGUsed

where BGSubSignal and BGUsed depend on the type of background method and the settings for spatial detrend and global background adjust. See the following table.
Background Subtraction Method | Background Subtraction Variable | SpDe OFF / GBA OFF | SpDe ON / GBA OFF | SpDe OFF / GBA ON | SpDe ON / GBA ON
---|---|---|---|---|---
No background subtract | BGSubSignal | MeanSignal | MeanSignal - SDSV | | 
 | BGUsed | 0 | SDSV | | 
Local Background | BGSubSignal | | | | 
 | BGUsed | | | | 
Global Background method | BGSubSignal | | | | 
 | BGUsed | | | | 
SpDe: Spatial Detrend
SDSV: SpatialDetrendSurfaceValue
GBA: Global Bkgnd Adjust
GBGISD: GlobalBGInlierSDev
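For the configuration used in this lecture (no background-subtraction method, spatial detrend on, global background adjust off), the relationship between these quantities can be sketched with plain numpy; the array values below are made up for illustration:

```python
import numpy as np

# Synthetic per-feature values: raw mean signal and the spatial-detrend
# surface value (SDSV) interpolated at each feature (made-up numbers).
mean_signal = np.array([120.0, 80.0, 250.0])
sdsv = np.array([20.0, 15.0, 30.0])

# With SpDe ON / GBA OFF, the surface value itself is the background used,
# so BGSubSignal = MeanSignal - SDSV.
bg_used = sdsv
bg_sub_signal = mean_signal - bg_used
print(bg_sub_signal)  # → [100.  65. 220.]
```

The doctests below check the same identity on the real data.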
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> df_meta = datasets.read_meta(no=-1)
>>> df_meta["BGSubtractor_BGSubMethod"].values[0]
7
The BGSubMethod value of 7 corresponds to the “No Background Subtraction” method (see Table 17 on page 129).
Global Background Adjustment is turned Off.
Spatial Detrending is turned On.
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> df_data = datasets.read_data(no=-1)
>>> all(df_data["gBGUsed"] == df_data["gSpatialDetrendSurfaceValue"])
True
>>> all(abs((df_data["gBGSDUsed"]-df_data["gBGPixSDev"])/df_data["gBGSDUsed"])<1e-3)
True
>>> all(abs((df_data["gBGSubSignal"]-(df_data["gMeanSignal"]-df_data["gBGUsed"]))/df_data["gBGSubSignal"]) < 1.4e-1)
True
- class teilab.datasets.Samples(sample_list_path: str)[source]¶
Bases:
object
Utility Sample Class for this lecture.
- _df¶
Sample information described in text file at
sample_list_path
.- Type
pd.DataFrame
- SampleNumber¶
Index numbers for each sample.
- Type
NDArray[Any,str]
- FileName¶
File names for each sample.
- Type
NDArray[Any,str]
- Condition¶
Experimental conditions for each sample.
- Type
NDArray[Any,str]
- _group_names¶
i-th
group’s filename prefix.- Type
NDArray[Any,str]
- _group_numbers¶
Which group (
no
) the j-th
sample belongs to.- Type
NDArray[Any,np.uint8]
Examples
>>> from teilab.datasets import Samples
>>> from teilab.utils._path import SAMPLE_LIST_PATH
>>> samples = Samples(sample_list_path=SAMPLE_LIST_PATH)
>>> print(sorted(samples.__dict__.keys()))
['Condition', 'FileName', 'SampleNumber', '_df', '_group_names', '_group_numbers']
- property groups¶
- show_groups(tablefmt: str = 'simple') None [source]¶
Show groups neatly.
- Parameters
tablefmt (str, optional) – Table format. Please choose from ["plain", "simple", "grid", "pipe", "orgtbl", "rst", "mediawiki", "latex", "latex_raw", "latex_booktabs", "latex_longtable", "tsv"]. Defaults to "simple".
Examples
>>> from teilab.datasets import Samples
>>> from teilab.utils._path import SAMPLE_LIST_PATH
>>> samples = Samples(sample_list_path=SAMPLE_LIST_PATH)
>>> samples.show_groups()
  idx    gn  GroupName                              FileName
-----  ----  -------------------------------------  ---------------------------------------------------
    0     0  SG19378659_257236339458_S001_GE1_1200  SG19378659_257236339458_S001_GE1_1200_Jun14_1_1.txt
    1     0  SG19378659_257236339458_S001_GE1_1200  SG19378659_257236339458_S001_GE1_1200_Jun14_1_2.txt
    :     :  :                                      :
   11     1  US91503671_253949442637_S01_GE1_105    US91503671_253949442637_S01_GE1_105_Dec08_1_4.txt
   12     1  US91503671_253949442637_S01_GE1_105    US91503671_253949442637_S01_GE1_105_Dec08_2_2.txt
- get_group_numbers(group_no: Optional[int] = None, group_name: Optional[str] = None) List[int] [source]¶
Get the specified group index List.
- Parameters
group_no (Optional[int], optional) – Target group number. Defaults to None.
group_name (Optional[str], optional) – Target group name. Defaults to None.
- Returns
Index numbers of the samples that belong to the specified group.
- Return type
List[int]
Examples
>>> from teilab.datasets import Samples
>>> from teilab.utils._path import SAMPLE_LIST_PATH
>>> samples = Samples(sample_list_path=SAMPLE_LIST_PATH)
>>> samples._group_names
['SG19378659_257236339458_S001_GE1_1200', 'US91503671_253949442637_S01_GE1_105']
>>> samples._group_numbers
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
>>> samples.get_group_numbers(0)
[0, 1, 2, 3, 4]
>>> samples.get_group_numbers(group_name="US91503671_253949442637_S01_GE1_105")
[5, 6, 7, 8, 9, 10, 11, 12]
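The behaviour above can be sketched with plain list operations; the function below is a hypothetical re-implementation for illustration (the group names are stand-ins), not the library's actual code:

```python
from typing import List, Optional

def get_group_numbers_sketch(
    group_numbers: List[int],
    group_names: List[str],
    group_no: Optional[int] = None,
    group_name: Optional[str] = None,
) -> List[int]:
    # Resolve a group name to its group number if one was given.
    if group_name is not None:
        group_no = group_names.index(group_name)
    # Collect the indices of all samples that belong to that group.
    return [i for i, no in enumerate(group_numbers) if no == group_no]

names = ["groupA", "groupB"]  # stand-ins for the real filename prefixes
numbers = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(get_group_numbers_sketch(numbers, names, group_no=0))           # → [0, 1, 2, 3, 4]
print(get_group_numbers_sketch(numbers, names, group_name="groupB"))  # → [5, 6, 7, 8, 9, 10, 11, 12]
```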
- class teilab.datasets.TeiLabDataSets(verbose: bool = True)[source]¶
Bases:
object
Utility Datasets Class for this lecture.
- Parameters
verbose (bool, optional) – Whether to print verbose output or not. Defaults to True.
- verbose¶
Whether to print verbose output or not. Defaults to True.- Type
bool
- print¶
Print function.
- Type
callable
- root¶
Root Directory for Datasets. (
DATA_DIR
)- Type
Path
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
There are not enough datasets. Use ``.get_data`` to prepare all the required datasets.
>>> datasets.satisfied
False
>>> datasets.get_data(password="PASSWORD1")
>>> datasets.get_data(password="PASSWORD2")
>>> datasets.satisfied
True
- TARGET_GeneName: str = 'VIM'¶
TARGET_GeneName (str)
GeneName
of the target RNA (vimentin)
- TARGET_SystematicName: str = 'NM_003380'¶
TARGET_SystematicName (str)
SystematicName
of the target RNA (vimentin)
- ANNO_COLNAMES: List[str] = ['FeatureNum', 'ProbeName', 'SystematicName']¶
ANNO_COLNAMES (List[str]) Column names for annotation.
- TARGET_COLNAME: str = 'gProcessedSignal'¶
TARGET_COLNAME (str) Column name for expression data.
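Constants like ANNO_COLNAMES and TARGET_COLNAME are the kind of thing you would pass to pd.read_csv via usecols to load only the annotation and expression columns. A minimal illustration on a synthetic tab-separated file (the probe names and the second systematic name are made up; real sample files carry many more columns and a multi-line preamble):

```python
import io
import pandas as pd

ANNO_COLNAMES = ["FeatureNum", "ProbeName", "SystematicName"]
TARGET_COLNAME = "gProcessedSignal"

# A tiny stand-in for one sample file (identifiers are illustrative).
text = (
    "FeatureNum\tProbeName\tSystematicName\tgProcessedSignal\tgIsSaturated\n"
    "1\tA_23_P0001\tNM_003380\t1234.5\t0\n"
    "2\tA_23_P0002\tNM_000001\t56.7\t0\n"
)
df = pd.read_csv(io.StringIO(text), sep="\t", usecols=ANNO_COLNAMES + [TARGET_COLNAME])
print(df.columns.tolist())  # → ['FeatureNum', 'ProbeName', 'SystematicName', 'gProcessedSignal']
```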
- get_data(password: str) str [source]¶
Get data which is necessary for this lecture.
- Parameters
password (str) – Password. (Because some data are unpublished.)
- Returns
The path to the downloaded file.
- Return type
str
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets()
>>> path = datasets.get_data(password="PASSWORD")
Try to get data from <SECRET_URL>
This is our unpublished data, so please treat it confidential.
[Download]
URL: <SECRET_URL>
* Content-Encoding : None
* Content-Length   : 45.9 [MB]
* Content-Type     : application/zip
* Save Destination : PATH/TO/PASSWORD.zip
===== Progress =====
<SECRET FILENAME>	100.0%[####################] 45.3[s] 1.0[MB/s] eta -0.0[s]
Save data at PATH/TO/PASSWORD.zip
[Unzip] Show file contents:
* <SECRET_FILE_1>
* <SECRET_FILE_2>
* :
* <SECRET_FILE_N>
>>> path
'PATH/TO/PASSWORD.zip'
Below is the code for the GAS (Google Apps Script) API server.
const P = PropertiesService.getScriptProperties();
const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName(P.getProperty("sheetname"));
const values = sheet.getRange("A2:C").getValues();
var Password2dataURL = {};
for (let i=0; i<values.length; i++){
  Password2dataURL[values[i][0]] = values[i].slice(1);
}
function doPost(e) {
  var password = e.parameter.password;
  var response = {
    message  : "Invalid Password",
    dataURL  : "",
    password : password
  };
  if (password in Password2dataURL){
    let data = Password2dataURL[password];
    response.dataURL = data[0];
    response.message = data[1];
  }
  var output = ContentService.createTextOutput();
  output.setMimeType(ContentService.MimeType.JSON);
  output.setContent(JSON.stringify(response));
  return output;
}
You can also get the data with a command-line tool like curl:

$ curl -L <GAS_WEBAPP_URL> \
       -d password=<PASSWORD> \
       -H "Content-Type: application/x-www-form-urlencoded"
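The same POST request can be issued from Python using only the standard library. A sketch, where <GAS_WEBAPP_URL> and the password are placeholders exactly as in the curl command above:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def fetch_data_url(webapp_url: str, password: str) -> dict:
    # Build the same form-encoded body that the curl command sends.
    body = urlencode({"password": password}).encode("utf-8")
    req = Request(
        webapp_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    # The GAS endpoint answers with the JSON built in doPost().
    with urlopen(req) as res:
        return json.loads(res.read().decode("utf-8"))

# Usage (placeholders, not a real endpoint):
# response = fetch_data_url("<GAS_WEBAPP_URL>", password="<PASSWORD>")
# response["dataURL"] then points to the data file, or "" for a bad password.
```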
- get_filePaths() nptyping.types._ndarray.NDArray[Any, nptyping.types._object.Object] [source]¶
Get the path list of files used in the lecture.
- Returns
The path lists for datasets.
- Return type
NDArray[Path]
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets()
>>> filelists = datasets.get_filePaths()
>>> len(filelists)
13
>>> filelists[0].name
'US91503671_253949442637_S01_GE1_105_Dec08_1_1.txt'
- property filePaths: nptyping.types._ndarray.NDArray[Any, nptyping.types._object.Object]¶
The path lists for datasets.
- property satisfied: bool¶
Whether to get all necessary data or not.
- read(no: Union[int, str, List[int]], sep: Optional[str] = '\t', header: Union[int, List[int]] = 'infer', nrows: Optional[int] = None, usecols: Optional[Union[List[str], callable]] = None, **kwargs) Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]] [source]¶
Read sample(s) data as
pd.DataFrame
- Parameters
no (Union[int,str,List[int]]) – Target sample number(s) or
"all"
.sep (Optional[str], optional) – Delimiter to use. Defaults to
"\t"
header (Union[int,List[int]], optional) – Row number(s) to use as the column names, and the start of the data. Defaults to
"infer"
.nrows (Optional[int], optional) – Number of rows of file to read. Useful for reading pieces of large files. Defaults to
None
.usecols (Optional[Union[List[str],callable]], optional) – Return a subset of the columns. Defaults to
None
.**kwargs (dict) – Other keyword arguments for
pd.read_csv
.
- Raises
TypeError – When argument
no
is of an unexpected type or has an unexpected value.- Returns
DataFrame of the specified sample(s).
- Return type
Union[pd.DataFrame, List[pd.DataFrame]]
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read(no="all", header=9)
>>> len(dfs)
13
>>> type(dfs[0])
pandas.core.frame.DataFrame
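The header=9 in the example above reflects the layout of the Agilent text files, where several metadata lines precede the actual feature table. A synthetic miniature showing the same idea (the preamble content is made up):

```python
import io
import pandas as pd

# Nine preamble lines (stand-ins for the metadata blocks in real files),
# followed by the feature table whose header sits on line 10.
preamble = "\n".join(f"meta line {i}" for i in range(9))
table = "FeatureNum\tgProcessedSignal\n1\t1234.5\n2\t56.7"
text = preamble + "\n" + table

# header=9: skip the first nine lines and use row 9 (0-indexed) as the header.
df = pd.read_csv(io.StringIO(text), sep="\t", header=9)
print(df.shape)  # → (2, 2)
```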
- read_data(no: Union[int, str, List[int]], **kwargs) Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]] [source]¶
Read sample(s) ‘expression’ data as
pd.DataFrame
- Parameters
no (Union[int,str,List[int]]) – Target sample number(s) or
"all"
.- Returns
DataFrame of the specified sample(s) ‘expression’ data.
- Return type
Union[pd.DataFrame, List[pd.DataFrame]]
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read_data(no=[0,1,2])
>>> len(dfs)
3
>>> len(dfs[0])
62976
- read_meta(no: Union[int, str, List[int]], **kwargs) Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]] [source]¶
Read sample(s) ‘meta’ data as
pd.DataFrame
- Parameters
no (Union[int,str,List[int]]) – Target sample number(s) or
"all"
.- Returns
DataFrame of the specified sample(s) ‘meta’ data.
- Return type
Union[pd.DataFrame, List[pd.DataFrame]]
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read_meta(no="all")
>>> len(dfs)
13
>>> len(dfs[0])
1
- read_summary(no: Union[int, str, List[int]], **kwargs) Union[pandas.core.frame.DataFrame, List[pandas.core.frame.DataFrame]] [source]¶
Read sample(s) ‘summary’ data as
pd.DataFrame
- Parameters
no (Union[int,str,List[int]]) – Target sample number(s) or
"all"
.- Returns
DataFrame of the specified sample(s) ‘summary’ data.
- Return type
Union[pd.DataFrame, List[pd.DataFrame]]
Examples
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> dfs = datasets.read_summary(no=[0,3,8,9])
>>> len(dfs)
4
>>> len(dfs[0])
1
- static reliable_filter(df: pandas.core.frame.DataFrame, name: Optional[str] = None) pandas.core.frame.DataFrame [source]¶
Create a dataframe indicating whether each data point is reliable or not.
- Parameters
df (pd.DataFrame) – Input dataframe.
name (Optional[str], optional) – The column name. Defaults to
None
.
- Returns
Filter DataFrame indicating whether each data point is reliable or not.
- Return type
pd.DataFrame
Examples
>>> import pandas as pd
>>> from teilab.datasets import TeiLabDataSets
>>> datasets = TeiLabDataSets(verbose=False)
>>> df_sg = datasets.read_data(0)
>>> len(df_sg), datasets.reliable_filter(df_sg).sum().values[0]
(62976, 30385)
>>> df_us = datasets.read_data(-1)
>>> len(df_us), datasets.reliable_filter(df_us).sum().values[0]
(62976, 23434)
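The exact criteria behind reliable_filter are internal to the package, but a filter of this kind can be sketched by combining boolean QC flags like the ones described in the FEATURES table. Everything below (the green-channel flag column names and the particular combination of flags) is an assumption for illustration, not the package's actual logic:

```python
import pandas as pd

def sketch_reliable_filter(df: pd.DataFrame, name: str = "Reliable") -> pd.DataFrame:
    # Hypothetical criteria: a feature is "reliable" when it is unsaturated,
    # uniform, and well above background.
    mask = (
        (df["gIsSaturated"] == 0)
        & (df["gIsFeatNonUnifOL"] == 0)
        & (df["gIsWellAboveBG"] == 1)
    )
    return mask.to_frame(name=name)

# Synthetic stand-in for one sample's data.
df = pd.DataFrame({
    "gIsSaturated":     [0, 1, 0, 0],
    "gIsFeatNonUnifOL": [0, 0, 1, 0],
    "gIsWellAboveBG":   [1, 1, 1, 0],
})
print(sketch_reliable_filter(df).sum().values[0])  # → 1
```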