マイクロアレイデータ解析¶

〜マイクロアレイを用いた網羅的遺伝子発現解析〜

担当：程研究室
実習のwiki：http://bit.ly/microarray-2019
論文：Functional dissection of siRNA sequence by systematic DNA substitution: Modified siRNA with a DNA seed arm is a powerful tool for mammalian gene silencing with significantly reduced off-target effect

【Purpose】¶

　 seed match 以外のオフターゲット効果を探る。

すでにマイクロアレイ実習1でオフターゲット効果等を探り、"Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs" 等のモチーフ解析の結果同様、完全にマッチするmRNA（今回はビメンチン）以外にも、seedの2-8の領域とマッチするmRNAについても発現が有意に抑制されることが確認できた。では、mRNAの二次構造（bulgeなど）を考え、それらで抑制されている物を考える。

【Materials】¶

microarray¶

　今回の実験で用いたサンプルは、HeLa細胞に下記のsiRNA（化学合成したもの）を導入して24時間後に回収したものである。なお、"mock"は、核酸を加えずにトランスフェクション操作のみを行ったものである。なお、シード領域を緑色にした。

サンプル	配列（ガイド鎖が上）	seed配列(2-8)と相補的な配列	修飾
mock	なし	なし
RNA	5′-XXUUGAACUCGGUGUUGAUGGCG-3′ 3′-AGAACUUGAGCCACAACUACCXX-5′	5′-GAGTTCA-3′ 5′-GTTGATG-3′	なし
2OMe3	5′-XXUUGAACUCGGUGUUGAUGGCG-3′ 3′-AGAACUUGAGCCACAACUACCXX-5′	同上	赤字部分に2'OMe修飾
2OMe5	5′-XXUUGAACUCGGUGUUGAUGGCG-3′ 3′-AGAACUUGAGCCACAACUACCXX-5′	同上	赤字部分に2'OMe修飾
2OMe7	5′-XXUUGAACUCGGUGUUGAUGGCG-3′ 3′-AGAACUUGAGCCACAACUACCXX-5′	同上	赤字部分に2'OMe修飾
LNA3	5′-XXUUGAACUCGGUGUUGAUGGCG-3′ 3′-AGAACUUGAGCCACAACUACCXX-5′	同上	赤字部分にLNA修飾
LNA7	5′-XXUUGAACUCGGUGUUGAUGGCG-3′ 3′-AGAACUUGAGCCACAACUACCXX-5′	同上	赤字部分にLNA修飾

【Analysis】¶

In [1]:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:

plt.rcParams["font.size"] = 20

In [3]:

USE_COLS1 = [
    'FeatureNum',     # スポットの番号
    'ControlType',    # positive control は 1, negative control は -1 それ以外（解析で使う）は 0
    'ProbeName',      # プローブ名
    'GeneName',       # 遺伝子名
    'SystematicName', # 遺伝子名
]
USE_COLS2 = [
    'gProcessedSignal', # green(Cy-3)のシグナル強度（＝発現量）
    'gIsWellAboveBG',   # シグナルがバックより十分高いか？（＝信頼できるデータか）
]
samples = [
    'mock(1)',
    'RNA',
    '2OMe3',
    '2OMe5',
    'mock(5)',
    '2OMe7',
    'LNA3',
    'LNA7'
]

In [4]:

df = pd.read_excel("result.xlsx")

In [5]:

df.head()

Out[5]:

	FeatureNum	ProbeName	GeneName	SystematicName	mock(1)	RNA	2OMe3	2OMe5	mock(5)	2OMe7	LNA3	LNA7
0	4	A_23_P117082	HEBP1	NM_015987	12566.870	12356.9600	12896.7400	12860.2400	12416.270	13869.810	14213.960	13553.840
1	9	A_21_P0000509	SNAR-G2	NR_024244	122723.700	117958.9000	118623.0000	128495.1000	117447.400	124273.000	113802.500	107326.700
2	10	A_21_P0000744	LOC100506844	NR_038269	1234.146	1281.7390	1128.1320	1199.2970	1251.119	1141.723	1302.718	1264.150
3	11	A_24_P215804	CKLF	NM_016951	994.216	926.0554	859.4556	986.4944	1180.342	1029.824	1071.073	1138.738
4	12	A_23_P110167	MGST2	NM_002413	4842.234	4424.9150	4306.1000	4390.0340	4539.508	4453.469	4922.903	4665.866

In [6]:

num_all_data = len(df)
print(f"num all data: {num_all_data}")

num all data: 21511

Off-target Effect¶

ここから、オフターゲット効果について調べる。

そのため、ここからは修飾を加えていないRNAと、mock（誤差を減らすため平均値を利用）のデータのみに注目する。また、望んだ配列を持つmRNAは seedmatchというサイトで検索して調べる。

In [7]:

df["mock"] = df[["mock(1)", "mock(5)"]].mean(axis=1)
df["log2(RNA/mock)"] = np.log2(df["RNA"]/df["mock"])

　初めに、`mock` の分布を調べる。

In [8]:

# 外れ値があるので、閾値処理を行う。
mocks = df["mock"].values
threshold = sorted(mocks)[int(num_all_data*0.95)]
print(f"threshold: {threshold}")

threshold: 31822.555

In [9]:

fig, ax = plt.subplots(figsize=(12,8))
mocks_thresholded = np.where(mocks>threshold, threshold, mocks)
ax.hist(mocks_thresholded, density=True, bins=300)
ax.set_title("The distribution of `mock` Expression Level.")
ax.set_xlabel("Expression Level"), ax.set_ylabel("frequency")
ax.grid()
plt.show()

上の図形より、対数正規分布に従って発現量が観察されているように見える。

したがって、$\log_2\left(\text{Expression Level}\right)$ は正規分布に従っていると考えられる。（確率変数 $Y$ が正規分布に従うとき、$e^Y$ が従う分布を対数正規分布と言う。）

ここで、正規分布の差分も正規分布に従うので、$\log_2\left(\frac{\text{Expression Level}_{RNA}}{\text{Expression Level}_{mock}}\right)$ も正規分布に従うはずである。実際に確認すると、

In [10]:

fig, ax = plt.subplots(figsize=(12,8))
ge_ratio = df["log2(RNA/mock)"].values
ax.hist(ge_ratio, density=True, bins=300)
ax.set_title("The distribution of the ratio of log Expression Level.")
ax.set_xlabel("$\log_2(RNA / mock)$"), ax.set_ylabel("frequency")
ax.grid()
plt.show()

となる。（本来であれば、「シャピロ–ウィルク検定」や「コルモゴロフ-スミルノフ検定」を行うべきであるがここでは省略する。）

以上より、「指定した配列を持つmRNA」と「それ以外のmRNA」の2群の標本（発現量の比の対数）の平均に有意差があるかどうかをt検定で検定できる。

seedmatch¶

seedmatchで配列を取得して、その配列を持つmRNAとそれ以外のmRNAの発現量を比較するプログラムを作成する。

In [11]:

import re
import requests

def getSignals(df, array, genename_col="SystematicName", signal_colname="log2(RNA/mock)", plot=True, ax=None):
    array = str(array)
    df = df.copy().sort_values(by=signal_colname)
    
    # request data from seedmatch.
    ret     = requests.post('http://atlas.rnai.jp/seedmatch/seedmatch.cgi', data={"seedseq": array})
    df_mRNA = pd.DataFrame(re.findall(pattern=r'\n(.+)\t([0-9]+)', string=ret.text), columns=[genename_col, array])
    df      = pd.merge(df, df_mRNA, on=genename_col, how="left").fillna(0)
    signal_matches = df[df[array]!=0][signal_colname].values
    signal_others  = df[df[array]==0][signal_colname].values
    
    return (signal_matches, signal_others)

In [12]:

from scipy import stats

def t_test(x, y):
    t_statistic, p_value = stats.ttest_ind(x, y, equal_var=False)
    return p_value

In [13]:

def plot_seedmatch_result(signal_matches, signal_others, array="", ax=None):
    if ax is None:            
        fig, ax = plt.subplots(figsize=(12,8))
    num_matches = len(signal_matches)
    num_others  = len(signal_others)
    
    p_value=t_test(signal_matches, signal_others)
    
    ax.plot(signal_matches, [(i+1)/num_matches for i in range(num_matches)], label=f"matches: {num_matches}", color="red")
    ax.plot(signal_others,  [(i+1)/num_others  for i in range(num_others)],  label=f"others : {num_others}", color="blue")
    
    ax.set_title(f"array: {array}\np-value: {p_value:.7f}")
    ax.set_xlabel("$\log_{2}(RNA/mock)$"), ax.set_ylabel("Cumulative frequency")
    ax.vlines(x=0, ymin=0, ymax=1, colors='black', linewidths=2)
    ax.grid(), ax.legend()
    ax.set_xlim(-1,1)
    return ax

In [14]:

def compare_with_match(df, array, ax=None, ret_ax=False):
    signal_matches, signal_others = getSignals(df, array=array)
    ax = plot_seedmatch_result(signal_matches, signal_others, array=array, ax=ax)
    if ret_ax:
        return ax
    else:
        plt.show()

In [15]:

def plot_only_match(df, array, label="", threshold=1, ax=None, ret_ax=False):
    signal_matches, signal_others = getSignals(df, array=array)
    p_value = t_test(signal_matches, signal_others)
    if label == "":
        label = array
    if threshold>=p_value:
        num_matches = len(signal_matches)
        cum_freqs = [(i+1)/num_matches for i in range(num_matches)]
        ax.plot(signal_matches, cum_freqs, label=f"{label}: {num_matches} | p-value: {p_value:.5f}")
    if ret_ax:
        return ax
    else:
        plt.show()

In [16]:

def decorate_ax(ax, title, show=True):
    ax.grid()
    ax.legend(bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0)
    ax.set_title(title)
    ax.set_xlabel("$\log_{2}(RNA/mock)$"), ax.set_ylabel("Cumulative frequency")
    if show:
        plt.show()

プログラムの確認がてら、ガイド鎖の5'末端から2-8塩基目の領域であるseed領域と相補的な配列を持つmRNA群の抑制効果を調べる。

In [17]:

seed_comp = "GAGTTCA"

In [18]:

compare_with_match(df, array=seed_comp)

望み通りの結果となった。

続いて、7merというのは固定し、ガイド鎖の5'末端から走査して調べてみる。

In [19]:

guide_array   = "UUGAACUCGGUGUUGAUGGCG" # 5'-3'
g_match_array = "CGCCATCAACACCGAGTTCAA" # 5'-3'
length = len(guide_array)
k = 7

In [20]:

fig, ax = plt.subplots(figsize=(12,8))
for p in reversed(range(length+1)):
    array = g_match_array[p-k:p]
    ax = plot_only_match(
        df, array=array, ax=ax, ret_ax=True,
        label=f"{array}({length-p+1}-{length-p+k})",
    )
    if p==k: break

decorate_ax(ax, title="The difference between \n'7mer positions' and 'Expression Levels'", show=True)

2-8領域の有意性が示された。（1-7, 3-9 もp値が低く、有意に差が出ていると言えるが、3-7を含むと必然的に2-8を含む可能性が高まることによると考えられる。）

Extension¶

まず、siRNAのガイド鎖は「5′-XXUUGAACUCGGUGUUGAUGGCG-3′」であり、シード領域と相補的な配列はDNAの塩基で記述すると、「5′-GAGTTCA-3′」であった。ここで、この 7mer の両脇に来る塩基によって有意な差がないか調べる。

(仮説1) siRNAとの結合性：拡張した領域がsiRNAと相補的な配列を持つ場合、結合性が強まり、発現が抑制される。
(仮説2) RNA secondary structure：mRNAの相補的な配列が二次元構造を作りやすい場合、発現の抑制が阻害される。

In [21]:

nucleotide = ["A","C","G","T"]

In [22]:

fig, ax = plt.subplots(figsize=(12,8))
for nuc in nucleotide:
    array =     nuc + seed_comp    
    label = f"'{nuc}'{seed_comp}"
    ax = plot_only_match(df, array=array, label=label, ax=ax, ret_ax=True)
    
decorate_ax(ax, title="The difference between \nExtended nucleotide types (2-9)", show=True)

In [23]:

fig, ax = plt.subplots(figsize=(12,8))
for nuc in nucleotide:
    array =    seed_comp + nuc    
    label = f"{seed_comp}'{nuc}'"
    ax = plot_only_match(df, array=array, label=label, ax=ax, ret_ax=True)
    
decorate_ax(ax, title="The difference between \nExtended nucleotide types (1-8)", show=True)

(検証1) siRNAとの結合性：拡張した領域がsiRNAと相補的な配列を持つ場合（2-9: C, 1-8: A）「むしろ」発現の抑制が阻害されている。
(検証2) RNA secondary structure：mRNAの相補的な配列が二次元構造を作りやすい場合

(2-9)	(1-8)	(1-8)
"T"GAGTTCA	GAGTTCA"C"	GAGTTCA"A"
"("((..)))	..((..)")"	...((.)")"
抑制	抑制	阻害(検証1と重複)

以上より、見事に予想と逆の結果となった。

Insertion¶

続いて、シード領域と完全に相補的な配列ではなく、シード領域に対して「挿入」があると見なせる配列の場合を考える。

In [24]:

for i in range(1,7):
    fig, ax = plt.subplots(figsize=(12,8))
    for nuc in nucleotide:
        array = seed_comp[:i] + nuc + seed_comp[i:]
        label = seed_comp[:i] + f"'{nuc}'" + seed_comp[i:]
        ax = plot_only_match(df, array=array, label=label, ax=ax, ret_ax=True)
    decorate_ax(ax, title="The difference between \n'inserted nucleotide types' and 'Expression Levels'", show=True)

GAGTT'G'CAの曲線が気になる…。
GAGTTC'A'Aが有意に抑制されているように見えるが、これは単純にseed領域と相補的な7merを含んでいるからである。

Deletion¶

続いて、シード領域と完全に相補的な配列ではなく、シード領域に対して「欠失」があると見なせる配列の場合を考える。

In [25]:

fig, ax = plt.subplots(figsize=(12,8))
for i in range(1,6):
    array = seed_comp[:i] + seed_comp[i+1:]
    label = seed_comp[:i] + "-" + seed_comp[i+1:]
    ax = plot_only_match(df, array=array, label=label, ax=ax, ret_ax=True)

decorate_ax(ax, title="The difference between \n'deletion positions' and 'Expression Levels'", show=True)

欠失があると見なせる配列の場合、有意な違いは見られなかった。

bulge(insertion)¶

ここから、"RNA二次構造を踏まえた"オフターゲット効果について調べる。

siRNAのガイド鎖は「5′-XXUUGAACUCGGUGUUGAUGGCG-3′」であり、シード領域と相補的な配列はDNAの塩基で記述すると、「5′-GAGTTCA-3′」であった。

In [26]:

import itertools

def plot_bulges(df, length, threshold=1):    
    for i in range(1,7):
        fig, ax = plt.subplots(figsize=(12,8))
        for nucs in itertools.product(nucleotide, repeat=length):
            bulge = "".join(nucs)
            array = seed_comp[:i] + bulge + seed_comp[i:]
            label = seed_comp[:i] + f"'{bulge}'" + seed_comp[i:]
            ax = plot_only_match(df, array=array, label=label, threshold=threshold, ax=ax, ret_ax=True)
        decorate_ax(ax, title=f"The difference between \n'bulge nucleotide types' and 'Expression Levels'\nInserted Pos: {i+1}", show=True)

bulge length: 2¶

In [27]:

plot_bulges(df, length=2, threshold=0.1)

bulge length: 3¶

In [28]:

plot_bulges(df, length=3, threshold=0.01)

/Users/iwasakishuto/.pyenv/versions/anaconda3-5.0.1/lib/python3.6/site-packages/numpy/core/fromnumeric.py:3157: RuntimeWarning: Degrees of freedom <= 0 for slice
  **kwargs)
/Users/iwasakishuto/.pyenv/versions/anaconda3-5.0.1/lib/python3.6/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

bulge length: 4¶

In [29]:

plot_bulges(df, length=4, threshold=0.001)

In [ ]:

マイクロアレイ実習卒研

マイクロアレイデータ解析¶

【Purpose】¶

【Materials】¶

microarray¶

【Analysis】¶

Index

Off-target Effect¶

seedmatch¶

Extension¶

Insertion¶

Deletion¶

bulge(insertion)¶

bulge length: 2¶

bulge length: 3¶

bulge length: 4¶

Table of Contents

Published

Last Updated

Category

Tags

Contact

Other contents