Problem Setting
Implement the SCODE algorithm to infer the regulatory relationships among genes in single cells.
SCODE¶
- The paper develops SCODE, a novel and efficient algorithm for inferring regulatory networks based on ordinary differential equations.
- The performance and runtimes of SCODE were state of the art.
- SCODE provides a promising approach for further single-cell differentiation analyses.
Algorithms¶
In the paper, the authors focus on transcription factors (TFs) and on inferring TF regulatory networks.
1. Describing regulatory networks and expression dynamics with linear ODEs¶
First, they describe TF expression dynamics throughout differentiation with linear ODEs:
$$\frac{d\mathbf{x}_t}{dt} = \mathbf{Ax}_t,\quad \mathbf{x}_t\in\mathbb{R}^G,\ \mathbf{A}\in\mathbb{R}^{G\times G}\quad\cdots (1)$$where $G$ is the number of TFs. The observed expression data consist of a $G \times C$ matrix $\mathbf{X}^{(e)}$, where $C$ is the number of cells. In addition, the time parameter of a cell $c$ is given as $t_c$.
1.1. Deriving $\mathbf{A}$ from a linear ODE transformation¶
At first, they consider the following linear ODE:
$$d\mathbf{z} = \mathbf{Bz}dt\quad\cdots (2)$$where $\mathbf{z}$ is a vector of length $G$ and $\mathbf{B}$ is a known square matrix. If we know a matrix $\mathbf{W}$ that satisfies $\mathbf{x} = \mathbf{Wz}$, we can derive the ODE of $\mathbf{x}$ by transforming the ODE of $\mathbf{z}$ as follows:
$$ \begin{aligned} d\mathbf{z} &= \mathbf{Bz}dt\\ d\mathbf{z} &= \mathbf{BW}^{-1}\mathbf{Wz}dt\\ \mathbf{W}d\mathbf{z} &= \mathbf{WBW}^{-1}\mathbf{Wz}dt\\ d\mathbf{x} &= \underbrace{\mathbf{WBW}^{-1}}_{\mathbf{A}}\mathbf{x}dt\quad\cdots (3) \end{aligned} $$Therefore, if the parameter $\mathbf{B}$ of $d\mathbf{z} = \mathbf{Bz}dt$ and the relationship $\mathbf{x} = \mathbf{Wz}$ are given, we can derive $\mathbf{A}$ as $\mathbf{WBW}^{-1}$.
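As a quick numerical check (a sketch with hypothetical random matrices, not the paper's code), we can confirm that $\mathbf{x}_t = \mathbf{Wz}_t$ satisfies $d\mathbf{x}/dt = \mathbf{Ax}$ with $\mathbf{A} = \mathbf{WBW}^{-1}$ when $\mathbf{B}$ is diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
G = 3
W = rng.normal(size=(G, G))            # assumed invertible for this sketch
b = np.array([-0.5, 0.1, -1.0])        # diagonal of B
A = W @ np.diag(b) @ np.linalg.inv(W)  # A = W B W^{-1}

def x(t):
    # x(t) = W z(t), with z_i(t) = exp(b_i t) for diagonal B and z(0) = 1
    return W @ np.exp(b * t)

# central finite-difference check that dx/dt ≈ A x at t = 0.3
t, h = 0.3, 1e-6
dx_dt = (x(t + h) - x(t - h)) / (2 * h)
assert np.allclose(dx_dt, A @ x(t), atol=1e-4)
```
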
1.2. Estimating $\mathbf{W}$ using linear regression¶
They assume that the problem of $\mathbf{W}$ inference can be regarded as a linear regression problem for each gene, as follows:
$$\mathbf{X}_{gc}^{(e)} = \sum_{i=1}^G \mathbf{W}_{gi}\mathbf{Z}_{ic}^{(e)} + \varepsilon_{gc}\quad\cdots (4)$$where $\mathbf{Z}^{(e)}$ is the matrix whose $c$-th column is $\mathbf{z}_{t_c}$ and $\varepsilon_{gc}$ is a noise term. Therefore, $\mathbf{W}$ can be optimized analytically and efficiently by linear regression for each TF.
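The per-gene regression of Eq. (4) can be sketched with ordinary least squares on synthetic data (all names and sizes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
G, D, C = 5, 2, 50
Z = rng.normal(size=(D, C))                      # Z^{(e)}: basis dynamics per cell
W_true = rng.normal(size=(G, D))
X = W_true @ Z + 0.01 * rng.normal(size=(G, C))  # X^{(e)} with a small noise term

# least squares for all genes at once: solve Z^T W^T ≈ X^T column by column
W_hat = np.linalg.lstsq(Z.T, X.T, rcond=None)[0].T
assert np.allclose(W_hat, W_true, atol=0.1)
```

Because each row of $\mathbf{X}^{(e)}$ is regressed on the same design matrix $\mathbf{Z}^{(e)}$, one `lstsq` call solves all $G$ regressions simultaneously.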
1.3. Dimension reduction of $\mathbf{z}$¶
The basic idea of reduction is that the patterns of expression dynamics are limited and expression dynamics can be reconstructed with a small number of patterns.
For the next step, they consider a small vector $\mathbf{z}$ to represent the original expression dynamics. Hereafter, $\mathbf{z}$ is a vector of length $D$, with $D\ll G$.
In this case, $\mathbf{W}$ is a $G\times D$ matrix; hence the pseudo-inverse $\mathbf{W}^{+}$ is used instead of the inverse, and $\mathbf{A}$ is derived as $\mathbf{A}=\mathbf{WBW}^{+}$. The matrix $\mathbf{W}$ is estimated as before, via linear regression. By using a small vector $\mathbf{z}$, the time complexity of estimating $\mathbf{W}$ becomes much lower.
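A minimal sketch of the pseudo-inverse step with random matrices (shapes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
G, D = 10, 3
W = rng.normal(size=(G, D))                  # tall G x D matrix, D << G
B = np.diag(rng.uniform(-2, 2, size=D))      # diagonal B

W_pinv = np.linalg.pinv(W)                   # Moore–Penrose pseudo-inverse, D x G
A = W @ B @ W_pinv                           # A = W B W^{+}, G x G
assert A.shape == (G, G)

# W^{+} W = I_D holds whenever W has full column rank
assert np.allclose(W_pinv @ W, np.eye(D), atol=1e-8)
```
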
1.4 Optimizing $\mathbf{B}$¶
Thus far, we have assumed $\mathbf{B}$ is given. As the next step, they optimize $\mathbf{B}$ so that the original expression dynamics can be represented even with a small $D$.
We suppose that the appropriate value of $\mathbf{B}$ satisfies the condition that the $\mathbf{Z}^{(e)}$ generated from $d\mathbf{z}=\mathbf{Bz}dt$ can predict $\mathbf{X}^{(e)}$ with $\mathbf{WZ}^{(e)}$ accurately.
$$\frac{d\mathbf{z}_t}{dt} = \mathbf{Bz}_t\longrightarrow \mathbf{z}_t = \left(e^{b_1t},\ldots,e^{b_Dt}\right)$$where $\mathbf{B}$ is diagonal with entries $b_i$ (as assumed below) and $\mathbf{z}_0 = \mathbf{1}$. Therefore, we evaluate the appropriateness of the matrix $\mathbf{B}$ with the following residual sum of squares (RSS):
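Generating $\mathbf{Z}^{(e)}$ from a diagonal $\mathbf{B}$ then reduces to evaluating these exponentials at the pseudo-times of the cells; a sketch (values are arbitrary):

```python
import numpy as np

b = np.array([-3.0, 0.5, 2.0])     # diagonal of B (D = 3)
t = np.linspace(0, 1, 5)           # pseudo-times of C = 5 cells

# Z_{ic} = exp(b_i * t_c), the closed-form solution with z(0) = 1
Z = np.exp(np.outer(b, t))
assert Z.shape == (3, 5)
assert np.allclose(Z[:, 0], 1.0)   # at t = 0, every component equals 1
```
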
$$\text{RSS}\left(\mathbf{B},\mathbf{W}\right) = \sum_{g,c}\left(\mathbf{X}^{(e)}_{gc} - \sum_{i=1}^D\mathbf{W}_{gi}\mathbf{Z}_{ic}^{(e)}\right)^2\quad\cdots (5)$$In this research, they assume $\mathbf{B}$ is a diagonal matrix whose elements satisfy $b_{\min}\leq \mathbf{B}_{ii}\leq b_{\max}$. This limitation is acceptable because very large or very small values of $\mathbf{B}_{ii}$ correspond to sharply changing dynamics and make an inefficient basis for reconstructing the expression dynamics.
They optimize $\mathbf{B}$ by random sampling and iterative optimization so that the RSS decreases. The full code is available in "kerasy/Bio/scode.py", and brief pseudocode is given below.
B = initialize_B()      # random diagonal matrix with b_min <= B_ii <= b_max
RSS_best = np.inf
for k in range(max_iter):
    # Generate Z^{(e)} from dz = B^{(k)}z dt
    Z = generate_Z(B)
    # Analytical solution of the linear regression X^{(e)} ≈ W Z^{(e)}
    W = optimize_W(X, Z)
    if RSS(B, W) < RSS_best:
        RSS_best, B_best, W_best = RSS(B, W), B.copy(), W
    else:
        B = B_best.copy()
    # Perturb one diagonal element, with i drawn uniformly from {0, ..., D-1}
    i = np.random.randint(D)
    B[i, i] = np.random.uniform(low=b_min, high=b_max)
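The loop above can be made fully runnable on a synthetic dataset. This is a toy sketch (hypothetical sizes, names, and data, not the kerasy implementation); it only demonstrates that the sampling scheme monotonically decreases the RSS:

```python
import numpy as np

rng = np.random.default_rng(2020)
G, D, C = 6, 2, 80
b_range = (-10.0, 10.0)                 # (b_min, b_max)
t = np.sort(rng.uniform(0, 2, size=C))  # pseudo-times

# synthetic data generated from a known (hypothetical) b_true and W_true
b_true = np.array([-1.5, 0.8])
W_true = rng.normal(size=(G, D))
X = W_true @ np.exp(np.outer(b_true, t)) + 0.01 * rng.normal(size=(G, C))

def fit_W(X, Z, lam=1e-2):
    # ridge closed form with Z given as D x C: W = X Z^T (Z Z^T + lam I)^{-1}
    return X @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(Z.shape[0]))

b = rng.uniform(*b_range, size=D)       # random diagonal of B
rss_first = rss_best = np.inf
for k in range(300):
    Z = np.exp(np.outer(b, t))          # Z_{ic} = exp(b_i t_c)
    W = fit_W(X, Z)
    rss = np.sum((X - W @ Z) ** 2)
    if k == 0:
        rss_first = rss
    if rss < rss_best:
        rss_best, b_best = rss, b.copy()
    else:
        b = b_best.copy()
    b[rng.integers(D)] = rng.uniform(*b_range)  # perturb one element

assert np.isfinite(rss_best) and rss_best <= rss_first
```
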
1.5. Regularized Least Squares¶
The model is likely to over-fit, especially when it is complicated. To avoid this, a regularization term is added to the error function:
$$\text{RSS}\left(\mathbf{W},\mathbf{B}\right) + \lambda \left\|\mathbf{W}\right\|^2$$The total error function and the estimate of $\mathbf{W}$ that minimizes it are then
$$ \begin{aligned} E\left(\mathbf{W}\right) &= \text{RSS}\left(\mathbf{W},\mathbf{B}\right) + \lambda \left\|\mathbf{W}\right\|^2\\ &= \sum_{g=1}^G\sum_{c=1}^C\left(x_{gc}^{\text{obs}} - \sum_{d=1}^DW_{gd}z_{t_cd}\right)^2 + \lambda\sum_{g=1}^G\sum_{d=1}^DW_{gd}^2\\ &= \text{Tr}\left\{\left(\mathbf{X}^{\text{obs}} - \mathbf{WZ}^T\right)^T\left(\mathbf{X}^{\text{obs}} - \mathbf{WZ}^T\right) + \lambda\mathbf{W}^T\mathbf{W}\right\}\\ \frac{\partial E\left(\mathbf{W}\right)}{\partial W_{gd}}&= -2\sum_{c=1}^Cz_{t_cd}\left(x_{gc}^{\text{obs}} - \sum_{d^{\prime}=1}^DW_{gd^{\prime}}z_{t_cd^{\prime}}\right) + 2\lambda W_{gd} = 0\\ 0 &= -2\left\{\sum_{c=1}^Cx_{gc}^{\text{obs}}z_{t_cd} - \sum_{d^\prime=1}^DW_{gd^{\prime}}\left(\sum_{c=1}^Cz_{t_cd^{\prime}}z_{t_cd} + \lambda\delta_{d^{\prime}d}\right)\right\}\\ &= -2 \left\{\mathbf{X}^{\text{obs}}\mathbf{Z} - \mathbf{W}\left(\mathbf{Z}^T\mathbf{Z} + \lambda \mathbf{I}\right)\right\}_{gd}\\ \therefore\mathbf{W}_{\text{ML}} &= \mathbf{X}^{\text{obs}}\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z} + \lambda \mathbf{I}\right)^{-1} \end{aligned} $$
Implementation¶
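The closed-form ridge solution derived for $\mathbf{W}_{\text{ML}}$ can be sanity-checked numerically; a minimal sketch with random matrices (not the kerasy implementation), confirming that the gradient of $E(\mathbf{W})$ vanishes at the solution:

```python
import numpy as np

rng = np.random.default_rng(0)
G, D, C, lam = 4, 2, 30, 1e-2
X = rng.normal(size=(G, C))     # stands in for X^{obs}
Z = rng.normal(size=(C, D))     # Z_{cd} = z_{t_c d}

# closed-form ridge solution: W = X Z (Z^T Z + lam I)^{-1}
W_ml = X @ Z @ np.linalg.inv(Z.T @ Z + lam * np.eye(D))

# the gradient -2{X Z - W (Z^T Z + lam I)} vanishes at W_ml
grad = -2 * (X @ Z - W_ml @ (Z.T @ Z + lam * np.eye(D)))
assert np.allclose(grad, 0.0, atol=1e-10)
```
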
Data¶
import numpy as np
with open("singlecell/tf.txt", mode="r") as f:
    tf_names = [name.rstrip("\n") for name in f.readlines()]
print(f"TF Name types: {len(tf_names)}")
with open("singlecell/expr.txt", mode="r") as f:
    expressions = np.asarray([
        line.rstrip("\n").split("\t") for line in f.readlines()
    ], dtype=float)
num_tf, num_cells = expressions.shape
print(f"the number of tf : {num_tf}")
print(f"the number of cells: {num_cells}")
with open("singlecell/time_normalized.txt", mode="r") as f:
    time_normalized = np.asarray([
        line.rstrip("\n").split("\t")[-1] for line in f.readlines()
    ], dtype=float)
print(f"the number of cells: {len(time_normalized)}")
Algorithm¶
from kerasy.Bio.scode import SCODE
D = 4
lamda = 1e-2
b_min = -10
b_max = 10
seed = 2020
max_iter = 1000
model = SCODE(lamda=lamda, b_min=b_min, b_max=b_max, random_state=seed)
model.fit(X=expressions, t=time_normalized, dimension=D, max_iter=max_iter, verbose=1)
model.show_corr(10, gene_name=tf_names)
Observe how dimension affects the results
RSSs = []
dimensions = np.arange(1, 10)
for d in dimensions:
    print(f"\nDimension: {d}")
    model = SCODE(lamda=lamda, b_min=b_min, b_max=b_max, random_state=seed)
    model.fit(X=expressions, t=time_normalized, dimension=d, max_iter=max_iter, verbose=-1)
    model.show_corr(5, gene_name=tf_names)
    RSSs.append(model.RSS)
import matplotlib.pyplot as plt
plt.plot(dimensions, RSSs, color="red")
plt.title("The relationship between\ndimensions and RSS", fontsize=16)
plt.xlabel("dimensions")
plt.ylabel("RSS")
plt.grid()
plt.show()
According to the elbow method, dimension=3 looks best.
best_dimension = 3
model = SCODE(lamda=lamda, b_min=b_min, b_max=b_max, random_state=seed)
model.fit(X=expressions, t=time_normalized, dimension=best_dimension, max_iter=max_iter, verbose=-1)
model.plot_corr()
reg, reged = model.top_pairs()
time = sorted(time_normalized)
expressions_pred = model.predict(time)
model.show_corr(1, gene_name=tf_names)
def plot_expression_level(idx, text=""):
    plt.title(f"The relationship between \nPredicted and Answer ({tf_names[idx]}{text})")
    plt.plot(time, expressions_pred[idx], color="red", label="Pred expression level")
    plt.scatter(time_normalized, expressions[idx], color="blue", label="Ans expression level", s=1)
    plt.legend()
    plt.show()
plot_expression_level(reg, text=", regulate")
plot_expression_level(reged, text=", regulated")
Relationship between EOMES and ZNF516¶
- EOMES
This gene belongs to the TBR1 (T-box brain protein 1) sub-family of T-box genes that share the common DNA-binding T-box domain. The encoded protein is a transcription factor which is crucial for embryonic development of mesoderm and the central nervous system in vertebrates. The protein may also be necessary for the differentiation of effector CD8+ T cells which are involved in defense against viral infections. A similar gene disrupted in mice is shown to be essential during trophoblast development and gastrulation. Alternative splicing results in multiple transcript variants. [provided by RefSeq, May 2013]
- ZNF516
Zinc-finger proteins bind nucleic acids and play important roles in various cellular functions, including cell proliferation, differentiation, and apoptosis. This gene encodes a zinc-finger protein, and belongs to the krueppel C2H2-type zinc-finger protein family. It may be involved in transcriptional regulation. [provided by RefSeq, Sep 2012]