import numpy as np
import matplotlib.pyplot as plt
Section 3.1 Multiple Testing Correction
"The more inferences are made, the more likely erroneous inferences are to occur." (For example, even if the per-test type I error rate is 1%, then out of 1000 tests of true null hypotheses, 10 will be rejected on average.)
Terminology
- FWER (family-wise error rate): the probability of making at least one type I error (false positive) in the family of tests.
- FDR (false discovery rate): the expected proportion of type I errors (false positives) among the rejected hypotheses.
FDR (False Discovery Rate)
- Null hypotheses (true nulls): $H_1,\ldots,H_{g_0}$
- Alternative hypotheses: $H_{g_0+1},\ldots,H_g$
- Procedure to reject hypotheses: $d_{\alpha}: \left(P_1,P_2,\ldots,P_g\right)\rightarrow\left\{H_{i_1},\ldots,H_{i_K}\right\}$
- Counts of false and true positives among the $K$ rejected hypotheses: $\mathrm{FP}=\sum_{k=1}^K\mathbb{I}(1\leq i_k\leq g_0),\quad \mathrm{TP}=K-\mathrm{FP}$
- Proportion of false positives among rejections: $\mathcal{Q} = \begin{cases}0 & \text{if }K=(\mathrm{FP+TP})=0\\\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TP}} & \text{if }K>0\end{cases}$ $$\mathrm{FDR}(d_{\alpha})=\mathbb{E}\left(\mathcal{Q}\mid H_1,\ldots,H_g\right)$$
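To make $\mathcal{Q}$ and the two error rates concrete, here is a minimal simulation sketch (the one-sided z-test setup, the effect size 3, and the counts $g=1000$, $g_0=900$ are assumptions chosen purely for illustration) that estimates the FWER and FDR of the naive procedure rejecting every hypothesis with $p<\alpha$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, g, g0, trials = 0.05, 1000, 900, 200   # g0 true nulls among g tests

Q, any_fp = [], 0
for _ in range(trials):
    # one-sided z-tests: null statistics ~ N(0,1), alternatives ~ N(3,1)
    z = np.concatenate([rng.normal(0.0, 1.0, g0), rng.normal(3.0, 1.0, g - g0)])
    p = norm.sf(z)                       # one-sided p-values
    reject = p < alpha                   # naive per-test threshold, no correction
    fp = reject[:g0].sum()               # FP: rejected true nulls
    k = reject.sum()                     # K: total rejections
    Q.append(fp / k if k > 0 else 0.0)   # Q as defined above
    any_fp += fp > 0

print(f"estimated FWER: {any_fp / trials:.3f}")   # P(at least one FP), near 1 here
print(f"estimated FDR : {np.mean(Q):.3f}")        # E[Q], roughly 0.3 in this setup
```

With 900 true nulls each tested at $\alpha=0.05$, at least one false positive is almost certain, which is exactly the problem the corrections below address.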
Methods to Counteract This Problem
Bonferroni Correction
Set the per-test significance level $\alpha=\mathbb{P}\left(s_{\alpha}\leq S_h\mid H_0\right)$ to $\alpha\leftarrow\frac{1}{g}\cdot\alpha_{\mathrm{FWER}}$ ($g$: number of tests).
Therefore, we have: $$\mathbb{P}\left(\{s_{\alpha}\leq S_1\}\cup\cdots\cup\{s_{\alpha}\leq S_g\}|H_0\right)\leq\sum_{h=1}^g\mathbb{P}\left(s_{\alpha}\leq S_h|H_0\right) = g\cdot\left(\frac{1}{g}\alpha_{\mathrm{FWER}}\right) = \alpha_{\mathrm{FWER}}$$
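As a minimal sketch of the correction, the snippet below applies the Bonferroni threshold $\alpha_{\mathrm{FWER}}/g$ to the ten p-values of the B-H example further down (the target $\alpha_{\mathrm{FWER}}=0.1$ is assumed for illustration):

```python
import numpy as np

alpha_FWER = 0.1   # target family-wise error rate (assumed for illustration)
p_value = np.array([0.1, 0.4, 0.8, 0.035, 0.9, 0.04, 0.001, 0.002, 0.6, 0.7])
g = len(p_value)   # number of tests

# Bonferroni: test every hypothesis at the corrected level alpha_FWER / g
reject = p_value <= alpha_FWER / g
print(f"per-test level: {alpha_FWER / g}")                        # 0.01
print(f"Reject: {[f'G{i+1}' for i in np.flatnonzero(reject)]}")   # G7 and G8 only
```

At the same nominal level 0.1, Bonferroni rejects only $G_7$ and $G_8$, whereas the B-H procedure below additionally rejects $G_4$ and $G_6$: controlling FWER is stricter than controlling FDR.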
Benjamini-Hochberg Procedure (B-H Procedure)
If the test statistics are independent, the p-values are distributed as: $$\left(P_1,\ldots,P_g\right)\sim\mathrm{Unif}\left(0,1\right)^{\times g_0}\times\mathbb{P}\left(P\mid H_{\text{alternative}}\right)^{\times\left(g-g_0\right)}$$ Sort the p-values as $p_1\leq\cdots\leq p_g$. Then, $$ \begin{aligned} d_{\alpha_{\mathrm{FDR}}}^{\mathrm{B\text{-}H}}(p_1,\ldots,p_g) &:=\mathrm{Reject}\left\{H_i \,\middle|\, 1\leq i\leq\max\left\{n \,\middle|\, p_n\leq\frac{n}{g}\alpha_{\mathrm{FDR}}\right\}\right\}\\ &\Rightarrow \mathrm{FDR}\left(d_{\alpha_{\mathrm{FDR}}}^{\mathrm{B\text{-}H}}\right)\leq\alpha_{\mathrm{FDR}} \end{aligned} $$
Example.
$H_i$ | $G_1$ | $G_2$ | $G_3$ | $G_4$ | $G_5$ | $G_6$ | $G_7$ | $G_8$ | $G_9$ | $G_{10}$ |
---|---|---|---|---|---|---|---|---|---|---|
$p_i$ | $0.1$ | $0.4$ | $0.8$ | $0.035$ | $0.9$ | $0.04$ | $0.001$ | $0.002$ | $0.6$ | $0.7$ |
alpha_FDR = 0.1
g = 10
idx = np.arange(1, g + 1)                 # gene indices 1..g
p_value = np.array([0.1, 0.4, 0.8, 0.035, 0.9, 0.04, 0.001, 0.002, 0.6, 0.7])
q = idx / g * alpha_FDR                   # B-H thresholds (n/g) * alpha_FDR
p_value_rank = np.argsort(p_value)        # indices sorting the p-values ascending

# Find the largest n with p_(n) <= (n/g) * alpha_FDR; reject the n smallest.
n_reject = 0
for i in reversed(range(g)):
    if p_value[p_value_rank[i]] <= q[i]:
        n_reject = i + 1
        break
print(f"Reject: {[f'G{j}' for j in idx[p_value_rank[:n_reject]]]}")
Quantile-Quantile plot
x = -np.log10(np.arange(1, g + 1) / g)    # expected p-value quantiles under H0
y = -np.log10(p_value[p_value_rank])      # observed p-values, sorted ascending
plt.plot(x, y, color="red", label="observed")
lim = min(x[0], y[0])
plt.plot((0, lim), (0, lim), color="black", label="$y=x$")
plt.title("Quantile-Quantile plot")
plt.xlabel(r"$-\log_{10}$ expected P-value")   # raw strings avoid escape warnings
plt.ylabel(r"$-\log_{10}$ observed P-value")
plt.legend()
plt.grid()
plt.show()
Section 3.2 Probabilistic Models
Hypothesis Testing | Probabilistic Modeling |
---|---|
Models data generation processes in uninteresting scenarios $H_0$ | Models data generation processes under interesting biological scenarios $H_1$ (possibly as well as $H_0$) |
Shows inconsistency with $H_0$ | Checks the fit of the model to the data |
Problems of Hypothesis Testing Approaches
- Difficult to model uninteresting scenarios satisfactorily
- Does not model the interesting biological scenarios
- A rejected hypothesis does not necessarily tell us anything interesting
- Still useful for many small parts of a full analysis
Probabilistic Modeling
- Strong regularity from the probability condition
- Can include various biological knowledge
- Adjustable to the quality, amount, randomness of the data
- Conditional probability makes it easy to integrate multiple models
- Once a satisfactory model is obtained, we can extract various information from expectation values
- Powerful general tools are available
- Maximum Likelihood
- EM algorithms
- Bayesian inference
- MCMC sampling
- etc...
Section 3.3 Likelihood
- Q. How do we adjust the model parameters?
- A. Maximize the likelihood.
Likelihood
- Observed data: $\mathbf{D} = \left\{x^{(h)}\mid h=1,\ldots,n\right\}$
- Parametrized model distribution: $\mathbb{P}(x|\theta)$
- Likelihood (assuming the samples are drawn independently): $L(\theta|\mathbf{D}) = \mathbb{P}(\mathbf{D}|\theta) = \prod_{h=1}^n\mathbb{P}\left(x^{(h)}|\theta\right)$
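For instance, with a Gaussian model $\mathbb{P}(x|\theta)$, $\theta=(\mu,\sigma)$, the likelihood can be maximized in closed form. A minimal sketch (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=100)   # observed data x^(1..n), simulated

# log L(theta | D) = sum_h log P(x^(h) | theta)
def log_likelihood(mu, sigma, D):
    return norm.logpdf(D, loc=mu, scale=sigma).sum()

# For the Gaussian model the maximizer is available in closed form:
mu_hat, sigma_hat = D.mean(), D.std()          # MLE (std uses 1/n, not 1/(n-1))
print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
print(f"max log-likelihood: {log_likelihood(mu_hat, sigma_hat, D):.3f}")
```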