import numpy as np
import matplotlib.pyplot as plt
Section 3.1 Multiple Testing Correction
"The more inferences are made, the more likely erroneous inferences are to occur." (For example, even if the per-test type I error rate is 1%, then out of 1000 tests of true null hypotheses, 10 will be rejected on average.)
Terminology
- FWER (family-wise error rate): the probability of making at least one type I error (false positive) in the family of tests.
- FDR (false discovery rate): the expected proportion of type I errors (false positives) among the rejected hypotheses.
FDR (False Discovery Rate)
- Null hypotheses (true nulls): $H_1,\ldots,H_{g_0}$
- Alternative hypotheses: $H_{g_0+1},\ldots,H_g$
- Procedure to reject hypotheses: $d_{\alpha}: \left(P_1,P_2,\ldots,P_g\right)\rightarrow\left\{H_{i_1},\ldots,H_{i_K}\right\}$
- Counts of false and true positives among the $K$ rejected hypotheses: $\mathrm{FP}=\sum_{k=1}^K\mathbb{I}(1\leq i_k\leq g_0),\quad \mathrm{TP}=K-\mathrm{FP}$
- Proportion of false positives among rejections: $\mathcal{Q} = \begin{cases}0 & \text{if }K=(\mathrm{FP+TP})=0\\\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TP}} & \text{if }K>0\end{cases}$ $$\mathrm{FDR}(d_{\alpha})=\mathbb{E}\left(\mathcal{Q}\mid H_1,\ldots,H_g\right)$$
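To make $\mathcal{Q}$ and the two error rates concrete, here is a minimal simulation sketch (the one-sided z-test setup, the effect size 3, and the counts $g=1000$, $g_0=900$ are assumptions chosen purely for illustration) that estimates the FWER and FDR of the naive procedure rejecting every hypothesis with $p<\alpha$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, g, g0, trials = 0.05, 1000, 900, 200   # g0 true nulls among g tests

Q, any_fp = [], 0
for _ in range(trials):
    # one-sided z-tests: null statistics ~ N(0,1), alternatives ~ N(3,1)
    z = np.concatenate([rng.normal(0.0, 1.0, g0), rng.normal(3.0, 1.0, g - g0)])
    p = norm.sf(z)                       # one-sided p-values
    reject = p < alpha                   # naive per-test threshold, no correction
    fp = reject[:g0].sum()               # FP: rejected true nulls
    k = reject.sum()                     # K: total rejections
    Q.append(fp / k if k > 0 else 0.0)   # Q as defined above
    any_fp += fp > 0

print(f"estimated FWER: {any_fp / trials:.3f}")   # P(at least one FP), near 1 here
print(f"estimated FDR : {np.mean(Q):.3f}")        # E[Q], roughly 0.3 in this setup
```

With 900 true nulls each tested at $\alpha=0.05$, at least one false positive is almost certain, which is exactly the problem the corrections below address.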
Methods to Counteract This Problem
Bonferroni Correction
Set the per-test significance level $\alpha=\mathbb{P}\left(s_{\alpha}\leq S_h\mid H_0\right)$ to $\alpha\leftarrow\frac{1}{g}\cdot\alpha_{\mathrm{FWER}}$ ($g$: number of tests).
Therefore, we have: $$\mathbb{P}\left(\{s_{\alpha}\leq S_1\}\cup\cdots\cup\{s_{\alpha}\leq S_g\}|H_0\right)\leq\sum_{h=1}^g\mathbb{P}\left(s_{\alpha}\leq S_h|H_0\right) = g\cdot\left(\frac{1}{g}\alpha_{\mathrm{FWER}}\right) = \alpha_{\mathrm{FWER}}$$
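As a minimal sketch of the correction, the snippet below applies the Bonferroni threshold $\alpha_{\mathrm{FWER}}/g$ to the ten p-values of the B-H example further down (the target $\alpha_{\mathrm{FWER}}=0.1$ is assumed for illustration):

```python
import numpy as np

alpha_FWER = 0.1   # target family-wise error rate (assumed for illustration)
p_value = np.array([0.1, 0.4, 0.8, 0.035, 0.9, 0.04, 0.001, 0.002, 0.6, 0.7])
g = len(p_value)   # number of tests

# Bonferroni: test every hypothesis at the corrected level alpha_FWER / g
reject = p_value <= alpha_FWER / g
print(f"per-test level: {alpha_FWER / g}")                        # 0.01
print(f"Reject: {[f'G{i+1}' for i in np.flatnonzero(reject)]}")   # G7 and G8 only
```

At the same nominal level 0.1, Bonferroni rejects only $G_7$ and $G_8$, whereas the B-H procedure below additionally rejects $G_4$ and $G_6$: controlling FWER is stricter than controlling FDR.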
Benjamini-Hochberg Procedure (B-H Procedure)
If the test statistics are independent, the p-values are distributed as: $$\left(P_1,\ldots,P_g\right)\sim\mathrm{Unif}\left(0,1\right)^{\times g_0}\times\mathbb{P}\left(P\mid H_{\text{alternative}}\right)^{\times\left(g-g_0\right)}$$ Sort the p-values as $p_1\leq\cdots\leq p_g$. Then, $$ \begin{aligned} d_{\alpha_{\mathrm{FDR}}}^{\mathrm{B\text{-}H}}(p_1,\ldots,p_g) &:=\mathrm{Reject}\left\{H_i \,\middle|\, 1\leq i\leq\max\left\{n \,\middle|\, p_n\leq\frac{n}{g}\alpha_{\mathrm{FDR}}\right\}\right\}\\ &\Rightarrow \mathrm{FDR}\left(d_{\alpha_{\mathrm{FDR}}}^{\mathrm{B\text{-}H}}\right)\leq\alpha_{\mathrm{FDR}} \end{aligned} $$
Example.
$H_i$ | $G_1$ | $G_2$ | $G_3$ | $G_4$ | $G_5$ | $G_6$ | $G_7$ | $G_8$ | $G_9$ | $G_{10}$ |
---|---|---|---|---|---|---|---|---|---|---|
$p_i$ | $0.1$ | $0.4$ | $0.8$ | $0.035$ | $0.9$ | $0.04$ | $0.001$ | $0.002$ | $0.6$ | $0.7$ |
alpha_FDR = 0.1
g = 10
idx = np.arange(1, g + 1)                 # gene indices 1..g
p_value = np.array([0.1, 0.4, 0.8, 0.035, 0.9, 0.04, 0.001, 0.002, 0.6, 0.7])
q = idx / g * alpha_FDR                   # B-H thresholds (n/g) * alpha_FDR
p_value_rank = np.argsort(p_value)        # indices sorting the p-values ascending

# Find the largest n with p_(n) <= (n/g) * alpha_FDR; reject the n smallest.
n_reject = 0
for i in reversed(range(g)):
    if p_value[p_value_rank[i]] <= q[i]:
        n_reject = i + 1
        break
print(f"Reject: {[f'G{j}' for j in idx[p_value_rank[:n_reject]]]}")
Quantile-Quantile plot
x = -np.log10(np.arange(1, g + 1) / g)    # expected p-value quantiles under H0
y = -np.log10(p_value[p_value_rank])      # observed p-values, sorted ascending
plt.plot(x, y, color="red", label="observed")
lim = min(x[0], y[0])
plt.plot((0, lim), (0, lim), color="black", label="$y=x$")
plt.title("Quantile-Quantile plot")
plt.xlabel(r"$-\log_{10}$ expected P-value")   # raw strings avoid escape warnings
plt.ylabel(r"$-\log_{10}$ observed P-value")
plt.legend()
plt.grid()
plt.show()
Section 3.2 Probabilistic Models
Hypothesis Testing | Probabilistic Modeling |
---|---|
Models data generation processes in uninteresting scenarios $H_0$ | Models data generation processes under interesting biological scenarios $H_1$ (possibly as well as $H_0$) |
Shows inconsistency with $H_0$ | Checks the fit of the model to the data |
Problems of Hypothesis Testing Approaches
- Difficult to model uninteresting scenarios satisfactorily
- Does not model the interesting biological scenarios
- A rejected hypothesis does not necessarily tell us anything interesting
- Still useful for many small parts of a full analysis
Probabilistic Modeling
- Strong regularity from the probability condition
- Can include various biological knowledge
- Adjustable to the quality, amount, randomness of the data
- Conditional probability makes it easy to integrate multiple models
- Once a satisfactory model is obtained, we can extract various information from expectation values
- Powerful general tools are available
- Maximum Likelihood
- EM algorithms
- Bayesian inference
- MCMC sampling
- etc...
Section 3.3 Likelihood
- Q. How do we adjust the model parameters?
- A. Maximize the likelihood.
Likelihood
- Observed data: $\mathbf{D} = \left\{x^{(h)}\mid h=1,\ldots,n\right\}$
- Parametrized model distribution: $\mathbb{P}(x|\theta)$
- Likelihood (assuming the samples are drawn independently): $L(\theta|\mathbf{D}) = \mathbb{P}(\mathbf{D}|\theta) = \prod_{h=1}^n\mathbb{P}\left(x^{(h)}|\theta\right)$
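For instance, with a Gaussian model $\mathbb{P}(x|\theta)$, $\theta=(\mu,\sigma)$, the likelihood can be maximized in closed form. A minimal sketch (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=100)   # observed data x^(1..n), simulated

# log L(theta | D) = sum_h log P(x^(h) | theta)
def log_likelihood(mu, sigma, D):
    return norm.logpdf(D, loc=mu, scale=sigma).sum()

# For the Gaussian model the maximizer is available in closed form:
mu_hat, sigma_hat = D.mean(), D.std()          # MLE (std uses 1/n, not 1/(n-1))
print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
print(f"max log-likelihood: {log_likelihood(mu_hat, sigma_hat, D):.3f}")
```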