# HMM

## Notebook
Example Notebook: Kerasy.examples.HMM.ipynb
The hidden Markov model (HMM) is a powerful statistical method for characterizing the observed data samples of a discrete-time series. Its joint probability distribution over both latent and observed variables is given by

$$p\left(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}\right)=p\left(\mathbf{z}_{1} \mid \boldsymbol{\pi}\right)\left[\prod_{n=2}^{N} p\left(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}, \mathbf{A}\right)\right] \prod_{m=1}^{N} p\left(\mathbf{x}_{m} \mid \mathbf{z}_{m}, \boldsymbol{\phi}\right)\qquad (13.10)$$

where \(\mathbf{Z}=\left\{\mathbf{z}_1,\ldots,\mathbf{z}_N\right\}\) are the discrete latent variables, \(\mathbf{X}=\left\{\mathbf{x}_1,\ldots,\mathbf{x}_N\right\}\) are the corresponding observations, and \(\boldsymbol{\theta} = \left\{\boldsymbol{\pi},\mathbf{A},\boldsymbol{\phi}\right\}\) denotes the set of parameters governing the model.
Name | Probability | Conditional Distribution |
---|---|---|
initial state | $$\pi_{k} \equiv p\left(z_{1 k}=1\right)$$ | $$p\left(\mathbf{z}_{1} \mid \boldsymbol{\pi}\right)=\prod_{k=1}^{K} \pi_{k}^{z_{1 k}}\quad (13.8)$$ |
transition probability | $$A_{j k} \equiv p\left(z_{n k}=1\mid z_{n-1, j}=1\right)$$ | $$p\left(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}, \mathbf{A}\right)=\prod_{k=1}^{K} \prod_{j=1}^{K} A_{j k}^{z_{n-1, j} z_{n k}}\quad (13.7)$$ |
emission probability | depends on your model | $$p\left(\mathbf{x}_n\mid\mathbf{z}_n,\boldsymbol{\phi}\right) = \prod_{k=1}^Kp\left(\mathbf{x}_n\mid\phi_k\right)^{z_{nk}}\quad (13.9)$$ |
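The factorization \((13.10)\) also describes the generative process: draw \(\mathbf{z}_1\) from \(\boldsymbol{\pi}\), propagate the chain with \(\mathbf{A}\), and emit each \(\mathbf{x}_n\) from the active state's emission distribution. A minimal NumPy sketch of this process (the two-state Bernoulli parameters here are made up purely for illustration):

```python
# Sampling from an HMM according to (13.10), with made-up parameters.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])            # initial-state probabilities (13.8)
A  = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # transition probabilities (13.7)
mu = np.array([0.1, 0.7])            # per-state Bernoulli emission means (13.9)

N = 10
z = np.empty(N, dtype=int)
x = np.empty(N, dtype=int)
z[0] = rng.choice(2, p=pi)                  # z_1 ~ p(z_1 | pi)
x[0] = rng.random() < mu[z[0]]              # x_1 ~ p(x_1 | z_1, phi)
for n in range(1, N):
    z[n] = rng.choice(2, p=A[z[n - 1]])     # z_n ~ p(z_n | z_{n-1}, A)
    x[n] = rng.random() < mu[z[n]]          # x_n ~ p(x_n | z_n, phi)
print(z, x)
```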
If we have observed a data set \(\mathbf{X} = \left\{\mathbf{x}_1,\ldots,\mathbf{x}_N\right\}\), we can determine the parameters of an HMM using maximum likelihood. The likelihood function is obtained by marginalizing the joint distribution \((13.10)\) over the latent variables

$$p\left(\mathbf{X} \mid \boldsymbol{\theta}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}\right)\qquad (13.11)$$

However, the summation \(\sum_{\mathbf{Z}}\) does not factorize over the individual \(\mathbf{z}_n\) (they are coupled through the Markov chain), so direct maximization leads to complex expressions, and we turn instead to the expectation maximization (EM) algorithm.
The EM algorithm starts with some initial selection for the model parameters, which we denote by \(\boldsymbol{\theta}^{\text{old}}\).
In the E step, we take these parameter values and find the posterior distribution of the latent variables \(p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)\). We then use this posterior distribution to evaluate the expectation of the logarithm of the complete-data likelihood function, as a function of the parameters \(\boldsymbol{\theta}\), to give the function \(Q\left(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}}\right)\) defined by
$$ Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) \ln p\left(\mathbf{X},\mathbf{Z} \mid \boldsymbol{\theta}\right)\qquad (13.12) $$

At this point, it is convenient to introduce some notation. We shall use \(\gamma(\mathbf{z}_n)\) to denote the marginal posterior distribution of a latent variable \(\mathbf{z}_n\), and \(\xi\left(\mathbf{z}_{n-1}, \mathbf{z}_{n}\right)\) to denote the joint posterior distribution of two successive latent variables, so that

$$ \begin{aligned} \gamma\left(\mathbf{z}_{n}\right) &=p\left(\mathbf{z}_{n} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) &(13.13)\\ \xi\left(\mathbf{z}_{n-1}, \mathbf{z}_{n}\right) &=p\left(\mathbf{z}_{n-1}, \mathbf{z}_{n} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) &(13.14) \end{aligned} $$

If we substitute the joint distribution \(p\left(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta}\right)\) given by \((13.10)\) into \((13.12)\), and make use of the definitions of \(\gamma\) and \(\xi\), we obtain

$$ \begin{aligned} Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)=& \sum_{k=1}^{K} \gamma\left(z_{1 k}\right) \ln \pi_{k}+\sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi\left(z_{n-1, j}, z_{n k}\right) \ln A_{j k} \\ &+\sum_{n=1}^{N} \sum_{k=1}^{K}\gamma\left(z_{n k}\right) \ln p\left(\mathbf{x}_{n} \mid \boldsymbol{\phi}_{k}\right) \end{aligned}\qquad (13.17) $$

The goal of the E step is to evaluate the quantities \(\gamma\left(\mathbf{z}_n\right)\) and \(\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right)\) efficiently (→ forward-backward algorithm).

In the M step, we maximize \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) with respect to the parameters \(\boldsymbol{\theta} = \left\{\boldsymbol{\pi},\mathbf{A},\boldsymbol{\phi}\right\}\), in which we treat \(\gamma\left(\mathbf{z}_n\right)\) and \(\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right)\) as constant. Maximization with respect to \(\boldsymbol{\pi}\) and \(\mathbf{A}\) is easily achieved using appropriate Lagrange multipliers, with the results
$$ \begin{aligned} \pi_{k}&= \frac{\gamma\left(z_{1 k}\right)}{\sum_{j=1}^{K} \gamma\left(z_{1 j}\right)} & (13.18)\\ A_{j k}&= \frac{\sum_{n=2}^{N} \xi\left(z_{n-1, j}, z_{n k}\right)}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi\left(z_{n-1, j}, z_{n l}\right)} & (13.19) \end{aligned} $$

To maximize \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) with respect to \(\phi_k\), we notice that only the final term in \((13.17)\) depends on \(\phi_k\), and that the quantities \(\gamma\left(z_{nk}\right)\) play the role of responsibilities. If the parameters \(\phi_k\) are independent for the different components, then this term decouples into a sum of terms, one for each value of \(k\), each of which can be maximized independently. We are then simply maximizing the weighted log likelihood function for the emission density \(p\left(\mathbf{x}\mid\phi_k\right)\) with weights \(\gamma\left(z_{nk}\right)\).
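For intuition, here is a minimal NumPy sketch of a scaled forward-backward E step together with the \(\boldsymbol{\pi}\) and \(\mathbf{A}\) updates \((13.18)\) and \((13.19)\); it is a plain reimplementation of these formulas under the usual scaling trick, not kerasy's internal code:

```python
import numpy as np

def e_step(pi, A, B):
    """Scaled forward-backward pass.
    B[n, k] = p(x_n | z_nk = 1) is the emission likelihood table.
    Returns gamma (13.13), the summed xi statistics used in (13.19),
    and ln p(X) as a by-product of the scaling factors."""
    N, K = B.shape
    alpha = np.empty((N, K)); beta = np.empty((N, K)); c = np.empty(N)
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for n in range(1, N):                          # forward recursion
        alpha[n] = (alpha[n - 1] @ A) * B[n]
        c[n] = alpha[n].sum(); alpha[n] /= c[n]
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):                 # backward recursion
        beta[n] = (A @ (B[n + 1] * beta[n + 1])) / c[n + 1]
    gamma = alpha * beta                           # (13.13)
    xi_sum = np.zeros((K, K))
    for n in range(1, N):                          # sum_n xi(z_{n-1}, z_n)
        xi_sum += np.outer(alpha[n - 1], B[n] * beta[n] / c[n]) * A
    return gamma, xi_sum, np.log(c).sum()

def m_step_pi_A(gamma, xi_sum):
    pi_new = gamma[0] / gamma[0].sum()                   # (13.18)
    A_new = xi_sum / xi_sum.sum(axis=1, keepdims=True)   # (13.19)
    return pi_new, A_new
```

Scaling the forward variables keeps the recursions numerically stable over long sequences and yields \(\ln p(\mathbf{X})\) for free, which is why practical implementations work with scaled or log-domain quantities rather than raw \(\alpha\) and \(\beta\).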
## Bernoulli HMM

```python
class kerasy.ML.HMM.BernoulliHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)
```
For the case of discrete Bernoulli observed variables \(\mathbf{x}_n \in \{0,1\}^D\), the conditional distribution of the observations takes the form

$$p\left(\mathbf{x}\mid\boldsymbol{\phi}_k\right)=\prod_{i=1}^{D} \mu_{ik}^{x_{i}}\left(1-\mu_{ik}\right)^{1-x_{i}}$$

and the corresponding M-step equations are given by

$$\mu_{ik}=\frac{\sum_{n=1}^{N} \gamma\left(z_{nk}\right) x_{ni}}{\sum_{n=1}^{N} \gamma\left(z_{nk}\right)}$$
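As a sketch (not kerasy's internal code), this weighted-MLE update is a one-liner in NumPy:

```python
import numpy as np

def bernoulli_m_step(X, gamma):
    """Weighted MLE of the per-state Bernoulli means.
    X is (N, D) binary data, gamma is (N, K) responsibilities from the
    E step; returns mu with mu[k, i] = gamma-weighted mean of feature i."""
    # mu[k, i] = sum_n gamma(z_nk) x_ni / sum_n gamma(z_nk)
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```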
## Multinomial HMM

```python
class kerasy.ML.HMM.MultinomialHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)
```
For the case of discrete multinomial observed variables, where each \(\mathbf{x}_n\) uses a 1-of-\(D\) coding scheme, the conditional distribution of the observations takes the form

$$p\left(\mathbf{x}\mid\boldsymbol{\phi}_k\right)=\prod_{i=1}^{D} \phi_{ik}^{x_{i}},\qquad \sum_{i=1}^{D}\phi_{ik}=1$$

※ When the number of categories \(k\) is \(2\) and the number of trials \(n\) is \(1\), the multinomial distribution is the Bernoulli distribution.

and the corresponding M-step equations are given by

$$\phi_{ik}=\frac{\sum_{n=1}^{N} \gamma\left(z_{nk}\right) x_{ni}}{\sum_{n=1}^{N} \gamma\left(z_{nk}\right)}$$
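A hypothetical usage sketch follows. Only the constructor signature above is taken from this page; the input format and the `fit`/`predict` calls are assumptions in the style of scikit-learn, so check Kerasy.examples.HMM.ipynb for the actual interface:

```python
# Hypothetical usage sketch: only the constructor signature is documented
# above. The input format and fit/predict method names are assumptions
# (scikit-learn style); see Kerasy.examples.HMM.ipynb for the real API.
import numpy as np
from kerasy.ML.HMM import MultinomialHMM

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(100, 1))   # assumed: one symbol in {0,..,3} per step

model = MultinomialHMM(n_hstates=3, init="random", random_state=0)
model.fit(X)                            # EM training (assumed method name)
states = model.predict(X)               # Viterbi decoding (assumed method name)
print(states[:10])
```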
## Binomial HMM

```python
class kerasy.ML.HMM.BinomialHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)
```
For the case of discrete binomial observed variables, i.e. success counts \(x_n \in \{0,1,\ldots,M\}\) out of a known number of trials \(M\), the conditional distribution of the observations takes the form

$$p\left(x\mid\phi_k\right)=\binom{M}{x} p_{k}^{x}\left(1-p_{k}\right)^{M-x}$$

※ A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., \(n = 1\), the binomial distribution is a Bernoulli distribution.

and the corresponding M-step equations are given by

$$p_{k}=\frac{\sum_{n=1}^{N} \gamma\left(z_{nk}\right) x_{n}}{M\sum_{n=1}^{N} \gamma\left(z_{nk}\right)}$$
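To make the emission concrete, the likelihood table \(B[n][k]=p(x_n\mid z_{nk}=1)\) that forward-backward and Viterbi consume can be evaluated with SciPy; a small sketch with made-up parameters and an assumed known trial count \(M\):

```python
# Sketch of evaluating the binomial emission likelihood table
# B[n][k] = p(x_n | z_nk = 1), assuming a known trial count M; the
# per-state success probabilities are made-up values for illustration.
import numpy as np
from scipy.stats import binom

M = 10                                  # trials per observation (assumed known)
p = np.array([0.2, 0.5, 0.8])           # success probability of each hidden state
x = np.array([1, 4, 9, 7, 2])           # observed success counts (N = 5)

log_B = binom.logpmf(x[:, None], M, p[None, :])   # shape (N, K) = (5, 3)
print(log_B.round(2))
```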
## Gaussian HMM

```python
class kerasy.ML.HMM.GaussianHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="itmc", random_state=None,
    covariance_type="diag", min_covariance=1e-3)
```
For Gaussian emission densities we have \(p(\mathbf{x}\mid\boldsymbol{\phi}_k) = \mathcal{N}\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\right)\), and maximization of the function \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) then gives

$$ \begin{aligned} \boldsymbol{\mu}_{k}&=\frac{\sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mathbf{x}_{n}}{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)} &(13.20)\\ \boldsymbol{\Sigma}_{k}&=\frac{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}\right)^{\mathrm{T}}}{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)} &(13.21) \end{aligned} $$
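A minimal NumPy sketch of these two updates (a plain implementation of \((13.20)\) and \((13.21)\), not kerasy's internal code):

```python
import numpy as np

def gaussian_m_step(X, gamma):
    """Weighted MLE of the per-state Gaussian parameters, (13.20)-(13.21).
    X is (N, D) data, gamma is (N, K) responsibilities; returns full
    covariance matrices (restricting them per `covariance_type` is a
    separate step)."""
    Nk = gamma.sum(axis=0)                       # effective counts per state
    mu = (gamma.T @ X) / Nk[:, None]             # (13.20)
    K, D = mu.shape
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # (13.21)
    return mu, Sigma
```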
### Covariance type

#### Definition
In the following, `nh` denotes the number of hidden states and `nf` the number of features.

Name | Description | Example | Parameter size |
---|---|---|---|
spherical | Each state uses a single variance value that applies to all features. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma^{(k)} & & & 0 \\ & \sigma^{(k)} & & \\ & & \ddots & \\ 0 & & & \sigma^{(k)} \end{array}\right)$$ | nh * 1 |
diag | Each state uses a diagonal covariance matrix. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma^{(k)}_{1} & & & 0 \\ & \sigma^{(k)}_{2} & & \\ & & \ddots & \\ 0 & & & \sigma^{(k)}_{\text{nf}} \end{array}\right)$$ | nh * nf |
full | Each state uses a full (i.e. unrestricted) covariance matrix. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma^{(k)}_{1} & \sigma^{(k)}_{2} & \cdots & \sigma^{(k)}_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} \\ \sigma^{(k)}_{2} & \sigma^{(k)}_{3} & \cdots & \sigma^{(k)}_{\frac{\left(\text{nf}-2\right)\cdot\left(\text{nf}-1\right)}{2} + 1}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^{(k)}_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} & & & \sigma^{(k)}_{\frac{\text{nf}\ast\left(\text{nf}+1\right)}{2}} \end{array}\right)$$ | nh * nf * (nf + 1) // 2 |
tied | All states use the same full covariance matrix. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma_{1} & \sigma_{2} & \cdots & \sigma_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} \\ \sigma_{2} & \sigma_{3} & \cdots & \sigma_{\frac{\left(\text{nf}-2\right)\cdot\left(\text{nf}-1\right)}{2} + 1}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} & & & \sigma_{\frac{\text{nf}\ast\left(\text{nf}+1\right)}{2}} \end{array}\right)$$ | nf * (nf + 1) // 2 |
#### Conversion function
The conversion functions are defined in kerasy/utils/np_utils.py.
```python
def compress_based_on_covariance_type_from_tied_shape(tied_cv, covariance_type, n_gaussian):
    """
    Create all the covariance matrices
    from a given template.
    """

def decompress_based_on_covariance_type(covars, covariance_type='full', n_gaussian=1, n_features=1):
    """
    Create the correct shape of covariance matrix
    from each feature based on the covariance type.
    """
```
## Gaussian Mixture HMM

```python
class kerasy.ML.HMM.GaussianMixtureHMM()
```
## Correspondence between code and formula
code | index | formula |
---|---|---|
log_cond_prob | $$[n][k] = p\left(\mathbf{x}_n\mid z_{nk}\right)$$ | $$p\left(\mathbf{X}\mid\mathbf{Z}\right)$$ |
log_prob | constant | $$\begin{aligned}p\left(\mathbf{X}\right) &=\sum_{\mathbf{z}_n}\alpha(\mathbf{z}_n)\beta(\mathbf{z}_n)\quad \text{for all $\mathbf{z}_n$}\\&=\sum_{\mathbf{z}_N}\alpha(\mathbf{z}_N)\end{aligned}$$ |
log_alpha | $$[n][k] = \alpha\left(z_{nk}\right)$$ | $$\alpha\left(\mathbf{z}_n\right) \equiv p\left(\mathbf{x}_1,\ldots,\mathbf{x}_n,\mathbf{z}_n\right)$$ |
log_beta | $$[n][k] = \beta\left(z_{nk}\right)$$ | $$\beta\left(\mathbf{z}_n\right)\equiv p\left(\mathbf{x}_{n+1},\ldots,\mathbf{x}_N\mid\mathbf{z}_n\right)$$ |
posterior_prob | $$[n][k] = \gamma\left(z_{nk}\right)$$ | $$\begin{aligned}\gamma\left(\mathbf{z}_n\right) &= p\left(\mathbf{z}_n\mid\mathbf{X}\right)= \frac{p\left(\mathbf{z}_n,\mathbf{X}\right)}{p\left(\mathbf{X}\right)} \\&= \frac{\alpha\left(\mathbf{z}_n\right)\beta\left(\mathbf{z}_n\right)}{p\left(\mathbf{X}\right)}\end{aligned}$$ |
log_xi_sum | $$[k][j]= \sum_{n=2}^N\xi\left(z_{n-1,k},z_{n,j}\right)$$ | $$\begin{aligned}\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right) &= p\left(\mathbf{z}_{n-1},\mathbf{z}_n\mid\mathbf{X}\right) \\&= \frac{\alpha\left(\mathbf{z}_{n-1}\right)p\left(\mathbf{x}_n\mid\mathbf{z}_n\right)p\left(\mathbf{z}_n\mid\mathbf{z}_{n-1}\right)\beta\left(\mathbf{z}_n\right)}{p\left(\mathbf{X}\right)}\end{aligned}$$ |
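As a concrete illustration of the first three rows, a log-domain forward recursion (a sketch under the table's conventions, not kerasy's code) computes `log_alpha` and obtains `log_prob` as a logsumexp over its last row:

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_pi, log_A, log_cond_prob):
    """Log-domain forward recursion in the table's notation:
    log_alpha[n][k] = ln alpha(z_nk), log_prob = ln p(X)."""
    N, K = log_cond_prob.shape
    log_alpha = np.empty((N, K))
    log_alpha[0] = log_pi + log_cond_prob[0]
    for n in range(1, N):
        # ln alpha(z_nk) = ln p(x_n|z_nk) + ln sum_j exp(ln alpha(z_{n-1,j}) + ln A_jk)
        log_alpha[n] = logsumexp(log_alpha[n - 1][:, None] + log_A, axis=0) \
                       + log_cond_prob[n]
    log_prob = logsumexp(log_alpha[-1])    # p(X) = sum_{z_N} alpha(z_N)
    return log_alpha, log_prob
```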
## Reference
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. (Chapter 13, Sequential Data; the equation numbers above follow this book.)