# HMM

## Notebook
Example Notebook: Kerasy.examples.HMM.ipynb
The hidden Markov model (HMM) is a powerful statistical method for characterizing the observed data samples of a discrete-time series. Its joint probability distribution over both latent and observed variables is given by

$$p\left(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}\right)=p\left(\mathbf{z}_{1} \mid \boldsymbol{\pi}\right)\left[\prod_{n=2}^{N} p\left(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}, \mathbf{A}\right)\right] \prod_{m=1}^{N} p\left(\mathbf{x}_{m} \mid \mathbf{z}_{m}, \boldsymbol{\phi}\right)\qquad (13.10)$$

where \(\mathbf{Z}=\left\{\mathbf{z}_1,\ldots,\mathbf{z}_N\right\}\) are the discrete latent variables, \(\mathbf{X}=\left\{\mathbf{x}_1,\ldots,\mathbf{x}_N\right\}\) are the corresponding observations, and \(\boldsymbol{\theta} = \left\{\boldsymbol{\pi},\mathbf{A},\boldsymbol{\phi}\right\}\) denotes the set of parameters governing the model.
Name | Probability | Conditional Distribution |
---|---|---|
initial state | $$\pi_{k} \equiv p\left(z_{1 k}=1\right)$$ | $$p\left(\mathbf{z}_{1} \mid \boldsymbol{\pi}\right)=\prod_{k=1}^{K} \pi_{k}^{z_{1 k}}\quad (13.8)$$ |
transition probability | $$A_{j k} \equiv p\left(z_{n k}=1\mid z_{n-1, j}=1\right)$$ | $$p\left(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}, \mathbf{A}\right)=\prod_{k=1}^{K} \prod_{j=1}^{K} A_{j k}^{z_{n-1, j} z_{n k}}\quad (13.7)$$ |
emission probability | depends on your model | $$p\left(\mathbf{x}_n\mid\mathbf{z}_n,\boldsymbol{\phi}\right) = \prod_{k=1}^Kp\left(\mathbf{x}_n\mid\phi_k\right)^{z_{nk}}\quad (13.9)$$ |
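The factorization \((13.10)\) also describes the generative process: draw \(\mathbf{z}_1\) from \(\boldsymbol{\pi}\), propagate the chain with \(\mathbf{A}\), and emit each \(\mathbf{x}_n\) from the active state's emission distribution. A minimal NumPy sketch of this process (the two-state Bernoulli parameters here are made up purely for illustration):

```python
# Sampling from an HMM according to (13.10), with made-up parameters.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])            # initial-state probabilities (13.8)
A  = np.array([[0.9, 0.1],
               [0.2, 0.8]])          # transition probabilities (13.7)
mu = np.array([0.1, 0.7])            # per-state Bernoulli emission means (13.9)

N = 10
z = np.empty(N, dtype=int)
x = np.empty(N, dtype=int)
z[0] = rng.choice(2, p=pi)                  # z_1 ~ p(z_1 | pi)
x[0] = rng.random() < mu[z[0]]              # x_1 ~ p(x_1 | z_1, phi)
for n in range(1, N):
    z[n] = rng.choice(2, p=A[z[n - 1]])     # z_n ~ p(z_n | z_{n-1}, A)
    x[n] = rng.random() < mu[z[n]]          # x_n ~ p(x_n | z_n, phi)
print(z, x)
```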
If we have observed a data set \(\mathbf{X} = \left\{\mathbf{x}_1,\ldots,\mathbf{x}_N\right\}\), we can determine the parameters of an HMM using maximum likelihood. The likelihood function is obtained by marginalizing the joint distribution \((13.10)\) over the latent variables

$$p\left(\mathbf{X} \mid \boldsymbol{\theta}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}\right)\qquad (13.11)$$

However, the summation \(\sum_{\mathbf{Z}}\) does not factorize over the individual \(\mathbf{z}_n\) (they are coupled through the Markov chain), so direct maximization leads to complex expressions, and we turn instead to the expectation maximization (EM) algorithm.
The EM algorithm starts with some initial selection for the model parameters, which we denote by \(\boldsymbol{\theta}^{\text{old}}\).
In the E step, we take these parameter values and find the posterior distribution of the latent variables \(p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right)\). We then use this posterior distribution to evaluate the expectation of the logarithm of the complete-data likelihood function, as a function of the parameters \(\boldsymbol{\theta}\), to give the function \(Q\left(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}}\right)\) defined by
$$ Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) \ln p\left(\mathbf{X},\mathbf{Z} \mid \boldsymbol{\theta}\right)\qquad (13.12) $$

At this point, it is convenient to introduce some notation. We shall use \(\gamma(\mathbf{z}_n)\) to denote the marginal posterior distribution of a latent variable \(\mathbf{z}_n\), and \(\xi\left(\mathbf{z}_{n-1}, \mathbf{z}_{n}\right)\) to denote the joint posterior distribution of two successive latent variables, so that

$$ \begin{aligned} \gamma\left(\mathbf{z}_{n}\right) &=p\left(\mathbf{z}_{n} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) &(13.13)\\ \xi\left(\mathbf{z}_{n-1}, \mathbf{z}_{n}\right) &=p\left(\mathbf{z}_{n-1}, \mathbf{z}_{n} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}\right) &(13.14) \end{aligned} $$

If we substitute the joint distribution \(p\left(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta}\right)\) given by \((13.10)\) into \((13.12)\), and make use of the definitions of \(\gamma\) and \(\xi\), we obtain

$$ \begin{aligned} Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)=& \sum_{k=1}^{K} \gamma\left(z_{1 k}\right) \ln \pi_{k}+\sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi\left(z_{n-1, j}, z_{n k}\right) \ln A_{j k} \\ &+\sum_{n=1}^{N} \sum_{k=1}^{K}\gamma\left(z_{n k}\right) \ln p\left(\mathbf{x}_{n} \mid \boldsymbol{\phi}_{k}\right) \end{aligned}\qquad (13.17) $$

The goal of the E step is to evaluate the quantities \(\gamma\left(\mathbf{z}_n\right)\) and \(\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right)\) efficiently (→ forward-backward algorithm).

In the M step, we maximize \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) with respect to the parameters \(\boldsymbol{\theta} = \left\{\boldsymbol{\pi},\mathbf{A},\boldsymbol{\phi}\right\}\), in which we treat \(\gamma\left(\mathbf{z}_n\right)\) and \(\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right)\) as constant. Maximization with respect to \(\boldsymbol{\pi}\) and \(\mathbf{A}\) is easily achieved using appropriate Lagrange multipliers, with the results
$$ \begin{aligned} \pi_{k}&= \frac{\gamma\left(z_{1 k}\right)}{\sum_{j=1}^{K} \gamma\left(z_{1 j}\right)} & (13.18)\\ A_{j k}&= \frac{\sum_{n=2}^{N} \xi\left(z_{n-1, j}, z_{n k}\right)}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi\left(z_{n-1, j}, z_{n l}\right)} & (13.19) \end{aligned} $$

To maximize \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) with respect to \(\phi_k\), we notice that only the final term in \((13.17)\) depends on \(\phi_k\), and that the quantities \(\gamma\left(z_{nk}\right)\) play the role of responsibilities. If the parameters \(\phi_k\) are independent for the different components, then this term decouples into a sum of terms, one for each value of \(k\), each of which can be maximized independently. We are then simply maximizing the weighted log likelihood function for the emission density \(p\left(\mathbf{x}\mid\phi_k\right)\) with weights \(\gamma\left(z_{nk}\right)\).
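For intuition, here is a minimal NumPy sketch of a scaled forward-backward E step together with the \(\boldsymbol{\pi}\) and \(\mathbf{A}\) updates \((13.18)\) and \((13.19)\); it is a plain reimplementation of these formulas under the usual scaling trick, not kerasy's internal code:

```python
import numpy as np

def e_step(pi, A, B):
    """Scaled forward-backward pass.
    B[n, k] = p(x_n | z_nk = 1) is the emission likelihood table.
    Returns gamma (13.13), the summed xi statistics used in (13.19),
    and ln p(X) as a by-product of the scaling factors."""
    N, K = B.shape
    alpha = np.empty((N, K)); beta = np.empty((N, K)); c = np.empty(N)
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for n in range(1, N):                          # forward recursion
        alpha[n] = (alpha[n - 1] @ A) * B[n]
        c[n] = alpha[n].sum(); alpha[n] /= c[n]
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):                 # backward recursion
        beta[n] = (A @ (B[n + 1] * beta[n + 1])) / c[n + 1]
    gamma = alpha * beta                           # (13.13)
    xi_sum = np.zeros((K, K))
    for n in range(1, N):                          # sum_n xi(z_{n-1}, z_n)
        xi_sum += np.outer(alpha[n - 1], B[n] * beta[n] / c[n]) * A
    return gamma, xi_sum, np.log(c).sum()

def m_step_pi_A(gamma, xi_sum):
    pi_new = gamma[0] / gamma[0].sum()                   # (13.18)
    A_new = xi_sum / xi_sum.sum(axis=1, keepdims=True)   # (13.19)
    return pi_new, A_new
```

Scaling the forward variables keeps the recursions numerically stable over long sequences and yields \(\ln p(\mathbf{X})\) for free, which is why practical implementations work with scaled or log-domain quantities rather than raw \(\alpha\) and \(\beta\).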
## Bernoulli HMM

```python
class kerasy.ML.HMM.BernoulliHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)
```
For the case of discrete Bernoulli observed variables \(\mathbf{x}_n \in \{0,1\}^D\), the conditional distribution of the observations takes the form

$$p\left(\mathbf{x}\mid\boldsymbol{\phi}_k\right)=\prod_{i=1}^{D} \mu_{ik}^{x_{i}}\left(1-\mu_{ik}\right)^{1-x_{i}}$$

and the corresponding M-step equations are given by

$$\mu_{ik}=\frac{\sum_{n=1}^{N} \gamma\left(z_{nk}\right) x_{ni}}{\sum_{n=1}^{N} \gamma\left(z_{nk}\right)}$$
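As a sketch (not kerasy's internal code), this weighted-MLE update is a one-liner in NumPy:

```python
import numpy as np

def bernoulli_m_step(X, gamma):
    """Weighted MLE of the per-state Bernoulli means.
    X is (N, D) binary data, gamma is (N, K) responsibilities from the
    E step; returns mu with mu[k, i] = gamma-weighted mean of feature i."""
    # mu[k, i] = sum_n gamma(z_nk) x_ni / sum_n gamma(z_nk)
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```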
## Multinomial HMM

```python
class kerasy.ML.HMM.MultinomialHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)
```
For the case of discrete multinomial observed variables, where each \(\mathbf{x}_n\) uses a 1-of-\(D\) coding scheme, the conditional distribution of the observations takes the form

$$p\left(\mathbf{x}\mid\boldsymbol{\phi}_k\right)=\prod_{i=1}^{D} \phi_{ik}^{x_{i}},\qquad \sum_{i=1}^{D}\phi_{ik}=1$$

※ When the number of categories \(k\) is \(2\) and the number of trials \(n\) is \(1\), the multinomial distribution is the Bernoulli distribution.

and the corresponding M-step equations are given by

$$\phi_{ik}=\frac{\sum_{n=1}^{N} \gamma\left(z_{nk}\right) x_{ni}}{\sum_{n=1}^{N} \gamma\left(z_{nk}\right)}$$
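A hypothetical usage sketch follows. Only the constructor signature above is taken from this page; the input format and the `fit`/`predict` calls are assumptions in the style of scikit-learn, so check Kerasy.examples.HMM.ipynb for the actual interface:

```python
# Hypothetical usage sketch: only the constructor signature is documented
# above. The input format and fit/predict method names are assumptions
# (scikit-learn style); see Kerasy.examples.HMM.ipynb for the real API.
import numpy as np
from kerasy.ML.HMM import MultinomialHMM

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(100, 1))   # assumed: one symbol in {0,..,3} per step

model = MultinomialHMM(n_hstates=3, init="random", random_state=0)
model.fit(X)                            # EM training (assumed method name)
states = model.predict(X)               # Viterbi decoding (assumed method name)
print(states[:10])
```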
## Binomial HMM

```python
class kerasy.ML.HMM.BinomialHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)
```
For the case of discrete binomial observed variables, i.e. success counts \(x_n \in \{0,1,\ldots,M\}\) out of a known number of trials \(M\), the conditional distribution of the observations takes the form

$$p\left(x\mid\phi_k\right)=\binom{M}{x} p_{k}^{x}\left(1-p_{k}\right)^{M-x}$$

※ A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., \(n = 1\), the binomial distribution is a Bernoulli distribution.

and the corresponding M-step equations are given by

$$p_{k}=\frac{\sum_{n=1}^{N} \gamma\left(z_{nk}\right) x_{n}}{M\sum_{n=1}^{N} \gamma\left(z_{nk}\right)}$$
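To make the emission concrete, the likelihood table \(B[n][k]=p(x_n\mid z_{nk}=1)\) that forward-backward and Viterbi consume can be evaluated with SciPy; a small sketch with made-up parameters and an assumed known trial count \(M\):

```python
# Sketch of evaluating the binomial emission likelihood table
# B[n][k] = p(x_n | z_nk = 1), assuming a known trial count M; the
# per-state success probabilities are made-up values for illustration.
import numpy as np
from scipy.stats import binom

M = 10                                  # trials per observation (assumed known)
p = np.array([0.2, 0.5, 0.8])           # success probability of each hidden state
x = np.array([1, 4, 9, 7, 2])           # observed success counts (N = 5)

log_B = binom.logpmf(x[:, None], M, p[None, :])   # shape (N, K) = (5, 3)
print(log_B.round(2))
```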
## Gaussian HMM

```python
class kerasy.ML.HMM.GaussianHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="itmc", random_state=None,
    covariance_type="diag", min_covariance=1e-3)
```
For Gaussian emission densities we have \(p(\mathbf{x}\mid\boldsymbol{\phi}_k) = \mathcal{N}\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\right)\), and maximization of the function \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) then gives

$$ \begin{aligned} \boldsymbol{\mu}_{k}&=\frac{\sum_{n=1}^{N} \gamma\left(z_{n k}\right) \mathbf{x}_{n}}{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)} &(13.20)\\ \boldsymbol{\Sigma}_{k}&=\frac{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k}\right)^{\mathrm{T}}}{\sum_{n=1}^{N} \gamma\left(z_{n k}\right)} &(13.21) \end{aligned} $$
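A minimal NumPy sketch of these two updates (a plain implementation of \((13.20)\) and \((13.21)\), not kerasy's internal code):

```python
import numpy as np

def gaussian_m_step(X, gamma):
    """Weighted MLE of the per-state Gaussian parameters, (13.20)-(13.21).
    X is (N, D) data, gamma is (N, K) responsibilities; returns full
    covariance matrices (restricting them per `covariance_type` is a
    separate step)."""
    Nk = gamma.sum(axis=0)                       # effective counts per state
    mu = (gamma.T @ X) / Nk[:, None]             # (13.20)
    K, D = mu.shape
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # (13.21)
    return mu, Sigma
```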
### Covariance type

#### Definition
In the following, `nh` denotes the number of hidden states and `nf` the number of features.

Name | Description | Example | Parameter size |
---|---|---|---|
spherical | Each state uses a single variance value that applies to all features. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma^{(k)} & & & 0 \\ & \sigma^{(k)} & & \\ & & \ddots & \\ 0 & & & \sigma^{(k)} \end{array}\right)$$ | nh * 1 |
diag | Each state uses a diagonal covariance matrix. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma^{(k)}_{1} & & & 0 \\ & \sigma^{(k)}_{2} & & \\ & & \ddots & \\ 0 & & & \sigma^{(k)}_{\text{nf}} \end{array}\right)$$ | nh * nf |
full | Each state uses a full (i.e. unrestricted) covariance matrix. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma^{(k)}_{1} & \sigma^{(k)}_{2} & \cdots & \sigma^{(k)}_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} \\ \sigma^{(k)}_{2} & \sigma^{(k)}_{3} & \cdots & \sigma^{(k)}_{\frac{\left(\text{nf}-2\right)\cdot\left(\text{nf}-1\right)}{2} + 1}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^{(k)}_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} & & & \sigma^{(k)}_{\frac{\text{nf}\ast\left(\text{nf}+1\right)}{2}} \end{array}\right)$$ | nh * nf * (nf + 1) // 2 |
tied | All states use the same full covariance matrix. | $$\boldsymbol{\Sigma_k} = \left(\begin{array}{cccc} \sigma_{1} & \sigma_{2} & \cdots & \sigma_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} \\ \sigma_{2} & \sigma_{3} & \cdots & \sigma_{\frac{\left(\text{nf}-2\right)\cdot\left(\text{nf}-1\right)}{2} + 1}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} & & & \sigma_{\frac{\text{nf}\ast\left(\text{nf}+1\right)}{2}} \end{array}\right)$$ | nf * (nf + 1) // 2 |
#### Conversion function
The conversion functions are defined in kerasy/utils/np_utils.py.
```python
def compress_based_on_covariance_type_from_tied_shape(tied_cv, covariance_type, n_gaussian):
    """
    Create all the covariance matrices
    from a given template.
    """

def decompress_based_on_covariance_type(covars, covariance_type='full', n_gaussian=1, n_features=1):
    """
    Create the correct shape of covariance matrix
    from each feature based on the covariance type.
    """
```
## Gaussian Mixture HMM

```python
class kerasy.ML.HMM.GaussianMixtureHMM()
```
## Correspondence between code and formula
code | index | formula |
---|---|---|
log_cond_prob | $$[n][k] = p\left(\mathbf{x}_n\mid z_{nk}\right)$$ | $$p\left(\mathbf{X}\mid\mathbf{Z}\right)$$ |
log_prob | constant | $$\begin{aligned}p\left(\mathbf{X}\right) &=\sum_{\mathbf{z}_n}\alpha(\mathbf{z}_n)\beta(\mathbf{z}_n)\quad \text{for all $\mathbf{z}_n$}\\&=\sum_{\mathbf{z}_N}\alpha(\mathbf{z}_N)\end{aligned}$$ |
log_alpha | $$[n][k] = \alpha\left(z_{nk}\right)$$ | $$\alpha\left(\mathbf{z}_n\right) \equiv p\left(\mathbf{x}_1,\ldots,\mathbf{x}_n,\mathbf{z}_n\right)$$ |
log_beta | $$[n][k] = \beta\left(z_{nk}\right)$$ | $$\beta\left(\mathbf{z}_n\right)\equiv p\left(\mathbf{x}_{n+1},\ldots,\mathbf{x}_N\mid\mathbf{z}_n\right)$$ |
posterior_prob | $$[n][k] = \gamma\left(z_{nk}\right)$$ | $$\begin{aligned}\gamma\left(\mathbf{z}_n\right) &= p\left(\mathbf{z}_n\mid\mathbf{X}\right)= \frac{p\left(\mathbf{z}_n,\mathbf{X}\right)}{p\left(\mathbf{X}\right)} \\&= \frac{\alpha\left(\mathbf{z}_n\right)\beta\left(\mathbf{z}_n\right)}{p\left(\mathbf{X}\right)}\end{aligned}$$ |
log_xi_sum | $$[k][j]= \sum_{n=2}^N\xi\left(z_{n-1,k},z_{n,j}\right)$$ | $$\begin{aligned}\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right) &= p\left(\mathbf{z}_{n-1},\mathbf{z}_n\mid\mathbf{X}\right) \\&= \frac{\alpha\left(\mathbf{z}_{n-1}\right)p\left(\mathbf{x}_n\mid\mathbf{z}_n\right)p\left(\mathbf{z}_n\mid\mathbf{z}_{n-1}\right)\beta\left(\mathbf{z}_n\right)}{p\left(\mathbf{X}\right)}\end{aligned}$$ |
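As a concrete illustration of the first three rows, a log-domain forward recursion (a sketch under the table's conventions, not kerasy's code) computes `log_alpha` and obtains `log_prob` as a logsumexp over its last row:

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_pi, log_A, log_cond_prob):
    """Log-domain forward recursion in the table's notation:
    log_alpha[n][k] = ln alpha(z_nk), log_prob = ln p(X)."""
    N, K = log_cond_prob.shape
    log_alpha = np.empty((N, K))
    log_alpha[0] = log_pi + log_cond_prob[0]
    for n in range(1, N):
        # ln alpha(z_nk) = ln p(x_n|z_nk) + ln sum_j exp(ln alpha(z_{n-1,j}) + ln A_jk)
        log_alpha[n] = logsumexp(log_alpha[n - 1][:, None] + log_A, axis=0) \
                       + log_cond_prob[n]
    log_prob = logsumexp(log_alpha[-1])    # p(X) = sum_{z_N} alpha(z_N)
    return log_alpha, log_prob
```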
## Reference
- Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. (Chapter 13, Sequential Data; the equation numbers above follow this book.)