HMM

Notebook

Example Notebook: Kerasy.examples.HMM.ipynb

The hidden Markov model (HMM) is a powerful statistical method for characterizing the observed data samples of a discrete-time series. Its joint probability distribution over both latent and observed variables is given by

$$ p(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})=p\left(\mathbf{z}_{1} | \boldsymbol{\pi}\right)\left[\prod_{n=2}^{N} p\left(\mathbf{z}_{n} | \mathbf{z}_{n-1}, \mathbf{A}\right)\right] \prod_{m=1}^{N} p\left(\mathbf{x}_{m} | \mathbf{z}_{m}, \boldsymbol{\phi}\right)\qquad (13.10) $$

where \(\mathbf{Z}=\left\{\mathbf{z}_1,\ldots,\mathbf{z}_N\right\}\) are the discrete latent variables, \(\mathbf{X}=\left\{\mathbf{x}_1,\ldots,\mathbf{x}_N\right\}\) are the corresponding observations, and \(\boldsymbol{\theta} = \left\{\boldsymbol{\pi},\mathbf{A},\boldsymbol{\phi}\right\}\) denotes the set of parameters governing the model.

The three factors in \((13.10)\) are specified as follows.

initial state: \(\pi_{k} \equiv p\left(z_{1 k}=1\right)\)

$$p\left(\mathbf{z}_{1} \mid \boldsymbol{\pi}\right)=\prod_{k=1}^{K} \pi_{k}^{z_{1 k}}\quad (13.8)$$

transition probability: \(A_{j k} \equiv p\left(z_{n k}=1\mid z_{n-1, j}=1\right)\)

$$p\left(\mathbf{z}_{n} \mid \mathbf{z}_{n-1}, \mathbf{A}\right)=\prod_{k=1}^{K} \prod_{j=1}^{K} A_{j k}^{z_{n-1, j} z_{n k}}\quad (13.7)$$

emission probability: depends on the emission model.

$$p\left(\mathbf{x}_n\mid\mathbf{z}_n,\boldsymbol{\phi}\right) = \prod_{k=1}^Kp\left(\mathbf{x}_n\mid\phi_k\right)^{z_{nk}}\quad (13.9)$$
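As a concrete illustration of the factorization \((13.10)\), the following sketch samples a sequence from an HMM with a categorical emission distribution. This is illustrative NumPy, not Kerasy code; the array names pi, A, and mu and the categorical emission are assumptions made for the example.

import numpy as np

def sample_hmm(pi, A, mu, N, random_state=None):
    """Draw (z_1..z_N, x_1..x_N) from the joint distribution (13.10).

    pi : (K,)   initial-state probabilities
    A  : (K, K) transition probabilities, A[j, k] = p(z_n = k | z_{n-1} = j)
    mu : (K, D) categorical emission probabilities per hidden state (assumption)
    """
    rng = np.random.default_rng(random_state)
    K, D = mu.shape
    z = np.empty(N, dtype=int)
    x = np.empty(N, dtype=int)
    z[0] = rng.choice(K, p=pi)               # z_1 ~ p(z_1 | pi)
    x[0] = rng.choice(D, p=mu[z[0]])         # x_1 ~ p(x_1 | z_1, phi)
    for n in range(1, N):
        z[n] = rng.choice(K, p=A[z[n - 1]])  # z_n ~ p(z_n | z_{n-1}, A)
        x[n] = rng.choice(D, p=mu[z[n]])     # x_n ~ p(x_n | z_n, phi)
    return z, x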

If we have observed a data set \(\mathbf{X} = \left\{x_1,\ldots,x_N\right\}\), we can determine the parameters of an HMM using maximum likelihood. The likelihood function is given by

$$ p(\mathbf{X} | \boldsymbol{\theta})=\sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} | \boldsymbol{\theta})\qquad (13.11) $$

However, the summation \(\sum_{\mathbf{Z}}\) contains \(K^N\) terms and does not factorize over \(n\), so direct maximization of the likelihood is intractable and we turn to the expectation-maximization (EM) algorithm.

The EM algorithm starts with some initial selection for the model parameters, which we denote by \(\boldsymbol{\theta}^{\text{old}}\).

In the E step, we take these parameter values and find the posterior distribution of the latent variables \(p\left(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right)\). We then use this posterior distribution to evaluate the expectation of the logarithm of the complete-data likelihood function, as a function of the parameters \(\boldsymbol{\theta}\), to give the function \(Q\left(\boldsymbol{\theta},\boldsymbol{\theta}^{\text{old}}\right)\) defined by

$$ Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text {old }}\right)=\sum_{\mathbf{Z}} p\left(\mathbf{Z} | \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) \ln p\left(\mathbf{X},\mathbf{Z} | \boldsymbol{\theta}\right)\qquad (13.12) $$
At this point, it is convenient to introduce some notation. We shall use \(\gamma(\mathbf{z}_n)\) to denote the marginal posterior distribution of a latent variable \(\mathbf{z}_n\), and \(\xi\left(z_{n-1, j}, z_{n k}\right)\) to denote the joint posterior distribution of two successive latent variables, so that
$$ \begin{aligned} \gamma\left(\mathbf{z}_{n}\right) &=p\left(\mathbf{z}_{n} | \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) &(13.13)\\ \xi\left(\mathbf{z}_{n-1}, \mathbf{z}_{n}\right) &=p\left(\mathbf{z}_{n-1}, \mathbf{z}_{n} | \mathbf{X}, \boldsymbol{\theta}^{\text {old }}\right) &(13.14) \end{aligned} $$
If we substitute the joint distribution \(p\left(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta}\right)\) given by \((13.10)\) into \((13.12)\), and make use of the definition of \(\gamma\) and \(\xi\), we obtain
$$ \begin{aligned} Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)=& \sum_{k=1}^{K} \gamma\left(z_{1 k}\right) \ln \pi_{k}+\sum_{n=2}^{N} \sum_{j=1}^{K} \sum_{k=1}^{K} \xi\left(z_{n-1, j}, z_{n k}\right) \ln A_{j k} \\ &+\sum_{n=1}^{N} \sum_{k=1}^{K}\gamma\left(z_{n k}\right) \ln p\left(\mathbf{x}_{n} \mid \boldsymbol{\phi}_{k}\right) \end{aligned}\qquad (13.17) $$
The goal of the E step will be to evaluate the quantities \(\gamma\left(\mathbf{z}_n\right)\) and \(\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right)\) efficiently. (→ forward-backward algorithm.)
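The following NumPy sketch shows one standard way to implement this E step with the scaled forward-backward recursions. It is an illustration under assumed array names (pi, A, B), not Kerasy's internal implementation.

import numpy as np

def forward_backward(pi, A, B):
    """Scaled forward-backward pass.

    pi : (K,)   initial-state probabilities
    A  : (K, K) transition matrix, A[j, k] = p(z_n = k | z_{n-1} = j)
    B  : (N, K) emission likelihoods, B[n, k] = p(x_n | z_n = k)
    Returns gamma (N, K), xi_sum (K, K), and the log likelihood ln p(X).
    """
    N, K = B.shape
    alpha = np.empty((N, K))
    beta = np.empty((N, K))
    c = np.empty(N)                                   # per-step scaling factors

    # forward recursion: alpha(z_n) = p(x_n|z_n) * sum_j alpha(z_{n-1,j}) A_{jk}
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for n in range(1, N):
        alpha[n] = B[n] * (alpha[n - 1] @ A)
        c[n] = alpha[n].sum()
        alpha[n] /= c[n]

    # backward recursion: beta(z_n) = sum_k A_{jk} p(x_{n+1}|z_{n+1,k}) beta(z_{n+1,k})
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = (A @ (B[n + 1] * beta[n + 1])) / c[n + 1]

    gamma = alpha * beta                              # gamma(z_nk) = p(z_nk = 1 | X)
    # xi summed over n, which is exactly the statistic needed in (13.19)
    xi_sum = np.zeros((K, K))
    for n in range(1, N):
        xi_sum += (alpha[n - 1][:, None] * A * (B[n] * beta[n])[None, :]) / c[n]
    return gamma, xi_sum, np.log(c).sum()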

In the M step, we maximize \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) with respect to the parameters \(\boldsymbol{\theta} = \left\{\boldsymbol{\pi},\mathbf{A},\boldsymbol{\phi}\right\}\) in which we treat \(\gamma\left(\mathbf{z}_n\right)\) and \(\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right)\) as constant. Maximization with respect to \(\boldsymbol{\pi}\) and \(\mathbf{A}\) is easily achieved using appropriate Lagrange multipliers with the results

$$ \begin{aligned} \pi_{k}&= \frac{\gamma\left(z_{1 k}\right)}{\sum_{j=1}^{K} \gamma\left(z_{1 j}\right)} & (13.18)\\ A_{j k}&= \frac{\sum_{n=2}^{N} \xi\left(z_{n-1, j}, z_{n k}\right)}{\sum_{l=1}^{K} \sum_{n=2}^{N} \xi\left(z_{n-1, j}, z_{n l}\right)} & (13.19)\\ \end{aligned} $$
To maximize \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) with respect to \(\phi_k\), we notice that only the final term in \((13.17)\) depends on \(\phi_k\), and that the quantities \(\gamma\left(z_{nk}\right)\) play the role of responsibilities. If the parameters \(\phi_k\) are independent for the different components, then this term decouples into a sum of terms, one for each value of \(k\), each of which can be maximized independently. We are then simply maximizing the weighted log likelihood function for the emission density \(p\left(\mathbf{x}\mid\phi_k\right)\) with weights \(\gamma\left(z_{nk}\right)\).
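In code, given the E-step statistics gamma and xi_sum (for instance from the forward-backward sketch above), the updates \((13.18)\) and \((13.19)\) are simple normalizations. Again a sketch, not Kerasy's internals.

def m_step_pi_A(gamma, xi_sum):
    """Closed-form updates (13.18) and (13.19) given the E-step statistics."""
    pi_new = gamma[0] / gamma[0].sum()                  # (13.18)
    A_new = xi_sum / xi_sum.sum(axis=1, keepdims=True)  # (13.19), row-normalize
    return pi_new, A_new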

Bernoulli HMM

class kerasy.ML.HMM.BernoulliHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)

For the case of discrete Bernoulli observed variables, the conditional distribution of the observations takes the form

$$p\left(x|\mathbf{z}\right) = \prod_{k=1}^{K}\left(\theta_k^x\left(1-\theta_k\right)^{1-x}\right)^{z_k}$$

and the corresponding M-step equations are given by

$$\theta_k = \frac{\sum_{n=1}^N\gamma\left(z_{nk}\right)x_n}{\sum_{n=1}^N\gamma\left(z_{nk}\right)}$$
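A minimal usage sketch for this class is given below. The constructor arguments are those documented above; the fit and predict calls assume a scikit-learn-style interface and may differ from the actual Kerasy API, so consult the example notebook (Kerasy.examples.HMM.ipynb) for the real method names.

import numpy as np
from kerasy.ML.HMM import BernoulliHMM

# A binary sequence, e.g. a coin whose bias switches between hidden regimes.
X = np.array([0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1])

model = BernoulliHMM(n_hstates=2, init="random", random_state=0)
model.fit(X)               # assumed fit(X): EM (Baum-Welch) training
states = model.predict(X)  # assumed predict(X): Viterbi decoding (algorithm="viterbi")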

Multinomial HMM

class kerasy.ML.HMM.MultinomialHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)

For the case of discrete multinomial observed variables, the conditional distribution of the observations takes the form

$$p\left(\mathbf{x}|\mathbf{z}\right) = \prod_{i=1}^D\prod_{k=1}^{K}\mu_{ik}^{x_iz_k}\quad (13.22)$$

When \(D=2\), the multinomial distribution reduces to the Bernoulli distribution.

and the corresponding M-step equations are given by

$$\mu_{ik} = \frac{\sum_{n=1}^N\gamma\left(z_{nk}\right)x_{ni}}{\sum_{n=1}^N\gamma\left(z_{nk}\right)}\qquad (13.23)$$
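With the observations one-of-D encoded, \((13.23)\) becomes a weighted average of the encodings. A sketch with assumed array names, not Kerasy's implementation:

import numpy as np

def m_step_multinomial(gamma, X_onehot):
    """Update (13.23): mu[i, k] = sum_n gamma[n, k] x[n, i] / sum_n gamma[n, k].

    gamma    : (N, K) posterior state responsibilities
    X_onehot : (N, D) one-of-D encoded observations
    """
    mu = X_onehot.T @ gamma          # (D, K) numerators
    return mu / gamma.sum(axis=0)    # divide column k by sum_n gamma[n, k]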

Binomial HMM

class kerasy.ML.HMM.BinomialHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="ite", random_state=None)

For the case of discrete binomial observed variables, the conditional distribution of the observations takes the form

$$p\left(x|n,\mathbf{z}\right) = \prod_{k=1}^{K}\left(\begin{array}{l}n \\ x\end{array}\right) \left(\theta_k^x\left(1-\theta_k\right)^{n-x}\right)^{z_k}$$

A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment and a sequence of outcomes is called a Bernoulli process; for a single trial, i.e., n = 1, the binomial distribution is a Bernoulli distribution.

and the corresponding M-step equations are given by

$$\theta_k = \frac{\sum_{n=1}^N\gamma\left(z_{nk}\right)x_n}{\sum_{n=1}^N\gamma\left(z_{nk}\right)n_n}$$

where \(n_n\) denotes the number of trials for the \(n\)-th observation.
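The only difference from the Bernoulli case is that the denominator is weighted by the per-observation trial counts. A sketch with assumed array names, not Kerasy's implementation:

import numpy as np

def m_step_binomial(gamma, x, n_trials):
    """theta_k = sum_n gamma[n, k] x[n] / sum_n gamma[n, k] n_trials[n].

    gamma : (N, K) responsibilities, x : (N,) success counts, n_trials : (N,) trial counts
    """
    return (gamma * x[:, None]).sum(axis=0) / (gamma * n_trials[:, None]).sum(axis=0)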

Gaussian HMM

class kerasy.ML.HMM.GaussianHMM(
    n_hstates=3, init="random", algorithm="viterbi",
    up_params="itmc", random_state=None,
    covariance_type="diag", min_covariance=1e-3)

For the case of Gaussian emission densities we have \(p(\mathbf{x}\mid\boldsymbol{\phi}_k) = \mathcal{N}\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k\right)\), and maximization of the function \(Q\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{old}}\right)\) then gives

$$ \begin{aligned} \boldsymbol{\mu}_k &= \frac{\sum_{n=1}^N\gamma\left(z_{nk}\right)\mathbf{x}_n}{\sum_{n=1}^{N}\gamma\left(z_{nk}\right)} &(13.20)\\ \boldsymbol{\Sigma}_k &=\frac{\sum_{n=1}^N\gamma\left(z_{nk}\right)\left(\mathbf{x}_n-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n-\boldsymbol{\mu}_k\right)^{\mathrm{T}}}{\sum_{n=1}^N\gamma\left(z_{nk}\right)} &(13.21) \end{aligned} $$
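In NumPy, \((13.20)\) and \((13.21)\) are weighted means and weighted scatter matrices. A sketch for the full-covariance case with assumed array names, not Kerasy's implementation:

import numpy as np

def m_step_gaussian(gamma, X):
    """Updates (13.20) and (13.21) for full covariances.

    gamma : (N, K) responsibilities, X : (N, D) observations
    """
    Nk = gamma.sum(axis=0)                                       # (K,) effective counts
    mu = (gamma.T @ X) / Nk[:, None]                             # (K, D), eq. (13.20)
    K, D = mu.shape
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]                                         # (N, D)
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # eq. (13.21)
    return mu, Sigma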

Covariance type

In the following, nh denotes the number of hidden states (n_hstates) and nf the number of features; the parameter size is the number of variance parameters stored for the covariances.

spherical: Each state uses a single variance value that applies to all features. Parameter size: nh.

$$\boldsymbol{\Sigma}_k = \left(\begin{array}{cccc} \sigma^{(k)} & & & 0 \\ & \sigma^{(k)} & & \\ & & \ddots & \\ 0 & & & \sigma^{(k)} \end{array}\right)$$

diag: Each state uses a diagonal covariance matrix. Parameter size: nh * nf.

$$\boldsymbol{\Sigma}_k = \left(\begin{array}{cccc} \sigma^{(k)}_{1} & & & 0 \\ & \sigma^{(k)}_{2} & & \\ & & \ddots & \\ 0 & & & \sigma^{(k)}_{\text{nf}} \end{array}\right)$$

full: Each state uses a full (i.e. unrestricted) covariance matrix. Parameter size: nh * nf * (nf + 1) // 2.

$$\boldsymbol{\Sigma}_k = \left(\begin{array}{cccc} \sigma^{(k)}_{1} & \sigma^{(k)}_{2} & \cdots & \sigma^{(k)}_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} \\ \sigma^{(k)}_{2} & \sigma^{(k)}_{3} & \cdots & \sigma^{(k)}_{\frac{\left(\text{nf}-2\right)\cdot\left(\text{nf}-1\right)}{2} + 1}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma^{(k)}_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} & & & \sigma^{(k)}_{\frac{\text{nf}\ast\left(\text{nf}+1\right)}{2}} \end{array}\right)$$

tied: All states use the same full covariance matrix. Parameter size: nf * (nf + 1) // 2.

$$\boldsymbol{\Sigma}_k = \left(\begin{array}{cccc} \sigma_{1} & \sigma_{2} & \cdots & \sigma_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} \\ \sigma_{2} & \sigma_{3} & \cdots & \sigma_{\frac{\left(\text{nf}-2\right)\cdot\left(\text{nf}-1\right)}{2} + 1}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{\frac{\left(\text{nf}-1\right)\cdot\text{nf}}{2} + 1} & & & \sigma_{\frac{\text{nf}\ast\left(\text{nf}+1\right)}{2}} \end{array}\right)$$
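These parameter sizes can be summarized by a tiny helper (illustrative only, not part of Kerasy; nh and nf are as defined above):

def n_covariance_params(covariance_type, nh, nf):
    """Number of stored variance parameters for each covariance_type."""
    return {
        "spherical": nh,                       # one variance per state
        "diag":      nh * nf,                  # one variance per state and feature
        "full":      nh * nf * (nf + 1) // 2,  # one packed symmetric matrix per state
        "tied":      nf * (nf + 1) // 2,       # a single shared packed symmetric matrix
    }[covariance_type]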
Conversion functions

The conversion functions are defined in kerasy/utils/np_utils.py.

def compress_based_on_covariance_type_from_tied_shape(tied_cv, covariance_type, n_gaussian):
    """
    Create all the covariance matrices
    from a given template.
    """

def decompress_based_on_covariance_type(covars, covariance_type='full', n_gaussian=1, n_features=1):
    """
    Create the correct shape of covariance matrices
    for each feature based on the covariance type.
    """

Gaussian Mixture HMM

class kerasy.ML.HMM.GaussianMixtureHMM()

Correspondence between code and formula

Each quantity in the code corresponds to a formula as follows; the index column shows what each array element stores, and all of these arrays are kept in log space (hence the log_ prefixes).

log_cond_prob: \([n][k] = p\left(\mathbf{x}_n\mid z_{nk}\right)\)

$$p\left(\mathbf{X}\mid\mathbf{Z}\right)$$

log_prob: a scalar constant.

$$\begin{aligned}p\left(\mathbf{X}\right) &=\sum_{\mathbf{z}_n}\alpha(\mathbf{z}_n)\beta(\mathbf{z}_n)\quad \text{for any $n$}\\&=\sum_{\mathbf{z}_N}\alpha(\mathbf{z}_N)\end{aligned}$$

log_alpha: \([n][k] = \alpha\left(z_{nk}\right)\)

$$\alpha\left(\mathbf{z}_n\right) \equiv p\left(\mathbf{x}_1,\ldots,\mathbf{x}_n,\mathbf{z}_n\right)$$

log_beta: \([n][k] = \beta\left(z_{nk}\right)\)

$$\beta\left(\mathbf{z}_n\right)\equiv p\left(\mathbf{x}_{n+1},\ldots,\mathbf{x}_N\mid\mathbf{z}_n\right)$$

posterior_prob: \([n][k] = \gamma\left(z_{nk}\right)\)

$$\begin{aligned}\gamma\left(\mathbf{z}_n\right) &= p\left(\mathbf{z}_n\mid\mathbf{X}\right)= \frac{p\left(\mathbf{z}_n,\mathbf{X}\right)}{p\left(\mathbf{X}\right)} \\&= \frac{\alpha\left(\mathbf{z}_n\right)\beta\left(\mathbf{z}_n\right)}{p\left(\mathbf{X}\right)}\end{aligned}$$

log_xi_sum: \([k][j]= \sum_{n=2}^N\xi\left(z_{n-1,k},z_{n,j}\right)\)

$$\begin{aligned}\xi\left(\mathbf{z}_{n-1},\mathbf{z}_n\right) &= p\left(\mathbf{z}_{n-1},\mathbf{z}_n\mid\mathbf{X}\right) \\&= \frac{\alpha\left(\mathbf{z}_{n-1}\right)p\left(\mathbf{x}_n\mid\mathbf{z}_n\right)p\left(\mathbf{z}_n\mid\mathbf{z}_{n-1}\right)\beta\left(\mathbf{z}_n\right)}{p\left(\mathbf{X}\right)}\end{aligned}$$
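Working in log space avoids numerical underflow for long sequences. For example, posterior_prob and log_prob follow from log_alpha and log_beta as in the sketch below, which mirrors the names above but is an illustration rather than the library's actual code:

import numpy as np
from scipy.special import logsumexp

def posteriors_from_log_messages(log_alpha, log_beta):
    """Compute posterior_prob and log_prob from the forward/backward log messages.

    log_alpha, log_beta : (N, K) arrays of ln alpha(z_nk) and ln beta(z_nk)
    """
    log_prob = logsumexp(log_alpha[-1])          # ln p(X) = ln sum_{z_N} alpha(z_N)
    log_gamma = log_alpha + log_beta - log_prob  # ln gamma = ln(alpha * beta / p(X))
    return np.exp(log_gamma), log_prob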