Suppose we have some data $\mathbf x = \{x_1,\dots,x_n\}$, which we assume to be independently and identically drawn from a discrete random variable $X\sim p(\theta)$. The probability mass function of $X$ is $p(x|\theta)$, which yields the discrete probability $\textop{Pr}[X=x|\theta]$, i.e. the probability that a random draw from $X$ is equal to $x$, given model parameter $\theta$.
Most statistics texts define the likelihood $\mathcal L(\theta|\mathbf x)$ as the joint probability of the data being generated by $p(\theta)$. Because the data are assumed to be IID, this is simply the product of independent probabilities,
\[\mathcal{L}(\theta|\mathbf{x}) = p(x_1,\dots,x_n|\theta) = \prod_{i=1}^n p(x_i|\theta)= \textop{Pr}[X=x_1\cap\dots\cap X=x_n|\theta].\]Since this is a probability mass function, summing over all possible values $\mathcal{X}$ for each $x_1\dots x_n$ yields 1:
\[\sum_{x_1\in \mathcal{X}}\dots\sum_{x_n\in\mathcal{X}}\prod_{i=1}^n p(x_i|\theta)=1\]To fit our model to the data, we maximize the likelihood with respect to model parameters. This makes intuitive sense: a higher $\mathcal L$ means that our generative model $p(\theta)$ has a higher probability of generating $\mathbf x$, and is thus a better fit. Finding the optimal value $\hat\theta = \mathop{\arg\max}_\theta\mathcal{L}(\theta|\mathbf{x})$ yields the maximum joint probability of the data.
A simple example
As a simple example, let $X$ be a binary (Bernoulli) random variable (e.g., a coin) with model parameter $\theta\equiv f$, representing the *fairness* of the coin. Thus, $f\in[0,1]$ is the probability that a random draw from $X$ yields 1 (or heads, $H$), so $1-f$ is the probability that $X=0$ (or tails, $T$). Then our probability mass function is $$p(x|f)= f^x(1-f)^{1-x}\Rightarrow \begin{cases} p(1|f) = f\\ p(0|f) = 1-f \end{cases}.$$

In the case of the coin, suppose that we don't know whether our coin is rigged or not, and want to infer $f$ from a sequence of coin flips. We flip the coin 5 times, and see 2 heads and 3 tails: $\mathbf x = \{H,H,T,T,T\}$. Our likelihood, the probability of independently seeing 2 heads and 3 tails for a given value of $f$, is simply the product of probabilities $$\mathcal{L}(f|\mathbf x)\equiv\textop{Pr}[\{H,H,T,T,T\}|f]=p(H|f)^2p(T|f)^3=f^2(1-f)^3.$$

Plotting $\mathcal{L}(f|\mathbf x)$ reveals that it peaks at $f=2/5$. This makes intuitive sense: the probability of seeing 2 heads and 3 tails from 5 coin flips is maximized if the coin is rigged to come up heads $f=2/5=40\%$ of the time. Put another way, if we observe 2 heads out of 5 total coin flips, the empirical probability of getting heads is $40\%$. If we flip the coin more times (e.g., 10, 50, 100) and still get heads $40\%$ of the time, the corresponding likelihood functions peak at the same place, but are much more tightly concentrated. This is also intuitive: we become more confident that the coin is rigged the more times we flip it and observe a biased result.

As a digression, the "tightness" of the likelihood function is the core of Bayesian inference. Recall that a likelihood is normalized in terms of the data[^1], but not in terms of the model parameter. Normalizing by the model parameter $\theta$ (after optionally weighting by some prior distribution on $\theta$) yields the posterior distribution on $\theta$: $$p(\theta|\mathbf x) = \frac{\prod_{i=1}^n \left[p(x_i|\theta) \right]p(\theta)}{\int_\Theta d\theta\,\prod_{i=1}^n p(x_i|\theta)\, p(\theta)}.$$ In the case of our coin, we obtain the beta distribution: $$\begin{align*} p(f|n_H,n_T)&=\frac{f^{n_H}(1-f)^{n_T}}{\int_0^1 df\,f^{n_H}(1-f)^{n_T}}\\ &=\frac{f^{n_H}(1-f)^{n_T}}{\beta(n_H+1,n_T+1)}\\ &\equiv\textop{Beta}(f|n_H+1,n_T+1). \end{align*}$$

[^1]: For example, for two coin flips, we have four possible outcomes: $\mathcal{X}=\{\{H,H\},\{T,H\},\{H,T\},\{T,T\}\}$. Then $\sum_{\mathbf{x}\in\mathcal{X}}\mathcal{L}(f|\mathbf x)=1$.
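To make this concrete, here is a minimal numerical sketch of the coin example (not part of the original derivation; it assumes NumPy, and the variable names are purely illustrative). It evaluates $\mathcal{L}(f|\mathbf x)=f^2(1-f)^3$ on a grid, confirms that the peak sits at $f=0.4$, and shows the peak staying put but narrowing as the number of flips grows:

```python
import numpy as np

# Grid of candidate fairness values f in [0, 1].
f = np.linspace(0, 1, 1001)

# Likelihood of f given 2 heads and 3 tails: L(f | x) = f^2 (1 - f)^3.
n_heads, n_tails = 2, 3
likelihood = f**n_heads * (1 - f)**n_tails

# The maximizer is the empirical frequency of heads, 2/5 = 0.4.
print(f[np.argmax(likelihood)])  # ~0.4

# With more flips at the same 40% heads rate, the peak stays at 0.4 but narrows,
# reflecting growing confidence that the coin is rigged.
for n in (5, 10, 50, 100):
    L = f**(0.4 * n) * (1 - f)**(0.6 * n)
    half_width = (L > 0.5 * L.max()).mean()  # fraction of the grid above half maximum
    print(n, f[np.argmax(L)], half_width)    # peak location stays at ~0.4; width shrinks
```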
Continuous likelihoods

Now consider the case of a continuous random variable $Y\sim p(\phi)$, with probability density function $p(y|\phi)$. Because it is a density function, point evaluations at some specific value of $y$ are meaningless; rather, integrating $p(y|\phi)$ over $y\in[a,b]$ yields the probability that $Y$ is in the range $[a,b]$:
\[\textop{Pr}[Y\in[a,b]|\phi]=\int_a^b dy\,p(y|\phi).\]The likelihood of $\phi$ given some IID data $\mathbf y=\{y_1,\dots,y_n\}$ is identical to the likelihood in the discrete case, i.e. $\mathcal{L}(\phi|\mathbf y) = \prod_{i=1}^n p(y_i|\phi)$. This is the joint probability density of $\mathbf{y}$, i.e.
\[\int_{a_1}^{b_1} dy_1\dots\int_{a_n}^{b_n} dy_n \prod_{i=1}^n p(y_i|\phi) = \textop{Pr}[Y\in [a_1,b_1]\cap\dots\cap Y\in [a_n,b_n]|\phi],\]and of course, integrating over the full domain $\mathcal{Y}$ of $Y$ yields 1,
\[\int_\mathcal{Y} dy_1\dots\int_\mathcal{Y} dy_n \prod_{i=1}^n p(y_i|\phi) = 1.\]It is important to reiterate that this is a probability density, not a probability as in the discrete case, so $p(y|\phi)\ne \textop{Pr}[Y=y|\phi]$, and $p(y_1,\dots,y_n|\phi)\ne\textop{Pr}[Y=y_1\cap\dots\cap Y=y_n|\phi]$.
Nonetheless, we still use continuous likelihoods in exactly the same way we use discrete likelihoods. A higher continuous $\mathcal L$ means that our generative model better generates our data, despite the fact that it does not actually represent the joint probability of generating the observed data given the model; it is merely a joint probability density evaluated at a specific point, which is probabilistically meaningless.
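As an illustration of this usage (a sketch under assumed details, not part of the original text: it presumes NumPy/SciPy and picks a unit-variance Gaussian model purely for concreteness), the continuous likelihood is still just a product of density evaluations, and maximizing it over the model parameter recovers a sensible estimate:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.0, size=50)   # IID draws from a continuous variable

# Log of the continuous likelihood for a candidate mean mu (variance fixed at 1):
# log L(mu | y) = sum_i log p(y_i | mu), a sum of log *densities*, not probabilities.
def log_likelihood(mu, y):
    return np.sum(norm.logpdf(y, loc=mu, scale=1.0))

mu_grid = np.linspace(0.0, 6.0, 601)
log_L = np.array([log_likelihood(mu, y) for mu in mu_grid])

# The maximizer agrees (up to grid resolution) with the sample mean.
print(mu_grid[np.argmax(log_L)], y.mean())
```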
It is not intuitive why this should still work, and most introductory statistics texts gloss over this. To elucidate what a continuous likelihood actually means and rigorously justify its use, we need to make use of a special probability distribution: the Dirac delta distribution.
A Dirac delta distribution detour
The Dirac delta (Dδ) distribution is a continuous probability distribution whose probability density function is defined as
\[\delta(x) = \begin{cases} \infty & x = 0\\ 0 & x \ne 0 \end{cases},\]normalized such that
\[\int_{-\infty}^\infty dx\,\delta(x)=1.\]This satisfies all the Kolmogorov probability axioms and is thus a true probability distribution.
We can more concretely define the Dδ distribution as the limit as $\sigma\to0$ of a normal distribution with $\mu=0$:
\[\delta(x) = \lim_{\sigma\to 0}\mathcal{N}(x|0,\sigma)=\lim_{\sigma\to 0}\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{x^2}{2\sigma^2}\right)\]One useful property of the Dδ distribution is that if we recenter it at $y$ (i.e. $\delta'(x)=\delta(x-y)$), the expected value $E_{x\sim\delta'}[f(x)]$ will evaluate to $f(y)$:
\[E_{x\sim\delta'}[f(x)]=\int_{-\infty}^\infty dx\, f(x)\delta(x-y)=f(y).\]Now let’s take a mixture of Dδ distributions, each centered at $\mathbf{y}=y_1,\dots,y_n$:
\[M_\delta(x;\mathbf{y}) = n^{-1}\sum_{i=1}^n \delta(x-y_i).\]Note that this too is a true probability distribution, since $\int_{-\infty}^\infty dx\,M_\delta(x;\mathbf{y})=1$.
Then
\[\begin{align*} E_{x\sim M_\delta}[f(x)] &= \int_{-\infty}^\infty dx\,f(x)M_\delta(x;\mathbf y)\\ &=n^{-1}\sum_{i=1}^n\int_{-\infty}^\infty dx\,f(x)\delta(x-y_i)\\ &=n^{-1}\sum_{i=1}^n f(y_i). \end{align*}\]
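A quick numerical sanity check of this identity (a sketch, assuming NumPy; the narrow-Gaussian stand-in for $\delta$ follows the limit definition above, and the centers and test function are arbitrary):

```python
import numpy as np

# Approximate delta(x - y_i) by a very narrow normal, per the limit definition above.
def delta_approx(x, y, sigma=1e-3):
    return np.exp(-(x - y)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

y = np.array([0.5, 1.2, 2.0, 3.7])        # centers of the mixture M_delta
x = np.linspace(-1.0, 5.0, 1_200_001)     # fine grid for the integral
dx = x[1] - x[0]
f = np.sin                                # any smooth test function f(x)

# E_{x ~ M_delta}[f(x)] = int dx f(x) * n^{-1} sum_i delta(x - y_i)
M_delta = np.mean([delta_approx(x, yi) for yi in y], axis=0)
expectation = np.sum(f(x) * M_delta) * dx

print(expectation, np.mean(np.sin(y)))    # the two values agree closely
```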
So, what is a likelihood, really?

Taking the log of a likelihood function converts the product into a sum:
\[\log\mathcal{L}(\phi|\mathbf y) = \log\prod_{i=1}^n p(y_i|\phi)=\sum_{i=1}^n\log p(y_i|\phi).\]We can express this sum as an integral against a mixture of Dδ distributions:
\[\begin{align*} \sum_{i=1}^n\log p(y_i|\phi)&=\int_{-\infty}^\infty dx\,\log p(x|\phi)\sum_{i=1}^n \delta(x - y_i)\\ &=n\int_{-\infty}^\infty dx\,\log p(x|\phi)\,M_\delta(x;\mathbf{y})\\ &=n\,E_{x\sim M_\delta}[\log p(x|\phi)]\\ &\propto -H(M_\delta, p). \end{align*}\]In information theory, $H(M_\delta,p)\equiv-E_{x\sim M_\delta}[\log p(x|\phi)]$ is the cross entropy between $M_\delta$ and $p$. Recall that entropy $H(p)$ is the amount of information required to unambiguously identify a draw from random variable $X\sim p(\theta)$. In the discrete case, if $X$ can yield $\mathcal{X}=\{x_1,\dots,x_n\}$ possible values, the entropy is the information content $I(p(x_i|\theta)) = -\log p(x_i|\theta)$ of each possible value $x_i\in \mathcal{X}$, weighted by the probability of observing that particular $x_i$ and summed:
\[H(p) = -\sum_{i=1}^n p(x_i|\theta)\log_b p(x_i|\theta).\]Intuitively, the information content represents the minimal number of optimally posed $b$-ary questions one would have to ask to ascertain the value of a draw from $X$.
Cross entropy $H(p,q)$ measures the information content of a second random variable $Z\sim q(\eta)$ (with the same support as $X$), weighted by the probability of observing $x_i$ from $X$:
\[H(p,q) = -\sum_{i=1}^n p(x_i|\theta)\log q(x_i|\eta)\]In other words, it measures the amount of information required to identify a draw from $X\sim p$, but if the information content of each draw instead comes from a different probability distribution, $q$.
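As a small numerical illustration (a sketch assuming NumPy; the two distributions are arbitrary examples, not from the text), here are the entropy and cross entropy of two discrete distributions over the same support, in bits ($b=2$):

```python
import numpy as np

# Two discrete distributions over the same four-outcome support (arbitrary examples).
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

def entropy(p, b=2):
    # H(p) = -sum_i p_i log_b p_i
    return -np.sum(p * np.log(p) / np.log(b))

def cross_entropy(p, q, b=2):
    # H(p, q) = -sum_i p_i log_b q_i
    return -np.sum(p * np.log(q) / np.log(b))

print(entropy(p))           # 1.75 bits: optimal cost of identifying a draw from p
print(cross_entropy(p, q))  # 2.0 bits: cost when the code is built for q instead
# The cross entropy is never smaller than the entropy of p itself.
```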
In the continuous case, the sums are replaced by integrals:
\[H(p) = -\int_\mathcal{Y} dy\,p(y|\phi)\log p(y|\phi),\]and
\[H(p,q) = -\int_\mathcal{Y} dy\,p(y|\phi)\log q(y|\eta).\]In the case of our likelihood, we are trying to encode $M_\delta$, the empirical distribution of observed data, with our chosen generative model $p(\phi)$. As $\phi$ is tweaked, $p(\phi)$ becomes a better or worse approximation of $M_\delta$. Maximizing likelihood is equivalent to minimizing the cross entropy between the data distribution and the distribution prescribed by our generative model.
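To see this equivalence numerically (a sketch under assumed details, reusing the unit-variance Gaussian model from the earlier snippet and SciPy's optimizer), minimizing the empirical cross entropy $H(M_\delta, p)$ over the model parameter lands on the same estimate as maximizing the likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=100)   # observed data

# Cross entropy H(M_delta, p) between the empirical distribution and N(mu, 1):
# the expectation over the Dirac mixture is just the average over the data points,
# i.e. -1/n * log L(mu | y).
def cross_entropy(mu, y):
    log_p = -0.5 * (y - mu)**2 - 0.5 * np.log(2 * np.pi)
    return -np.mean(log_p)

result = minimize_scalar(lambda mu: cross_entropy(mu, y), bounds=(-10, 10), method="bounded")

# The cross-entropy minimizer coincides with the maximum-likelihood estimate,
# which for a unit-variance Gaussian mean is the sample average.
print(result.x, y.mean())
```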
While the likelihood of the parameter of the probability mass function for a discrete random variable is indeed the joint probability of observing the data under a given model, this is not the general definition of statistical likelihood! It is more rigorous to view it as the amount of information lost trying to approximately represent the observed data with a generative model.
A more generalized likelihood?
Formulating the likelihood as a cross entropy rather than a joint probability [density] lets us extend it beyond the typical use case of fitting some generative model $p(\phi)$ to a set of discrete outcomes $\mathbf x = \{x_1,\dots,x_n\}$ drawn from a random variable $X$. Suppose instead that drawing from $X$ did not yield discrete values, but rather another set of random variables $\mathbf{\Theta} = \{X_1\sim \pi(\theta_1),\dots,X_n\sim \pi(\theta_n)\}$, whose generative model is $\pi(\theta)$.
For example, a measurement instrument might return both an estimated mean value $\mu$ for a measurement, and an uncertainty $\sigma$ about that mean[^2]. Given a set of many outputs $\mathbf{\Theta} = \{(\mu_1,\sigma_1),\dots,(\mu_n,\sigma_n)\}$ from this instrument, how might we best fit an overall generative model for the entity being measured? We may wish to measure the same entity multiple times, and fit a single model across all measurements to decrease our measurement uncertainty. Or, we may know a priori that our entities come from multiple distinct populations, and we want to fit a mixture model to be able to distinguish them.
We can do this with the cross entropy formulation. Rather than expressing the log-likelihood as the cross entropy between the model we are trying to fit $p(\phi)$ and a mixture of Dδ distributions representing the discrete data distribution, i.e.
\[\begin{align*} \log\mathcal{L}(\phi|\mathbf{y}) = \sum_{i=1}^n\log p(y_i|\phi)&=\int_{-\infty}^\infty dx\,\log p(x|\phi)\times \sum_{i=1}^n \delta(x - y_i)\\ &\propto -H(M_\delta,p), \end{align*}\]we can replace the Dδ mixture with a mixture of density functions $M_\pi(x;\mathbf{\Theta}) = n^{-1}\sum_{i=1}^n \pi(x|\theta_i)$ corresponding to the generative model $\pi(\theta)$ of our instrument:
\[\log\mathcal{L}(\phi|\mathbf{\Theta}) \propto -H(M_\pi,p) = \int_\mathcal{X} dx\,\log p(x|\phi)\times n^{-1}\sum_{i=1}^n \pi(x|\theta_i).\]
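Here is a rough sketch of how such a fit might look in practice (assumptions: NumPy, Gaussian densities for both the instrument model $\pi$ and the fitted model $p$, made-up measurement values, and a crude grid search in place of a proper optimizer):

```python
import numpy as np

# Hypothetical instrument outputs: each measurement is reported as a mean and an
# uncertainty (mu_i, sigma_i), i.e. a Gaussian pi(x | theta_i) rather than a point value.
measurements = [(4.9, 0.3), (5.2, 0.5), (5.0, 0.2), (5.4, 0.6)]

x = np.linspace(0, 10, 4001)            # integration grid over the support
dx = x[1] - x[0]

def normal_logpdf(x, mu, sigma):
    return -(x - mu)**2 / (2 * sigma**2) - np.log(sigma * np.sqrt(2 * np.pi))

# Mixture of the reported measurement densities, M_pi(x; Theta).
M_pi = np.mean([np.exp(normal_logpdf(x, m, s)) for m, s in measurements], axis=0)

# Generalized log-likelihood (up to a constant): int dx log p(x|phi) M_pi(x),
# i.e. the negative cross entropy -H(M_pi, p). The model p(x|phi) is a single
# Gaussian with phi = (m, s), fit here by a crude grid search.
def generalized_loglik(m, s):
    return np.sum(normal_logpdf(x, m, s) * M_pi) * dx

candidates = [(m, s) for m in np.linspace(4, 6, 81) for s in np.linspace(0.1, 2.0, 77)]
m_hat, s_hat = max(candidates, key=lambda ms: generalized_loglik(*ms))
print(m_hat, s_hat)   # mean near the centre of the reported means; spread reflects both
                      # the scatter of the means and their reported uncertainties
```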
[^2]: As a real-life example, when an electron decays from a higher energy level to a lower one, it emits a photon whose frequency is a random variable (following a Voigt distribution) due to quantum mechanical and thermodynamic uncertainty. A spectrometer will report the parameterization of this distribution for each spectral line.