Introduction

Maximum Likelihood Estimation (MLE) is a popular method in machine learning for deriving cost functions from probability distributions. It is especially popular in classification problems, where we can simply equate the cost of misclassification to the negative log-probability that the model assigns to the true class. This seems intuitive, but how can we support this choice through formal analysis?

Entropy Based Approach

Let's pull out a well-known measure of uncertainty from Information Theory: The entropy [math]\displaystyle{ H(x) }[/math] of a discrete random process [math]\displaystyle{ x }[/math] with probability distribution [math]\displaystyle{ P(\text{x} = x) }[/math] is defined as the expected value ([math]\displaystyle{ \mathbb{E}_{x \sim P} }[/math] or just [math]\displaystyle{ \mathbb{E}_{x} }[/math]) of the information [math]\displaystyle{ I(x) = -\log{P(x)} }[/math] generated by the process

[math]\displaystyle{ H(x) = \mathbb{E}_{x}\left( I(x) \right) = -\mathbb{E}_{x}\left( \log{P(x)} \right) }[/math]


where [math]\displaystyle{ -\log{P(x)} }[/math] assigns a higher score to less probable events and [math]\displaystyle{ -\mathbb{E}_{x}\left( \log{P(x)} \right) }[/math] grows as these less likely events accumulate. This goes hand in hand with the notion that a process with a lot of variation in it is more difficult to model and predict.
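As a quick sanity check, the definition above can be computed directly. The following is a minimal sketch in plain Python (the `entropy` helper is ours, not from any library); it uses natural logarithms, so the result is in nats:

```python
import math

def entropy(p):
    """Shannon entropy H(x) = -sum_i p_i * log(p_i), in nats.

    Terms with p_i = 0 contribute nothing (lim p*log(p) -> 0)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

# A fair coin is maximally uncertain; a biased coin is easier to predict.
fair = entropy([0.5, 0.5])    # log(2), about 0.693 nats
biased = entropy([0.9, 0.1])  # about 0.325 nats, lower uncertainty
```

A deterministic process, `entropy([1.0])`, has zero entropy: there is no uncertainty left to measure.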

Based on the entropy of a process with a single random variable we define the cross-entropy [math]\displaystyle{ H(u, v) }[/math] of two probability distributions [math]\displaystyle{ u(x) }[/math] and [math]\displaystyle{ v(x) }[/math] based on the same random process represented by the random variable [math]\displaystyle{ x }[/math].

[math]\displaystyle{ H(u, v) = -\mathbb{E}_{u}\left( \log{v(x)} \right) }[/math]

for discrete random variables: [math]\displaystyle{ H(u, v) = -\sum_{i}u_i \log{ v_i } }[/math]


[math]\displaystyle{ H(u, v) }[/math] is at least as large as [math]\displaystyle{ H(u) }[/math]. That is, [math]\displaystyle{ H(u, v) \geq H(u) }[/math], with equality exactly when [math]\displaystyle{ u = v }[/math] (Gibbs' inequality); the gap [math]\displaystyle{ H(u, v) - H(u) }[/math] is the Kullback-Leibler divergence [math]\displaystyle{ D_{KL}(u \parallel v) }[/math]. In words: Modeling a process with the wrong distribution can only increase the apparent uncertainty.

The cross entropy is a measure of disparity between two probability distributions that are meant to represent (or model) the same random process. The relevance of cross entropy to machine learning becomes apparent when one of the two probability distributions comes from the unknown process that generated the training data while the other probability distribution is simply an estimator (based on our chosen model) that approximates the unknown process. That is, from the entropy perspective, we want

[math]\displaystyle{ H(u, v) \approx H(u) \approx H(v) }[/math]
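The inequality is easy to verify numerically. Below is a small sketch (the `cross_entropy` helper and the two example distributions are made up for illustration) showing that evaluating a distribution against itself recovers the plain entropy, while any mismatched model pays an extra cost:

```python
import math

def cross_entropy(u, v):
    """H(u, v) = -sum_i u_i * log(v_i), in nats."""
    return -sum(ui * math.log(vi) for ui, vi in zip(u, v) if ui > 0.0)

u = [0.7, 0.2, 0.1]          # the "true" distribution
v = [0.6, 0.3, 0.1]          # an imperfect model of it

h_u = cross_entropy(u, u)    # H(u, u) = H(u), about 0.802 nats
h_uv = cross_entropy(u, v)   # H(u, v) > H(u); the gap is D_KL(u || v)
```

Driving `h_uv` down toward `h_u` is exactly what fitting the model means from the entropy perspective.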


Remember that in machine learning we want to find the optimal model parameters [math]\displaystyle{ \theta^\star }[/math] for our model [math]\displaystyle{ \Phi(\theta) }[/math] that best approximates an unknown process [math]\displaystyle{ y(x) }[/math] with the estimate [math]\displaystyle{ \hat{y}(x) }[/math], where [math]\displaystyle{ x }[/math] is a vector of random variables (features) that we believe contribute to [math]\displaystyle{ y }[/math]. We also believe, from past experience, for example, that a certain class of estimator functions [math]\displaystyle{ \Phi }[/math] is a good choice for the problem. Finally, we try to find the vector of optimal parameters [math]\displaystyle{ \theta^\star }[/math], for the function we chose, by applying error-minimization algorithms on training data. Note that in Deep Learning the hocus-pocus of belief is replaced by learning the task-relevant set of features, [math]\displaystyle{ x }[/math], and the estimator function, [math]\displaystyle{ \Phi }[/math], from the data itself before tackling the optimization.


So, with the entropy based approach we simply take the cross entropy as the cost function [math]\displaystyle{ J }[/math] of our machine learning problem:

[math]\displaystyle{ J(\theta) = H(y,\hat{y}) = -\mathbb{E}_{y}\left( \log{P(\hat{y} \mid x)} \right) }[/math]


with [math]\displaystyle{ y }[/math] representing the (unknown) true process and [math]\displaystyle{ \hat{y} }[/math] representing the estimate. In classification problems [math]\displaystyle{ y }[/math] and [math]\displaystyle{ \hat{y} }[/math] are discrete random variables:

[math]\displaystyle{ J(\theta) = - \sum_{i=1}^{N}y_{i} \log{ P( \hat{y}_{i} \mid x_{i};\theta ) } }[/math]


where [math]\displaystyle{ (y_{i}, x_{i}) }[/math] are samples from the training data set while [math]\displaystyle{ \hat{y}_{i} }[/math] are our estimates. We can see in the above expression that samples for which the estimate has high probability ([math]\displaystyle{ P( \hat{y}_{i} \mid x_{i};\theta ) }[/math] close to 1.0) have negligible contribution to the cost ([math]\displaystyle{ \log{1.0} = 0.0 }[/math]) while samples for which the estimate has low probability ([math]\displaystyle{ P( \hat{y}_{i} \mid x_{i};\theta ) }[/math] closer to 0.0 than 1.0) have dramatically higher contribution to the cost.
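This asymmetry between confident and uncertain estimates is worth seeing with numbers. For a one-hot label, a sample's contribution to the cost reduces to the negative log of the probability assigned to the correct class (a hypothetical `sample_cost` helper, for illustration only):

```python
import math

def sample_cost(p_true_class):
    """Cross-entropy contribution of one sample with a one-hot label:
    -log P(correct class)."""
    return -math.log(p_true_class)

confident = sample_cost(0.99)  # about 0.01: negligible contribution
uncertain = sample_cost(0.01)  # about 4.6: dominates the total cost
```

A single badly misclassified sample can thus outweigh hundreds of well-classified ones, which is exactly the pressure we want the optimizer to feel.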

Estimator Functions

Now that we know how to evaluate the performance of the estimator [math]\displaystyle{ P(\hat{y}) }[/math] we can go about the task of choosing a concrete function. Considering that we care more about the properties of [math]\displaystyle{ \log{ P(\hat{y} \mid x;\theta) } }[/math] than [math]\displaystyle{ P(\hat{y} \mid x;\theta) }[/math], we look for probability distributions that are easy to work with when their logarithm is taken. Desirable criteria are differentiability, continuity of the derivatives up to the required order, convergence speed, absence of local minima, ease of implementation in software, and numerical stability of the implementation. Exponential functions are the obvious choice. The two most common are

  • Softmax [math]\displaystyle{ \sigma(z)_j = \frac{\mathrm{e}^{z_j}}{\sum_{k=1}^K{\mathrm{e}^{z_k}}} }[/math] with K-dimensional score vector [math]\displaystyle{ z }[/math], which is used in multi-class classification (K classes).
  • Sigmoid [math]\displaystyle{ \sigma(x) = \frac1{1 + \mathrm{e}^{-kx}} }[/math] with steepness parameter [math]\displaystyle{ k }[/math] (usually [math]\displaystyle{ k = 1 }[/math]), which is used in binary classification.
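Both can be sketched in a few lines of plain Python. Note that the sigmoid is just the softmax restricted to two classes with one score pinned to zero, which the last line demonstrates (the function names are ours; the max-shift in `softmax` is a standard stability trick and does not change the result):

```python
import math

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k).

    Shifting all scores by max(z) leaves the ratios unchanged
    but prevents overflow for large scores."""
    m = max(z)
    exps = [math.exp(zj - m) for zj in z]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x, k=1.0):
    """sigma(x) = 1 / (1 + exp(-k*x)), steepness parameter k."""
    return 1.0 / (1.0 + math.exp(-k * x))

probs = softmax([2.0, 1.0, 0.1])  # sums to 1; largest score wins
two_class = softmax([1.3, 0.0])[0]  # equals sigmoid(1.3)
```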

We prefer working with the logarithm of the probability distribution instead of the distribution itself because (1) it makes gradient descent algorithms perform better when the cost function has saturation regions, and (2) it improves numerical stability by reducing the dynamic range of the values involved without changing the location of the optimum. Saturation regions hurt gradient descent: a large change in parameter space produces only a small change in the cost. For the common estimator functions, like the sigmoid in binary classification and the softmax in multinoulli (multi-class) classification, taking the logarithm undoes the exponential and thus alleviates the saturation problem.
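The numerical-stability point can be made concrete: computing `log(softmax(z))` naively requires exponentiating the scores, which overflows for large values, while the log-softmax can be evaluated directly via the log-sum-exp trick. A sketch (our own helper, illustrating a standard technique):

```python
import math

def log_softmax(z):
    """log sigma(z)_j = z_j - logsumexp(z).

    The max-shift keeps every exp() argument <= 0, so nothing
    overflows even for very large scores."""
    m = max(z)
    lse = m + math.log(sum(math.exp(zj - m) for zj in z))
    return [zj - lse for zj in z]

# math.exp(1000.0) would raise OverflowError; log_softmax stays finite.
log_p = log_softmax([1000.0, 0.0])  # close to [0.0, -1000.0]
```

Libraries expose the same idea as fused operations (e.g. a combined log-softmax plus negative-log-likelihood loss) for exactly this reason.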

Note that in MLE we cannot directly use an estimator that does not output probability values. Multi-class Support Vector Machines, for example, produce scores that are not probabilities and need to be followed by a softmax stage.

Optimization

We now know which estimation function to tweak such that the estimation error [math]\displaystyle{ J }[/math] becomes minimal. But what exactly are the parameters that we can tweak?

We were clever enough to lump all the parameters together into a weighted sum of the features of a data point, and to place that sum where it survives differentiation intact. The optimization part of MLE is therefore the same as in other machine learning techniques: a linear weighted sum of features is driven to the point that minimizes the cost function.

[math]\displaystyle{ \theta^\star = \underset{\theta}{\arg\min}\left( -\sum_{i=1}^{N} y_i \log{P(\hat{y}_i \mid x_i;\theta)} \right) }[/math]


In practice the minimization is carried out with gradient descent or, on large data sets, its stochastic variant (SGD), which estimates the gradient from small batches of training samples.
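To make the whole pipeline concrete, here is a toy end-to-end run: a one-parameter logistic model fitted to a tiny, made-up 1-D data set by plain gradient descent on the cross-entropy cost. Everything here (data, learning rate, iteration count) is invented for illustration:

```python
import math

# Made-up, linearly separable 1-D training data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cost(theta):
    """Binary cross-entropy of P(y=1 | x; theta) = sigmoid(theta * x)."""
    j = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta * x)
        j -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return j

theta = 0.0
for _ in range(200):
    # For the logistic model, dJ/dtheta = sum_i (p_i - y_i) * x_i.
    grad = sum((sigmoid(theta * x) - y) * x for x, y in zip(xs, ys))
    theta -= 0.1 * grad  # fixed learning rate of 0.1
```

After the loop, `cost(theta)` is far below `cost(0.0)` and the model assigns probability above 0.5 to the correct class of every training point.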

Why "Maximum Likelihood"?

Why is this technique called Maximum Likelihood instead of, for example, Minimum Cross-Entropy?
Because, given the cost function, we are trying to find the parameters [math]\displaystyle{ \theta }[/math] that make the observed data most likely under the model. The assumption here is that we are confident we have a good cost function, and so we reduce the method to its less interesting optimization part. Maybe the cost function is given to us by our teacher and we don't question it. Or maybe, historically, people first arrived at exponential estimation functions from a different approach than the energy (entropy) of the system. After all, all roads lead to Rome.

