Somehow I've never found an explanation for MLE that's intuitive and suitable for someone who didn't take graduate-level statistics already. I'm 100% on board with the introduction (MSE and cross-entropy make total intuitive sense; you can see how they penalize 'wrongness' to an increasing degree) but in the next paragraph we jump right to:
> Let $p_{\text{model}}(x;\theta)$ be a parametric family of distributions over a space of parameters ...
and it's straight to the grad-level textbook stuff that breezily assumes familiarity with advanced mathematical notation.
One of the reasons I loved Andrew Ng's machine learning course so much is that it eased you into understanding the notation, terminology, and signposted things like "hey this is really important" vs. "hey this is just a weird notational quirk that mathematicians have, don't worry about it too much."
“an explanation for MLE”
I used to get by on “it’s the parameters that make the data most likely”, like it says in the name. I think that’s what you’re after.
Then I took a stats class, and now I know to say, “the MLE is minimum variance within the class of asymptotically unbiased estimators” … that is, “asymptotically efficient” and “consistent” in the jargon. (Subject to caveats.)
Then I took a Bayesian stats class and learned to say, “it’s minimum risk under an (improper) uniform prior.”
I also recall there is a general result showing that any estimator which makes the score function zero has good properties with respect to average loss. So zeroing the score by maximizing likelihood is a good strategy. (If someone could remind me of the specifics, that would be great.)
But perhaps Gauss had it right when he exploited the (known, but as yet unnamed) central limit theorem and relied on how easy it is to maximize the quadratic that sits atop its “e” (https://arxiv.org/pdf/0804.2996, page 3, top). It’s so easy we had to find a justification for using it?
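For anyone rusty on why that's so easy: take the log of a product of Gaussian densities and the exponential melts away, leaving a quadratic in $\mu$ (a standard derivation, not taken from the linked paper):

$$\log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2,$$

so maximizing over $\mu$ means minimizing $\sum_i (x_i-\mu)^2$, which gives $\hat\mu = \bar{x}$.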
This notation doesn't require graduate-level statistics knowledge, it's more like stuff that would be covered in a first mathematical course on probability and statistics. It's totally practical to learn this stuff on your own from videos, books, and PDFs. First you need to get a solid conceptual grasp on probability distributions and their associated concepts like random variables, conditional probability, and joint probability. Then you'll be ready to learn some mathematical statistics and follow along with all the notation.
Note that you don't need to go deep into measure-theoretic probability or any of that stuff that requires more advanced prior education in math.
> This notation doesn't require graduate-level statistics knowledge, it's more like stuff that would be covered in a first mathematical course on probability and statistics.
Perhaps a first course at grad level. My engineering bachelor's covered MLEs, but we didn't learn or use any of that formal language. I think the core mathematics (and likely other pure science) cohorts were the only people who learnt it.
I slowly transferred out of Trad. Engineering (Civil/Mech/Electrical/Electronic) pretty much because the Engineering Math, Chem, and Physics units were almost all "learn these results and how to apply them" and little to no "these are the underpinnings of these results".
It took six months for Math 100 (Maths for wannabe mathematicians) to "catch up" with the applications being spat out in Math 101 (Maths for people who practically use math for applications), but by the time the foundations were laid, almost all the applied-math results in the Engineering coursework just became "an exercise for the reader" to derive, with no need for rote memorisation.
> This notation doesn't require graduate-level statistics knowledge, it's more like stuff that would be covered in a first mathematical course on probability and statistics.
My courses definitely used different notation for the same semantics.
I took stats courses and hadn't seen that notation before. And yes, it looks cryptic.
One initial thing to understand is that the probability mass/density functions that you get taught in connection with standard probability distributions (binomial, Normal, etc) are functions of the data values: you put in a data value and the function outputs a probability (density), for some fixed parameter values.
At first glance likelihood functions might look the same, but you have to think of them as functions of the parameters; it's the data that's fixed now (it's whatever you observed in your experiment). Once that's clear, the calculus starts to make sense -- using the derivative of the likelihood function w.r.t. the parameters to find points in parameter space that are local maxima (or directions that are uphill in parameter space etc).
So given a model with unknown parameters, the data set you observe gives rise to a particular likelihood function, in other words the data set gives rise to a surface over your parameter space that you can explore for maxima. Regions of parameter space where your model gives a high probability to your observed data are considered to be regions of parameter space that your data suggests might describe how reality actually is. Of course, that's not taking into account your prior beliefs about which regions of parameter space are plausible, or whether the model was a good choice in the first place, or whether you've got enough data, etc.
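Here's a minimal sketch of that picture (the data and grid here are made up purely for illustration): fix a small data set, treat the Normal log-likelihood as a function of $(\mu, \sigma)$, and scan the surface for its maximum.

    import numpy as np

    # Fixed, observed data (made up for illustration).
    data = np.array([4.9, 5.3, 4.7, 5.1, 5.6])

    def log_likelihood(mu, sigma, x=data):
        # Normal log-density summed over the fixed data: now a
        # function of the parameters (mu, sigma), not of x.
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                      - (x - mu)**2 / (2 * sigma**2))

    # Explore the "surface over parameter space" on a grid.
    mus = np.linspace(3.0, 7.0, 201)
    sigmas = np.linspace(0.1, 2.0, 191)
    surface = np.array([[log_likelihood(m, s) for s in sigmas] for m in mus])

    i, j = np.unravel_index(surface.argmax(), surface.shape)
    print("grid maximum:", mus[i], sigmas[j])
    print("closed form: ", data.mean(), data.std())  # np.std (ddof=0) is the MLE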
An important point here is that the integral of the likelihood function over different parameter values is not constrained to be 1. This is why a likelihood is not a probability or a probability density, but its own thing. The confusing bit is that the likelihood formula is exactly the same as the formula of the original probability density function...
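A tiny example of that: observe a single coin flip that lands heads, with unknown heads-probability $p$. As a function of the data (for fixed $p$) the probabilities sum to 1, but as a likelihood it's $L(p) = p$ for $p \in [0, 1]$, and $\int_0^1 p \, dp = \frac{1}{2} \neq 1$.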
Splitting hairs probably but personally I'd say it the other way around: a likelihood is not a probability or a probability density, so there's no reason to think that it would integrate to 1.
The reason it's not a probability or probability density is that it's not defined to be one (in fact its definition involves a potentially different probability density for each point in parameter space).
But I think I know what you're saying -- people need to understand that it's not a probability density in order to avoid making naive probabilistic statements about parameter estimates or confidence regions when their calculations haven't used a prior over the parameters.
I was doing machine learning but had never dug into stats before. Then I tried to study Bayesian inference and regression by myself, and I finally got what it really means and why it's important. First I noticed the ubiquity of 'likelihood' and 'likelihood function'; then I realised it's just the same function viewed in terms of the model parameters instead of the input data. Then MLE is just a way to estimate the maximum of that function, which is interpreted as the most likely setting to give rise to the observed data.
I know it's not statistically correct but I think it helped a lot in my understanding of other methods....
It is not the most likely setting to give rise to the observed data (that is the posterior); it is the setting in which the observed data is the most likely.
Sorry, my English in that sentence is probably flawed. I somehow still can't tell the difference very well.
Always found the StatQuest vid on MLE to be extremely beginner friendly. Don't even need college stats or math understanding to get the intuition.
Let's say I'm a huckster who plays a game with you: I roll a single six-sided die, and if it lands on 1 you lose; otherwise you win.
Let's say you have some guarantee that I'm using the same die each time and that each of the rolls is independent. We play the game ten times, and a 1 is rolled on the first 9 of the 10 throws, with a 5 rolled on the 10th. Now, you know there's a common loaded die that can be purchased, with a weight that skews the probabilities, and you further know that this loaded die rolls a 1 80% of the time, with the remaining 20% spread evenly over the other values (so 4% for each other value).
Given a choice between the loaded die and the fair die, which one makes those rolls more likely?
The first model, call it $\theta_0$ is the fair die. The second model, with the unfair die, call it $\theta_1$.
The probability of the observed rolls under the first model ($\theta_0$) is:
$p(1,1,1,1,1,1,1,1,1,5 ; \theta_0) = \left(\frac{1}{6}\right)^9 \cdot \frac{1}{6} = \left(\frac{1}{6}\right)^{10}$
(approximately 0.0000000165)
The probability of the observed rolls under the second model ($\theta_1$) is:
$p(1,1,1,1,1,1,1,1,1,5 ; \theta_1) = (0.8)^9 \cdot 0.2 \cdot \frac{1}{5} = (0.8)^9 \cdot 0.04$
(or 0.00536870912)
So we write a computer program to iterate through all the "models" to see which one gives the observed rolls the highest probability. In this case, the iteration goes through just two models.
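A sketch of that program for this two-model case (just the arithmetic from above; the variable names are mine):

    # Observed rolls: nine 1s, then a 5.
    rolls = [1] * 9 + [5]

    # Each "model" assigns a probability to every face of the die.
    fair = {face: 1 / 6 for face in range(1, 7)}
    loaded = {1: 0.8, **{face: 0.04 for face in range(2, 7)}}

    def likelihood(model, data):
        # Probability of the whole sequence, assuming independent rolls.
        p = 1.0
        for roll in data:
            p *= model[roll]
        return p

    for name, model in [("fair", fair), ("loaded", loaded)]:
        print(name, likelihood(model, rolls))
    # The loaded die gives the observed rolls a far higher probability,
    # so of the two it is the maximum-likelihood model.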
The models can be Gaussians, with model parameters the mean and variance, say, or some other distribution with other parameters to choose from.
For some conditions on models and their parameterization, we might even be able to use more intricate methods that use calculus, gradient descent, etc. to find the MLE.
The MLE formalism is trying to say "given the observation, which parameters fit the best". It gets more complicated because we have to talk about which distributions we're allowing (which "model") and how we parameterize them. In the above, the models are simple, just assigning different probabilities to each of the outcomes of the die rolls and we only have a choice of two parameterizations.
Generally, when we construct models we do so by defining what probability they give to the data. That's a function that takes in your data set and returns some number, the higher the better.
Technically, these functions need to satisfy a bunch of properties, but those properties matter mostly for people doing the business of building and comparing models. If you just have a model someone already made for you, then "the higher the better" is good enough.
It's also the case that these models have "parameters". As a simple example, the model of a coin flip takes in "heads" or "tails" and returns a number. The higher that number, the more probable it claims that outcome to be. When we construct that model, we also choose the "fairness" parameter, usually setting it so that both heads and tails are equally likely.
So really, it's a function both of the data and of its parameters.
Now, "maximum likelihood estimation" (MLE) is just the method where you fix the data inputs to the model to whatever your training data is and then find the parameter inputs that maximize its output. This kind of inverts the normal mechanism where you pick the parameters and then see how probable the data was.
Presumptively, whatever parameterization of your model makes the data the most likely is the parameterization that best represents your data. That doesn't have to be true, and often is only approximately true, but that presumption is exactly what makes MLE popular.
Finally, it's worth describing the origin of the name. When we look at our model after fixing the data inputs and consider it a function of its parameters instead we call that function a "likelihood". This is just another name for "probability" except it's used to emphasize that likelihoods don't meet all the technical properties I skipped up above. So "maximum likelihood estimation" is just the process of estimating the parameters of your model by maximizing the likelihood.
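A small sketch of all of that for the coin-flip model (the 7-heads-in-10-flips data is made up; in this simple case the search just recovers the obvious heads/flips answer):

    import numpy as np

    # Fix the data inputs to the model: say we observed 7 heads in 10 flips.
    heads, flips = 7, 10

    def log_likelihood(fairness):
        # Log of the probability the model assigns to the fixed data,
        # viewed as a function of the "fairness" parameter.
        return heads * np.log(fairness) + (flips - heads) * np.log(1 - fairness)

    # Find the parameter input that maximizes the output.
    grid = np.linspace(0.001, 0.999, 999)
    mle = grid[np.argmax(log_likelihood(grid))]
    print(mle)            # ~0.7
    print(heads / flips)  # closed-form MLE for this model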
My favorite MLE example: Suppose you walk into a bank and ask them to give you a quarter. You flip the quarter twice and get two heads. Given this experiment, what do you estimate to be the probability p of getting heads when you flip this coin? Using MLE, you would get p = 1. In other words, this coin will always give you heads when you flip it! (According to MLE.)
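Spelled out: the likelihood of two heads in a row is $L(p) = p^2$, which is increasing on $[0, 1]$, so it's maximized at the boundary, $\hat{p} = 1$.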
Are you just demonstrating overfitting when estimating using too little data? Or is there something deeper going on in your example? What does the bank have to do with anything?
The example only seems ridiculous because you've deliberately excluded relevant knowledge about the world from the model. Add a prior to the model and you'll have a much more reasonable function to maximise.
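For example (one common choice among many): put a Beta(2, 2) prior on $p$. After $k$ heads in $n$ flips the posterior is Beta$(k+2,\, n-k+2)$, whose mode is $\frac{k+1}{n+2}$; with two heads in two flips that gives $\hat{p} = \frac{3}{4}$ instead of 1.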
To bring things full circle: the cross-entropy loss is the KL divergence. So intuitively, when you're minimizing cross-entropy loss, you're trying to minimize the "divergence" between the true distribution and your model distribution.
This intuition really helped me understand CE loss.
Cross-entropy is not the KL divergence. There is an additional term in cross-entropy which is the entropy of the data distribution (i.e., independent of the model). So, you're right in that minimizing one is equivalent to minimizing the other.
https://stats.stackexchange.com/questions/357963/what-is-the...
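In symbols, for data distribution $p$ and model $q$:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q),$$

and since $H(p)$ doesn't depend on the model, minimizing $H(p, q)$ over $q$ is the same as minimizing $D_{\mathrm{KL}}(p \,\|\, q)$.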
Yes, you are totally correct, but I believe this term is omitted from the cross-entropy loss function that is used in machine learning? Because it is a constant which does not contribute to the optimization.
Please correct me if I'm wrong.
This reminds me that David MacKay’s book and his lectures are so excellent on these topics.
If you have a parametrized function that imperfectly models a real phenomenon, of course there are errors. Why assume they are random? A better assumption is that your model is just poor. Assuming that deterministic modeling errors are due to randomness has always struck me as bizarre.
In the context of MLE, random has a formal definition. What you describe as poor would be included in the mathematics as a factor outside the deterministic parameters that are modeled. E.g. Y = aFactor1 + bFactor2 + ... + constant + 'poor model correction factor'.
To solve the equation, we have to make assumptions about the 'poor model correction' factor. These assumptions about the error generally give it some 'mathematically nice' qualities; for example, it's not predictable and has no trend relating to any other factor. A concrete example is having a mean of zero: if it had a non-zero mean, that mean should be absorbed into the constant term of the model.
All these mathematically nice assumptions can be summed up by calling the 'poor model correction' factor random.
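Concretely, if you assume that correction term is i.i.d. Gaussian noise around a model $f_\theta$, the log-likelihood of the parameters is, up to a term that doesn't depend on $\theta$, minus the sum of squared residuals,

$$\log \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2}\right) = \text{const} - \frac{1}{2\sigma^2} \sum_i \left(y_i - f_\theta(x_i)\right)^2,$$

so maximizing likelihood under that assumption is exactly minimizing the MSE mentioned at the top of the thread.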
Excellent post
The next step is ELBO — the evidence lower bound.