Why do we consider the log-likelihood instead of the likelihood for a Gaussian distribution?

I am reading about the Gaussian distribution in a machine learning book. It states that:

We shall determine values for the unknown parameters μ and σ² in the
Gaussian by maximizing the likelihood function. In practice, it is more
convenient to maximize the log of the likelihood function. Because the
logarithm is a monotonically increasing function of its argument,
maximization of the log of a function is equivalent to maximization of
the function itself. Taking the log not only simplifies the subsequent
mathematical analysis, but it also helps numerically because the product
of a large number of small probabilities can easily underflow the
numerical precision of the computer, and this is resolved by computing
instead the sum of the log probabilities.

Can anyone give me some intuition behind this, with an example of where the log-likelihood is more convenient than the likelihood? Please give me a practical example.

Thanks in advance!

Answer

  1. It is extremely useful, for example, when you want to calculate the
    joint likelihood for a set of independent and identically distributed
    points. Assuming that your points are $X = \{x_1, x_2, \ldots, x_N\}$,
    the total likelihood is the product of the likelihoods of each point, i.e.:
    $$p(X \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$$
    where $\Theta$ are the model parameters: the vector of means $\mu$ and
    the covariance matrix $\Sigma$. If you use the log-likelihood you end up
    with a sum instead of a product:
    $$\ln p(X \mid \Theta) = \sum_{i=1}^{N} \ln p(x_i \mid \Theta)$$
    (A numerical sketch after this list makes the difference concrete.)
  2. Also, in the case of the Gaussian, it allows you to avoid evaluating
    the exponential:
    $$p(x \mid \Theta) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
    which becomes:
    $$\ln p(x \mid \Theta) = -\frac{d}{2} \ln(2\pi) - \frac{1}{2} \ln(\det \Sigma) - \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)$$
    (A sketch after this list evaluates this log-density directly.)

  3. Like you mentioned, $\ln x$ is a monotonically increasing function,
    thus log-likelihoods preserve the same order relations as the
    likelihoods:
    $$p(x \mid \Theta_1) > p(x \mid \Theta_2) \iff \ln p(x \mid \Theta_1) > \ln p(x \mid \Theta_2)$$

  4. From the standpoint of computational complexity, you can imagine that,
    first of all, summing is less expensive than multiplication (although
    nowadays the two are almost equally fast). But what is even more
    important, the likelihoods would become very small and you would run
    out of floating-point precision very quickly, yielding an underflow.
    That's why it is far more convenient to use the logarithm of the
    likelihood. Simply try to calculate the likelihood by hand with a
    pocket calculator: it is almost impossible. (A sketch after this list
    demonstrates the underflow numerically.)

    Additionally, in the classification framework you can simplify the
    calculations even further. The order relations remain valid if you
    drop the division by 2 and the $d \ln(2\pi)$ term, because these are
    class-independent. Also, if the covariance of both classes is the same
    ($\Sigma_1 = \Sigma_2$), you can remove the $\ln(\det \Sigma)$ term as
    well.
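
Here is a minimal Python sketch of the underflow argument from points 1 and 4. The use of NumPy, the sample size N = 1000, and the standard-normal setup are my own illustrative assumptions, not part of the original answer: the product of per-point likelihoods collapses to 0.0 in double precision, while the sum of log-likelihoods stays finite and usable.

```python
import numpy as np

rng = np.random.default_rng(0)

# N i.i.d. points drawn from a standard normal (mu = 0, sigma = 1).
N = 1000
x = rng.standard_normal(N)

mu, sigma = 0.0, 1.0

# Per-point likelihoods p(x_i | mu, sigma) of a univariate Gaussian.
lik = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

# The product of ~1000 values in the range 0.001-0.4 drops far below the
# smallest representable double (~5e-324) and underflows to exactly 0.0.
total_likelihood = np.prod(lik)

# The sum of log-likelihoods, computed directly from the log-density
# formula, never forms the tiny product and stays well within range.
total_log_likelihood = np.sum(
    -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2
)

print(total_likelihood)      # 0.0  (underflow)
print(total_log_likelihood)  # roughly -1.4e3, finite and usable
```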
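
And a sketch of point 2: evaluating the multivariate Gaussian log-density directly from the expanded formula, so the exponential is never formed. The dimension d = 3, the helper name gaussian_log_pdf, and the use of np.linalg.slogdet / np.linalg.solve are illustrative choices on my part.

```python
import numpy as np

def gaussian_log_pdf(x, mu, Sigma):
    """ln p(x | mu, Sigma) for a d-dimensional Gaussian, computed as
    -d/2 ln(2*pi) - 1/2 ln(det Sigma) - 1/2 (x-mu)^T Sigma^{-1} (x-mu),
    so the exponential is never evaluated."""
    d = x.shape[0]
    diff = x - mu
    # slogdet returns the log-determinant in a numerically stable way.
    _, logdet = np.linalg.slogdet(Sigma)
    # solve(Sigma, diff) gives Sigma^{-1} (x - mu) without an explicit inverse.
    maha = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * d * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * maha

# Example usage with an arbitrary 3-dimensional Gaussian.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
x = np.array([0.5, 0.8, -0.2])

print(gaussian_log_pdf(x, mu, Sigma))

# Sanity check: agrees with the log of the density computed the naive way
# (feasible here only because d is small and the exponent is moderate).
naive = np.log(
    np.exp(-0.5 * (x - mu) @ np.linalg.solve(Sigma, x - mu))
    / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(Sigma))
)
print(naive)
```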

Attribution
Source: Link, Question Author: Kaidul Islam, Answer Author: Michael Hardy