I have run into entropy many times, but it has never been clear to me why we use this formula:

If X is a random variable, then its entropy is:

H(X) = -\displaystyle\sum_{x} p(x)\log p(x).

Why are we using this formula? Where did this formula come from? I’m looking for the intuition. Is it because this function just happens to have some good analytical and practical properties? Is it just because it works? Where did Shannon get this from? Did he sit under a tree and have entropy fall on his head, like the apple did for Newton? How do you interpret this quantity in the real physical world?

**Answer**

Here’s one mildly informal answer.

How surprising is an event? Informally, the lower the probability you would’ve assigned to an event, the more surprising it is, so surprise seems to be some kind of decreasing function of probability. It’s reasonable to ask that it be continuous in the probability. And if event A has a certain amount of surprise, and event B has a certain amount of surprise, and you observe them together, and they’re independent, it’s reasonable that the amounts of surprise add.
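Written out as conditions on a function, these requirements might look as follows. The symbol s for the surprise function is notation introduced here for illustration; it does not appear in the original answer.

```latex
% s(p): surprise of an event of probability p (notation assumed here)
s : (0, 1] \to \mathbb{R} \text{ is continuous and decreasing}, \qquad
s(pq) = s(p) + s(q) \quad \text{for all } p, q \in (0, 1].
% The second condition is additivity for independent events A and B,
% because \mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B).
```

Setting p = q = 1 in the second condition gives s(1) = 2s(1), so s(1) = 0: an event that is certain carries no surprise.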

From here it follows that the surprise you feel at event A happening must be a positive constant multiple of -\log \mathbb{P}(A) (exercise; this is related to the Cauchy functional equation). Taking surprise to just be -\log \mathbb{P}(A), it follows that the entropy of a random variable is its **expected surprise**, or in other words it measures how surprised you expect to be on average after sampling it.
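To make "expected surprise" concrete, here is a minimal Python sketch. The function names `surprise` and `entropy` and the example coins are made up for illustration, and the natural logarithm is an arbitrary choice of the positive constant.

```python
import math

def surprise(p):
    """Surprise of an event with probability p, in nats: -log p."""
    return -math.log(p)

def entropy(dist):
    """Expected surprise of a distribution given as {outcome: probability}."""
    # Outcomes with probability 0 never occur, so they contribute nothing.
    return sum(p * surprise(p) for p in dist.values() if p > 0)

# Hypothetical example distributions.
fair_coin = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}
certain = {"H": 1.0}

print(entropy(fair_coin))    # largest of the three: log 2 nats
print(entropy(biased_coin))  # smaller: the outcome is easier to guess
print(entropy(certain))      # zero: no surprise at all
```

The fair coin has the highest entropy of the three, which matches the intuition that it is the hardest one to predict.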

Closely related is Shannon’s source coding theorem, if you think of -\log \mathbb{P}(A) as a measure of how many bits you need to tell someone that A happened.
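A small sketch of the "bits" reading, using base-2 logarithms. The four-symbol source and the hand-picked prefix-free code below are assumptions chosen so that every probability is a power of 1/2; for such a source each codeword can have length exactly -\log_2 p(x), and the average code length then equals the entropy.

```python
import math

# Hypothetical source with dyadic probabilities (assumed for illustration).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# A prefix-free binary code picked by hand: no codeword is a prefix of another.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def bits(p):
    """Surprise measured in bits: -log2 p."""
    return -math.log2(p)

# Entropy in bits per symbol, i.e. the expected surprise in base 2.
entropy_bits = sum(p * bits(p) for p in probs.values())

# Average number of bits per symbol actually spent by the code above.
avg_code_length = sum(probs[x] * len(code[x]) for x in probs)

print(entropy_bits)
print(avg_code_length)
```

Both numbers come out to 1.75 bits per symbol here. For probabilities that are not powers of 1/2, codeword lengths have to be whole numbers, so a symbol-by-symbol code generally spends somewhat more than the entropy on average.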

**Attribution** *Source: Link, Question Author: jjepsuomi, Answer Author: Qiaochu Yuan*