Information Entropy

Random Variable

We use a simplified, informal notion of a random variable.

Suppose we have a (discrete) random variable $X$ and let $x$ be an outcome of $X$ with probability $p(x)$.

Information

The information of $x$ is then:

$I(x) = -\log p(x)$

The choice of base does not matter so long as we are consistent.

A common choice is base 2 and then the information is said to be measured in bits.

Note that the information of an outcome with probability 1 is then $-\log 1 = 0$.

So, an outcome that we can predict with certainty is said to contain no information.
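
As a quick sketch in Python (an assumed choice of language - these notes don't prescribe one), the information of an outcome can be computed directly from its probability:

```python
import math


def information(p):
    """Information, in bits, of an outcome with probability p (0 < p <= 1)."""
    return -math.log2(p)


print(information(0.5))  # 1.0 bit for a fair coin flip.
print(information(1.0))  # 0.0 bits for a certain outcome.
```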

Entropy

The (information) entropy of a random variable is defined as the expected value of the information of its outcomes.

$H(X) = - \sum_{x \in X} p(x) \log p(x) = \mathbb{E}[I(X)]$

Entropy is said to measure the average surprise of the random variable.
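
A minimal sketch of the entropy calculation, again assuming Python; the `p > 0` check reflects the usual convention that impossible outcomes contribute nothing:

```python
import math


def entropy(probabilities):
    """Entropy, in bits, of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))  # 1.0 bit for a fair coin.
```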

Examples

Fair Coin

Suppose we toss a fair coin, with a 0.5 probability of heads and a 0.5 probability of tails.

Then the information of heads is $-\log_2 0.5 = 1$ bit, and likewise for tails.

The expected value is thus $(0.5)(1) + (0.5)(1) = 1$ bit.
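
This matches a direct calculation in Python (again, an assumed language choice):

```python
import math

# Fair coin: both outcomes have probability 0.5.
p_heads, p_tails = 0.5, 0.5
h = -(p_heads * math.log2(p_heads) + p_tails * math.log2(p_tails))
print(h)  # 1.0 bit.
```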

Totally Unfair Coin

Suppose we toss a coin with a 1.0 probability of heads and a 0.0 probability of tails.

Then the information of heads is $-\log_2 1.0 = 0$ bits.

However, the information of tails is $-\log_2 0.0$, which is undefined.

Note, though, that tails is impossible, so it never actually occurs - by convention, its term $0 \log 0$ is taken to be $0$ and contributes nothing to the expectation.

Thus, the expected value is $(1.0)(0) = 0$ bits - the random variable contains no surprise or average information.
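
A sketch of the same calculation, summing only over possible outcomes (the $0 \log 0 = 0$ convention), again assuming Python:

```python
import math

# Totally unfair coin: heads is certain, tails is impossible.
p_heads, p_tails = 1.0, 0.0
# Sum only over outcomes with non-zero probability.
h = sum(-p * math.log2(p) for p in (p_heads, p_tails) if p > 0)
print(h)  # 0.0 bits.
```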

Biased Coin

Based on the above, we might expect the entropy of a biased coin to be somewhere between 0 and 1 bits - let's try it out.

Suppose we toss a coin with a 0.75 probability of heads and a 0.25 probability of tails.

Then the information of heads is $-\log_2 0.75 \approx 0.415$ bits.

However, the information of tails is $-\log_2 0.25 = 2$ bits.

Thus, the expected value is $(0.75)(0.415) + (0.25)(2) \approx 0.811$ bits.

So, the biased coin has less entropy than the fair coin - its outcomes are, on average, less surprising.
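
Again, a quick check of the arithmetic in Python (an assumed language choice):

```python
import math

# Biased coin: heads with probability 0.75, tails with probability 0.25.
p_heads, p_tails = 0.75, 0.25
h = -(p_heads * math.log2(p_heads) + p_tails * math.log2(p_tails))
print(round(h, 3))  # 0.811 bits - less than the 1 bit of a fair coin.
```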