Random Variable
We use a simplified version of a random variable.
Suppose we have a (discrete) random variable \(X\) and let \(x\) be an outcome of \(X\) with probability \(p(x)\).
Information
The information of \(x\) is then:
\(I(x) = -\log p(x)\)
The choice of base does not matter so long as we are consistent.
A common choice is base 2, in which case the information is said to be measured in bits.
Note that the information of an outcome with probability 1 is then \(-\log 1 = 0\).
So, an outcome that we can predict with certainty is said to contain no information.
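As a concrete sketch in Python (the helper name `information` and the choice of base 2 are assumptions for illustration, not anything standard):

```python
import math

def information(p: float) -> float:
    """Information content, in bits, of an outcome with probability p (0 < p <= 1)."""
    return -math.log2(p)

print(information(1.0))  # 0.0 -- a certain outcome carries no information
print(information(0.5))  # 1.0 -- a 50/50 outcome carries exactly one bit
```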
Entropy
The (information) entropy of a random variable is defined as the expected value of the information of its outcomes.
\(H(X) = - \sum_{x \in X} p(x) \log p(x) = \mathbb{E}_{x \in X}[I(x)]\)
Entropy is said to measure the average surprise of the random variable.
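A minimal sketch of this definition, assuming the distribution is given as a list of outcome probabilities (zero-probability outcomes are skipped, matching the convention used in the Totally Unfair Coin example below):

```python
import math

def entropy(probabilities) -> float:
    """Entropy in bits: the expected information over the outcomes."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)
```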
Example
Fair Coin
Suppose we toss a fair coin, with a 0.5 probability of heads and a 0.5 probability of tails.
Then the information of heads is \(-\log_2 0.5 = 1\) bit, and likewise for tails.
The expected value is thus \((0.5)(1) + (0.5)(1) = 1\) bit.
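A quick check of this arithmetic in plain Python, computing the weighted sum directly:

```python
import math

# Each outcome has probability 0.5 and carries -log2(0.5) = 1 bit of information.
print(0.5 * -math.log2(0.5) + 0.5 * -math.log2(0.5))  # 1.0
```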
Totally Unfair Coin
Suppose we toss a coin with a 1.0 probability of heads and a 0.0 probability of tails.
Then the information of heads is \(-\log_2 1.0 = 0\) bits.
However, the information of tails, \(-\log_2 0.0\), is undefined.
Note though that tails is impossible, so it never actually occurs; by the usual convention, its term \(0 \log 0\) contributes nothing to the expectation.
Thus, the expected value is \((1.0)(0) = 0\) bits - the random variable contains no surprise, and so no information on average.
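Checking this case, only the heads term appears, since tails never occurs:

```python
import math

# Heads has probability 1.0 and carries -log2(1.0) = 0 bits; the impossible
# tails outcome is left out entirely, following the 0 * log(0) = 0 convention.
h = -(1.0 * math.log2(1.0))
print(h == 0.0)  # True -- zero bits of entropy
```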
Biased Coin
Based on the above, we might expect the entropy of a biased coin to be somewhere between 0 and 1 bit - let’s try it out.
Suppose we toss a coin with a 0.75 probability of heads and a 0.25 probability of tails.
Then the information of heads is \(-\log_2 0.75 \approx 0.415\) bits.
However, the information of tails is \(-\log_2 0.25 = 2\) bits.
Thus, the expected value is \((0.75)(0.415) + (0.25)(2) \approx 0.811\) bits.
So, the biased coin has less entropy (less information on average) than the fair coin, as expected.
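A quick check of this weighted average:

```python
import math

# The weighted average of the two informations: 0.75 * 0.415 + 0.25 * 2.
print(0.75 * -math.log2(0.75) + 0.25 * -math.log2(0.25))  # ~0.811
```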