Random Variable
We will use a simplified notion of a random variable.
Suppose we have a (discrete) random variable $X$ and let $x$ be an outcome of $X$ with probability $p(x)$.
Information
The information of $x$ is then:
$I(x) = -\log p(x)$
The choice of base does not matter so long as we are consistent.
A common choice is base 2 and then the information is said to be measured in bits.
Note that the information of an outcome with probability 1 is then $-\log 1 = 0$.
So, an outcome that we can predict with certainty is said to contain no information.
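As a minimal sketch of this definition (the function name `information` and the use of Python here are just illustrative choices, not part of the original setup):

```python
import math

def information(p: float) -> float:
    """Self-information of an outcome with probability p, measured in bits (base 2)."""
    return -math.log2(p)

print(information(0.5))   # 1.0: an even-odds outcome carries 1 bit
print(information(0.25))  # 2.0: a rarer outcome is more surprising
print(information(1.0))   # -0.0 (i.e. 0 bits): a certain outcome carries no information
```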
Entropy
The (information) entropy of a random variable is defined as the expected value of the information of its outcomes.
$ H(X) = - \sum_{x \in X} p(x) \log p(x) = \mathbb{E}_{x \in X}[I(x)]$
Entropy is said to measure the average surprise of the random variable.
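As a sketch of this definition (the helper name `entropy` and the dict-of-probabilities representation are assumptions made for this example):

```python
import math

def entropy(dist: dict[str, float]) -> float:
    """Entropy H(X) in bits of a discrete distribution given as {outcome: probability}."""
    # Expected value of the information -log2 p(x), weighted by p(x).
    # Outcomes with p = 0 are skipped: an impossible outcome contributes
    # nothing (see the totally unfair coin below).
    return sum(p * -math.log2(p) for p in dist.values() if p > 0)
```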
Examples
Fair Coin
Suppose we toss a fair coin, with a 0.5 probability of heads and a 0.5 probability of tails.
Then the information of heads is $-\log_2 0.5 = 1$ bit, and likewise for tails.
The expected value is thus $(0.5)(1) + (0.5)(1) = 1$ bit.
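A quick, self-contained check of this arithmetic (the variable names are just illustrative):

```python
import math

p = {"heads": 0.5, "tails": 0.5}
H = sum(-q * math.log2(q) for q in p.values())
print(H)  # 1.0: (0.5)(1) + (0.5)(1) = 1 bit
```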
Totally Unfair Coin
Suppose we toss a coin with a 1.0 probability of heads and a 0.0 probability of tails.
Then the information of heads is $-\log_2 1.0 = 0$ bits.
However, the information of tails is $-\log_2 0.0$, which is undefined.
Note, though, that tails is impossible, so it never actually occurs; by convention its term contributes $0 \cdot \log 0 = 0$ to the expectation (the limit of $p \log p$ as $p \to 0$).
Thus, the expected value is $(1.0)(0) = 0$ bits - the random variable contains no surprise, i.e. no average information.
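The same check for the totally unfair coin; the impossible tails outcome is dropped from the sum, matching the convention above (again, the names are illustrative):

```python
import math

p = {"heads": 1.0, "tails": 0.0}
# Skip impossible outcomes: they never occur, so they contribute nothing.
H = sum(-q * math.log2(q) for q in p.values() if q > 0)
print(H)  # 0.0: only the certain heads outcome contributes, and it carries 0 bits
```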
Biased Coin
Based on the above, we might expect the entropy of a biased coin to be somewhere between 0 and 1 bits - let's try it out.
Suppose we toss a coin with a 0.75 probability of heads and a 0.25 probability of tails.
Then the information of heads is $-\log_2 0.75 \approx 0.415$ bits.
However, the information of tails is $-\log_2 0.25 = 2$ bits.
Thus, the expected value is $(0.75)(0.415) + (0.25)(2) \approx 0.81$ bits.
So, the biased coin has less entropy (less average information) than the fair coin's 1 bit, as we expected.
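And the same check for the biased coin, confirming a value between 0 and 1 bits (names are illustrative):

```python
import math

p = {"heads": 0.75, "tails": 0.25}
H = sum(-q * math.log2(q) for q in p.values())
print(H)  # approximately 0.811 bits - less than the fair coin's 1 bit
```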