Random Variable
We will use a simplified notion of a random variable.
Suppose we have a (discrete) random variable $X$ and let $x$ be an outcome of $X$ with probability $p(x)$.
Information
The information of $x$ is then:
$I(x) = -\log p(x)$
The choice of base does not matter so long as we are consistent.
A common choice is base 2 and then the information is said to be measured in bits.
Note that the information of an outcome with probability 1 is then $-\log 1 = 0$.
So, an outcome that we can predict with certainty is said to contain no information.
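As a minimal sketch of this definition (the function name `information` and the use of Python here are just illustrative choices, not part of the original setup):

```python
import math

def information(p: float) -> float:
    """Self-information of an outcome with probability p, measured in bits (base 2)."""
    return -math.log2(p)

print(information(0.5))   # 1.0: an even-odds outcome carries 1 bit
print(information(0.25))  # 2.0: a rarer outcome is more surprising
print(information(1.0))   # -0.0 (i.e. 0 bits): a certain outcome carries no information
```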
Entropy
The (information) entropy of a random variable is defined as the expected value of the information of its outcomes.
$ H(X) = - \sum_{x \in X} p(x) \log p(x) = \mathbb{E}_{x \in X}[I(x)]$
Entropy is said to measure the average surprise of the random variable.
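As a sketch of this definition (the helper name `entropy` and the dict-of-probabilities representation are assumptions made for this example):

```python
import math

def entropy(dist: dict[str, float]) -> float:
    """Entropy H(X) in bits of a discrete distribution given as {outcome: probability}."""
    # Expected value of the information -log2 p(x), weighted by p(x).
    # Outcomes with p = 0 are skipped: an impossible outcome contributes
    # nothing (see the totally unfair coin below).
    return sum(p * -math.log2(p) for p in dist.values() if p > 0)
```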
Examples
Fair Coin
Suppose we toss a fair coin, with a 0.5 probability of heads and a 0.5 probability of tails.
Then the information of heads is $-\log_2 0.5 = 1$ bit, and likewise for tails.
The expected value is thus $(0.5)(1) + (0.5)(1) = 1$ bit.
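A quick, self-contained check of this arithmetic (the variable names are just illustrative):

```python
import math

p = {"heads": 0.5, "tails": 0.5}
H = sum(-q * math.log2(q) for q in p.values())
print(H)  # 1.0: (0.5)(1) + (0.5)(1) = 1 bit
```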
Totally Unfair Coin
Suppose we toss a coin with a 1.0 probability of heads and a 0.0 probability of tails.
Then the information of heads is $-\log_2 1.0 = 0$ bits.
However, the information of tails is $-\log_2 0.0$, which is undefined.
Note, though, that tails is impossible, so it never actually occurs; by convention its term contributes $0 \cdot \log 0 = 0$ to the expectation (the limit of $p \log p$ as $p \to 0$).
Thus, the expected value is $(1.0)(0) = 0$ bits - the random variable contains no surprise, i.e. no average information.
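The same check for the totally unfair coin; the impossible tails outcome is dropped from the sum, matching the convention above (again, the names are illustrative):

```python
import math

p = {"heads": 1.0, "tails": 0.0}
# Skip impossible outcomes: they never occur, so they contribute nothing.
H = sum(-q * math.log2(q) for q in p.values() if q > 0)
print(H)  # 0.0: only the certain heads outcome contributes, and it carries 0 bits
```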
Biased Coin
Based on the above, we might expect the entropy of a biased coin to be somewhere between 0 and 1 bits - let's try it out.
Suppose we toss a coin with a 0.75 probability of heads and a 0.25 probability of tails.
Then the information of heads is $-\log_2 0.75 \approx 0.415$ bits.
However, the information of tails is $-\log_2 0.25 = 2$ bits.
Thus, the expected value is $(0.75)(0.415) + (0.25)(2) \approx 0.81$ bits.
So, the biased coin has less entropy (less average information) than the fair coin's 1 bit, as we expected.
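And the same check for the biased coin, confirming a value between 0 and 1 bits (names are illustrative):

```python
import math

p = {"heads": 0.75, "tails": 0.25}
H = sum(-q * math.log2(q) for q in p.values())
print(H)  # approximately 0.811 bits - less than the fair coin's 1 bit
```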