The Profound "Why" Behind Entropy and Logarithms

I've been staring at this formula for weeks: H(p) = -∑ pᵢ log pᵢ
This deceptively simple equation is everywhere—powering the algorithms that compress your Netflix streams, training the AI models that recognize your voice, and measuring how much information you gain from each DNA base pair in genetic sequencing. Shannon entropy quantifies something fundamental: how much information is packed into uncertainty. Remarkably, thermodynamics uses the same logarithmic structure to measure physical entropy—suggesting this mathematical pattern captures something deep about disorder itself.
But here's what's maddening: why does measuring information require a logarithm? Why not something more intuitive like averages or simple ratios? The choice seems arbitrary until you dig deeper and discover that the logarithm isn't just mathematically convenient—it's the only function that makes information behave the way it should.
Most explanations I've found just say "because Shannon defined it that way". But that doesn't scratch the itch. This formula leads to some of the most profound results in science, so the "why" behind the logarithm here should supply at least a week-long dose of dopamine.
What Even Is Information?
Let me step back. What are we actually trying to measure here?
Information seems to be about... "not knowing things". When you learn something surprising, you've gained information. When you learn something you already suspected, you've gained less. When someone tells you something completely predictable, you've gained almost nothing.
So information is tied up with surprise... With uncertainty... With probability...
But how do you quantify "information"?
How to Create a Function to Measure Information
To understand something deeply — whether geometry, probability, or information — you don’t start with an equation.
You start with axioms: simple, intuitive principles that capture the essence of what you’re dealing with. Once you find axioms, you ask:
- What form must the measure take?
- What behaviors are forced by these principles?
- What kind of function would even be fair?
Claude Shannon, when trying to define a mathematical measure of information, followed this exact route. His genius was not just in writing down a formula, but in identifying the few irreducible truths that any measure of information must respect.
Let’s walk through them.
Axiom 1: Continuity
If the probabilities of events change slightly, then "information" should also change slightly. There are no "jumps" or "tears" in the function that measures information.
The function $H(p_1, \ldots, p_n)$ that measures the information in a set of probabilities $p_1, p_2, \ldots, p_n$ must be continuous in its arguments.
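A tiny numerical illustration (jumping ahead to the final formula derived later in the post, purely to see the axiom in action; the `entropy_bits` helper and the example probabilities are just illustrative choices):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms are skipped (0 * log 0 -> 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Nudging the probabilities slightly only nudges the entropy slightly -- no jumps.
print(entropy_bits([0.50, 0.50]))  # 1.0
print(entropy_bits([0.51, 0.49]))  # ~0.9997
print(entropy_bits([0.52, 0.48]))  # ~0.9988
```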
Axiom 2: Monotonicity
If all outcomes are equally likely, then the uncertainty increases as the number of outcomes increases, so our measure of information must increase with it.
$$H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right) \text{ increases with } n$$
Example: Choosing one from 8 things should feel more uncertain than choosing from 4.
More possible outcomes → more uncertainty.
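Peeking ahead again as a sanity check: for a uniform distribution over $n$ outcomes, the final formula reduces to $\log_2 n$, which grows with $n$ exactly as this axiom demands. A minimal sketch (the helper name is just for illustration):

```python
import math

def uniform_entropy_bits(n):
    """H(1/n, ..., 1/n) in bits -- works out to log2(n)."""
    p = 1.0 / n
    return -sum(p * math.log2(p) for _ in range(n))

for n in (2, 4, 8, 16):
    print(n, uniform_entropy_bits(n))
# 2 -> 1.0, 4 -> 2.0, 8 -> 3.0, 16 -> 4.0: more outcomes, more uncertainty
```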
Axiom 3: Additivity
This is the core axiom, and the one that leads uniquely to the logarithm.
Say you have two independent events: one occurs with probability $p$, the other with probability $q$. Because they are independent, the probability that both occur together is $pq$.
If we measure the information of the combined event (both happening together) using some function $H$, it takes that combined probability as input: $H(pq)$.
But that must equal the sum of the information carried by each individual event, since, being independent, neither event tells you anything about the other.
This means whatever our function $H$ measures as information, it must satisfy the following:
$$H(pq) = H(p) + H(q)$$
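A concrete example of what this demands, before we know anything about the shape of $H$: two independent fair coin flips each have probability $\frac{1}{2}$, and the pair of results together has probability $\frac{1}{4}$, so additivity requires
$$H\left(\tfrac{1}{4}\right) = H\left(\tfrac{1}{2} \cdot \tfrac{1}{2}\right) = H\left(\tfrac{1}{2}\right) + H\left(\tfrac{1}{2}\right)$$
In words: learning both coin results should be worth exactly twice the information of learning one of them.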
The only continuous function satisfying this equation is:
$$H(p) = K \log(p)$$
for some constant $K$.
Proof:
Given the functional equation:
$$f(xy) = f(x) + f(y)$$
Step 1: Differentiate with respect to x
Differentiating both sides with respect to $x$:
$$\frac{\partial}{\partial x}[f(xy)] = \frac{\partial}{\partial x}[f(x) + f(y)]$$
$$f'(xy) \cdot y = f'(x)$$
Step 2: Differentiate with respect to y
Differentiating the result from Step 1 with respect to $y$:
$$\frac{\partial}{\partial y}[f'(xy) \cdot y] = \frac{\partial}{\partial y}[f'(x)]$$
$$f''(xy) \cdot xy + f'(xy) = 0$$
Step 3: Rearrange and substitute
Rearranging the equation:
$$f''(xy) \cdot xy = -f'(xy)$$
$$f''(xy) = -\frac{f'(xy)}{xy}$$
Let $z = xy$, then:
$$f''(z) = -\frac{f'(z)}{z}$$
Step 4: Solve the first-order differential equation
This is a first-order differential equation in $f'(z)$:
$$\frac{df'}{f'} = -\frac{dz}{z}$$
Integrating both sides:
$$\int \frac{df'}{f'} = -\int \frac{dz}{z}$$
$$\log(f') = -\log(z) + C$$
$$\log(f') = \log(z^{-1}) + C$$
Therefore:
$$f'(z) = \frac{K}{z}$$
where $K = e^C$ is a constant.
Step 5: Integrate to get f
Integrating to get $f$:
$$f(z) = \int \frac{K}{z}\, dz = K \log(z) + C_1$$
Since an event with probability $1$ should give us zero information, we need $f(1) = 0$, which forces $C_1 = 0$.
Therefore:
$$f(z) = K \log(z)$$
for some constant $K$.
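As a quick numerical sanity check (not part of the proof), here is a small Python sketch confirming that $f(p) = K \log(p)$ really satisfies the original functional equation; the values of $K$, $p$, and $q$ are arbitrary example choices:

```python
import math

K = -1.0  # any constant works; -1 happens to be the sign Shannon ends up choosing

def f(p):
    return K * math.log2(p)

p, q = 0.3, 0.25  # probabilities of two independent events (arbitrary examples)
print(f(p * q))     # information of both events occurring together
print(f(p) + f(q))  # sum of the individual informations -- same value, up to float rounding
```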
From Single Events to Probability Distributions
So we've established that information from individual events must follow $H(p) = K \log(p)$ for some constant $K$. But Shannon entropy isn't about single events—it's about entire probability distributions.
Here's where the magic happens. If we have a system with outcomes having probabilities $p_1, p_2, \ldots, p_n$, what should the total information be? The expected information across all possible outcomes.
Since outcome $i$ occurs with probability $p_i$ and carries information $H(p_i) = K \log(p_i)$, the expected information is:
$$E[H] = \sum_{i=1}^n p_i \cdot K \log(p_i) = K \sum_{i=1}^n p_i \log(p_i)$$
But wait—this gives us negative values when $K > 0$ (since $\log(p_i) < 0$ for $0 < p_i < 1$), and we want information to be high when uncertainty is high. Shannon made a brilliant choice: he set $K = -1$ and used logarithms in base 2 (the base 2 is chosen purely for convenience, because it lines up with binary digital systems), giving us:
$$H(p) = -\sum_{i=1}^n p_i \log_2(p_i)$$
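A minimal Python sketch of this final formula (the function name and example distributions are just illustrative):

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum(p_i * log2(p_i)), in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))                # biased coin: ~0.469 bits
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
```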
Why This Choice Is Perfect
The negative sign isn't arbitrary—it ensures that:
- High probability events (near certainty, $p \approx 1$) contribute little information ($-\log(1) = 0$)
- Low probability events (surprises, $p \approx 0$) contribute lots of information ($-\log(p) \to \infty$ as $p \to 0$)
The base-2 logarithm gives us information in bits.
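A few example values of the per-outcome contribution $-\log_2(p)$ make this concrete (the probabilities below are arbitrary illustrations):

```python
import math

for p in (0.99, 0.5, 0.1, 0.001):
    # -log2(p): the "surprise" of an outcome with probability p, measured in bits
    print(p, -math.log2(p))
# 0.99  -> ~0.014 bits (almost certain, almost no information)
# 0.5   -> 1.0 bit
# 0.1   -> ~3.32 bits
# 0.001 -> ~9.97 bits (a rare event carries a lot of information)
```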
The Dopamine Hit
The profound realization is that the logarithm isn't just Shannon's choice—it's forced by the nature of information itself. Any measure of information that behaves sensibly (continuously, additively, monotonically) must be logarithmic.
The universe didn't just happen to work out this way. The structure of uncertainty, probability, and information demands the logarithm. Shannon didn't invent entropy; he discovered its mathematical essence.
That's the week-long dose of dopamine you were after: realizing that you understood the real "why" behind one of the most important formulas in science.