Dec 31, 2025

[AI Math] Kullback-Leibler (KL) Divergence

Kullback-Leibler (KL) Divergence explained.

Kullback-Leibler (KL) Divergence is the mathematical foundation that motivated PPO, but in the most popular version of PPO (PPO-Clip), it is actually not explicitly calculated.
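For context, PPO-Clip keeps the new policy close to the old one through a clipped probability ratio rather than an explicit KL term. Below is a minimal single-action sketch of that clipped surrogate objective; the probabilities, the advantage value, and the eps = 0.2 clip range are illustrative assumptions, not values from any particular implementation.

def ppo_clip_objective(p_new, p_old, advantage, eps=0.2):
    # Clipped surrogate objective for a single action (PPO maximizes this).
    # The ratio p_new / p_old plays the role an explicit KL penalty would
    # otherwise play: large deviations from the old policy stop being rewarded.
    ratio = p_new / p_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Small policy shift (70% -> 71%): ratio ~1.014, inside [0.8, 1.2], nothing is clipped.
print(ppo_clip_objective(0.71, 0.70, advantage=1.0))
# Larger shift in favor of the action (ratio 1.5): clipped at 1.2, so pushing
# the policy even further from the old one yields no extra objective.
print(ppo_clip_objective(0.90, 0.60, advantage=1.0))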

KL Divergence measures how much one probability distribution differs from another.
If you have two policies:

  • Old Policy (π(old)): Says "Jump" with 70% probability.
  • New Policy (π(new)): Says "Jump" with 71% probability.
KL Divergence: Very low (The policies are effectively the same).

If the New Policy says "Jump" with 10% probability:
KL Divergence: Very high (The "Brain" has completely changed its personality).
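To put numbers on this, here is a minimal Python sketch (using the formula defined in the next section) that treats each policy as a Bernoulli distribution over {Jump, Don't Jump} and computes D_{KL}(old || new) in nats. The two-action setup is an assumption made purely for illustration.

import math

def kl_bernoulli(p, q):
    # D_KL(P || Q) for two Bernoulli distributions with "Jump" probabilities p and q (in nats).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(kl_bernoulli(0.70, 0.71))  # ~0.0002: effectively the same policy
print(kl_bernoulli(0.70, 0.10))  # ~1.03: the policy has changed drastically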

The Formula

D_{KL}(P || Q) = sum over x of P(x) * log( P(x) / Q(x) )

where:
  • P(x): The "true" or target distribution.
  • Q(x): The approximation or model distribution.
  • Important Note: KL Divergence is non-symmetric.
    This means D_{KL}(P || Q) != D_{KL}(Q || P).
  • The ratio P(x)/Q(x) tells us how much more likely an event is under P than Q.
  • The log turns the ratio into an "information difference."
  • The P(x) outside the log acts as a weight, ensuring we care more about the differences in areas where the "true" distribution P actually occurs.
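Putting those pieces together, here is a small sketch of the discrete formula; the distributions P and Q are made-up three-outcome examples.

import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum over x of P(x) * log(P(x) / Q(x)), in nats.
    # Terms where P(x) = 0 contribute nothing, by the usual 0 * log(0) = 0 convention.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.5, 0.3, 0.2]   # the "true" / target distribution
Q = [0.4, 0.4, 0.2]   # the approximating distribution

print(kl_divergence(P, Q))  # ~0.025: mismatches are weighted by how often P visits them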


Key Characteristics to Remember

  • Always Non-negative: D_{KL}(P || Q) >= 0. It is only 0 if P and Q are identical.
  • Direction Matters: Because it is non-symmetric, it is a "divergence," not a "distance."
  • Absolute Continuity: If Q(x) = 0 for some x where P(x) > 0, the KL Divergence goes to infinity. In code, this often results in a NaN or inf error.
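A short sketch illustrating the last two points, reusing the same kl_divergence helper; the distributions are made up for illustration.

import math

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.8, 0.2]
Q = [0.5, 0.5]
print(kl_divergence(P, Q))  # ~0.19
print(kl_divergence(Q, P))  # ~0.22 -- a different number: the divergence is non-symmetric

# Absolute continuity: Q assigns zero probability to an outcome that P can produce.
P2 = [0.9, 0.1]
Q2 = [1.0, 0.0]
try:
    print(kl_divergence(P2, Q2))
except ZeroDivisionError:
    # Plain Python raises here; array libraries such as NumPy or PyTorch typically
    # return inf or nan instead, which then poisons the training loss.
    print("D_KL blows up: Q gives zero probability to an outcome P can produce")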

Information Theory

In information theory, the "surprisal" or information content of an event is defined as
h(x) = log(1/p(x)) or -log(p(x)).
Low probability events have high surprise (high information).
High probability events have low surprise (low information).
KL Divergence measures the expected difference in surprise. Since log(P(x)/Q(x)) = (-log(Q(x))) - (-log(P(x))), each term is the extra surprisal you incur by describing an outcome with Q when it was actually generated by P, and the P(x) weight averages that extra surprisal over the outcomes P actually produces.
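A small sketch of surprisal, and of KL Divergence recomputed as the P-weighted difference in surprisal; the probabilities are illustrative.

import math

def surprisal(p):
    # Information content h(x) = -log(p), in nats.
    return -math.log(p)

print(surprisal(0.99))  # ~0.01: an almost-certain event carries almost no information
print(surprisal(0.01))  # ~4.6:  a rare event is highly surprising

# KL as the P-weighted difference in surprisal between using Q and using P:
P = [0.7, 0.3]
Q = [0.1, 0.9]
kl = sum(px * (surprisal(qx) - surprisal(px)) for px, qx in zip(P, Q))
print(kl)  # identical to sum over x of P(x) * log(P(x) / Q(x))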


Turning Multiplication into Addition

Probabilities are multiplicative. If we want to find the joint probability of independent events, we multiply them: P(A and B) = P(A) x P(B).
However, math is much easier to perform (especially in calculus and optimization) when we deal with sums rather than products. The logarithm turns multiplication into addition: log(A x B) = log(A) + log(B)

This makes finding the gradient (the slope) much simpler for Machine Learning algorithms during training.
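A quick sketch of the same idea in code, using four made-up event probabilities:

import math

probs = [0.9, 0.8, 0.95, 0.7]  # probabilities of four independent events

# Joint probability: multiply them.
joint = 1.0
for p in probs:
    joint *= p

# In log space the product becomes a sum, which is what optimizers differentiate.
log_joint = sum(math.log(p) for p in probs)

print(joint)                # ~0.4788
print(math.exp(log_joint))  # the same value (up to floating-point rounding)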

Handling Extreme Scales

Probabilities are always between 0 and 1.
When we multiply many small probabilities together (for example, the likelihood a model assigns to a long sequence of events), the numbers become infinitesimally small (e.g., 0.0000000001), leading to numerical underflow, where the computer simply rounds the number to zero.
Logarithms map these tiny decimals to a much more manageable range of negative numbers (base-10 logs here):
P = 0.1 => log(P) = -1
P = 0.000001 => log(P) = -6
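A quick sketch of the underflow problem and the log-space workaround; the 1,000-event setup is made up for illustration.

import math

p = 0.01   # probability of a single event
n = 1000   # number of independent events

product = p ** n              # true value is 1e-2000, far below 64-bit float range
log_sum = n * math.log10(p)   # -2000.0, perfectly representable

print(product)   # 0.0 -- underflow: the computer rounds it to zero
print(log_sum)   # -2000.0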
