Proximal Policy Optimization (PPO, 2017)
It tries to solve the problem of destructive policy updates.
In Reinforcement Learning, if you take a step that is too large, the agent "falls off a cliff." It adopts a policy that causes it to fail immediately, which generates bad data, which leads to even worse updates. It never recovers.
The Deep Dive: The Mechanics of "Clipping"
The genius of PPO lies in its pessimistic view of the world. It doesn't trust its own recent success.
When the agent collects data, it calculates the Ratio between the new policy and the old policy for a specific action:
Ratio = π_new(a | s) / π_old(a | s)
The "Pessimistic" Update
Standard algorithms simply push this ratio as high as possible for good actions. PPO says: "Stop."
- It looks at the Advantage (how good the action was) and applies a Clip:
- If the action was good (Positive Advantage): PPO increases the probability of doing it again, BUT it caps the Ratio at roughly 1.2 (a 20% increase, using the standard clip parameter of 0.2). Even if the math says "this action was amazing, boost it by 500%," PPO clips the update at 20%. It refuses to overcommit to a single piece of evidence.
- If the action was bad (Negative Advantage): PPO decreases the probability, BUT it caps the Ratio at roughly 0.8 (a 20% decrease). It refuses to completely destroy the possibility of taking that action again based on one failure.
This results in a "Trust Region": a safe zone around the current behavior where the agent is free to learn, but beyond which the update rule effectively prevents it from moving.
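A minimal sketch of that clipping rule in PyTorch (the function name, the toy numbers, and the 0.2 clip value are illustrative choices; real implementations also add an entropy bonus, a value loss, and minibatching):

```python
import torch

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be maximized)."""
    # Ratio between the new policy and the old policy for each action.
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped term: push the ratio in the direction the advantage suggests.
    unclipped = ratio * advantages

    # Clipped term: the ratio is not allowed to leave [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages

    # The min is the "pessimistic" part: the policy never gains anything
    # by pushing the ratio outside the clip range.
    return torch.min(unclipped, clipped).mean()

# Toy usage: three actions with mixed advantages.
old_lp = torch.log(torch.tensor([0.5, 0.2, 0.3]))
new_lp = torch.log(torch.tensor([0.7, 0.1, 0.2]))
adv = torch.tensor([3.0, -1.0, 0.5])
print(ppo_clipped_objective(new_lp, old_lp, adv))
```

Because of the `torch.min`, the objective never improves by pushing the ratio past 1.2 for good actions (or below 0.8 for bad ones), which is exactly the "refuses to overcommit" behavior described above.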
Usage
Reinforcement Learning from Human Feedback (RLHF).
The Setup
- The Agent (Policy): The LLM itself (e.g., Llama-2-70b). Its "Action" is choosing the next token (word) in a sentence.
- The Environment: The conversation window.
- The Reward Model: A separate, smaller AI that has been trained to mimic a human grader. It looks at a full sentence and outputs a score (e.g., 7/10 for helpfulness).
The PPO Training Loop
- Rollout (Data Collection): The LLM generates a response to a prompt like "Explain gravity."
LLM: "Gravity is a force..." It records the probability of every token it chose (e.g., it was 99% sure about "force").
- Advantage Calculation: The Reward Model looks at the finished sentence and gives it a score (Reward). The PPO algorithm compares this score to what it expected to get.
- Scenario: The model usually writes boring answers (Expected Reward: 5). This answer was witty and accurate (Actual Reward: 8).
- Advantage: +3. This was a "better than expected" sequence of actions.
- The PPO Update (The Critical Step): We now update the LLM's neural weights to make those specific tokens more likely next time.
- Without PPO: The model might see that high reward and drastically boost the probability of those specific words, potentially overfitting and making the model speak in repetitive loops or gibberish just to chase that score.
- With PPO: The algorithm checks the Ratio.
"Did we already increase the probability of the word 'force' by 20% compared to the old model?"
Yes? -> CLIP. Stop updating. Do not push the weights further.
PPO is not just an optimization method; it is a constraint method.
It allows AI to run training loops that would otherwise be unstable.
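To make the loop concrete, here is a heavily simplified toy version in PyTorch. The 5-word vocabulary, the single logits-vector "LLM", and the hard-coded reward_model are stand-ins invented for illustration; a real RLHF pipeline uses a full transformer policy, a learned reward model, a per-token value head, and a KL penalty against the original model.

```python
import torch
import torch.nn as nn

# Toy stand-ins: the "LLM" is a single learnable logits vector over a
# 5-token vocabulary, and the "Reward Model" is a hard-coded function.
vocab = ["gravity", "is", "a", "force", "banana"]
policy_logits = nn.Parameter(torch.zeros(len(vocab)))        # the Agent (Policy)
optimizer = torch.optim.Adam([policy_logits], lr=0.05)
clip_eps, baseline = 0.2, 0.0                                 # baseline = "expected reward"

def reward_model(token_ids):
    # Mock grader: likes responses that mention "force", dislikes "banana".
    words = [vocab[i] for i in token_ids]
    return 8.0 if "force" in words else (1.0 if "banana" in words else 5.0)

for step in range(200):
    # 1. Rollout: sample a 4-token "response" and record the old probabilities.
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(logits=policy_logits)
        tokens = old_dist.sample((4,))
        old_log_probs = old_dist.log_prob(tokens)

    # 2. Advantage: the Reward Model's score minus what we usually get.
    reward = reward_model(tokens.tolist())
    advantage = reward - baseline
    baseline = 0.9 * baseline + 0.1 * reward                  # update the running expectation

    # 3. PPO update: a few epochs on the same rollout, with the clipped ratio.
    for _ in range(4):
        new_dist = torch.distributions.Categorical(logits=policy_logits)
        ratio = torch.exp(new_dist.log_prob(tokens) - old_log_probs)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        loss = -torch.min(ratio * advantage, clipped * advantage).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print("Most likely token:", vocab[policy_logits.argmax().item()])  # tends toward "force"
```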
Policy
Policy is a function that maps a State (what the agent sees) to an Action (what the agent does).
In mathematical formulas, the policy is almost always represented by the Greek letter π.
a = π(s)
s: State (input)
π: Policy (logic)
a: Action (output)
The Two Types of Policies
- Deterministic Policy
This policy has no randomness. For a specific situation, it will always output the exact same action.
Example: A chess bot. If the board is in arrangement X, it always moves the Knight to E5.
a = π(s)
- Stochastic Policy (Used in PPO)
This policy deals in probabilities. Instead of outputting a single action, it outputs a probability distribution over all possible actions. The agent then samples from this distribution.
Example: A robot learning to walk. If it is tilting left, it might be 80% likely to step right and 20% likely to wave its arm. It rolls the dice to decide.
π(a | s): read as "the probability of taking action a given state s."
Stochastic policies are essential for learning. If a policy is 100% deterministic, the agent never tries anything new; it just repeats the same mistakes. By adding randomness (probabilities), the agent explores different actions, which allows it to discover better strategies.
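A tiny Python sketch of the two policy types (the states, actions, and probabilities are made up to mirror the chess and walking examples):

```python
import random

# Deterministic policy: the same state always produces the same action.
def deterministic_policy(state):
    lookup = {"board_X": "Knight to E5"}                  # hypothetical chess lookup table
    return lookup[state]

# Stochastic policy: the state produces a probability distribution; we sample from it.
def stochastic_policy(state):
    if state == "tilting_left":
        actions, probs = ["step_right", "wave_arm"], [0.8, 0.2]
    else:
        actions, probs = ["do_nothing"], [1.0]
    return random.choices(actions, weights=probs, k=1)[0]  # roll the dice

print(deterministic_policy("board_X"))    # always "Knight to E5"
print(stochastic_policy("tilting_left"))  # usually "step_right", sometimes "wave_arm"
```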
In modern AI (like PPO), the "Policy" is a Neural Network.
- Input: The network receives the State (e.g., the pixels of a video game screen, or the text of a user prompt).
- Hidden Layers: The network processes this information.
- Output: The network outputs numbers representing the probability of each action.
Action 1 (Jump): 0.1
Action 2 (Run): 0.8
Action 3 (Duck): 0.1
When we say we are "training the policy," we are simply adjusting the weights of this neural network so that it assigns higher probabilities to "good" actions and lower probabilities to "bad" actions.
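As a sketch, such a policy network might look like this in PyTorch (the 4-number state and the Jump/Run/Duck action set are assumptions for the example):

```python
import torch
import torch.nn as nn

# A tiny policy network. The sizes (4 state features in, 3 actions out:
# Jump / Run / Duck) are made up for the example.
policy_net = nn.Sequential(
    nn.Linear(4, 32),   # Input: the State (here, 4 numbers describing the screen)
    nn.ReLU(),          # Hidden layer
    nn.Linear(32, 3),   # Output: one score (logit) per action
)

state = torch.randn(4)                                   # a made-up observation
action_probs = torch.softmax(policy_net(state), dim=-1)  # three probabilities summing to 1
action = torch.distributions.Categorical(probs=action_probs).sample()
print(action_probs, ["Jump", "Run", "Duck"][action.item()])

# "Training the policy" = adjusting policy_net's weights so that good
# actions end up with higher probabilities and bad actions with lower ones.
```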
Value Function (V)
Often called the Critic, the Value Function is the partner to the Policy (the Actor).
While the Policy answers "What should I do?", the Value Function answers: "How good is it to be in this situation?"
- The Core Concept: Prediction
Imagine you are playing a video game.
The Policy (Actor): Looks at the screen and presses the "Jump" button.
The Value Function (Critic): Looks at the screen and says, "We currently have a 70% chance of winning."
The Critic doesn't play the game; it predicts the outcome.
- The Math: V(s)
The Value Function maps a State s to a single number (a scalar): V(s) is the total reward the agent expects to collect from state s onward.
Why PPO Needs the Critic (The Advantage)
This is the most important part. PPO updates the Policy based on the Advantage. The Advantage is calculated using the Value Function.
The Logic: To know if an action was "good," we can't just look at the reward.
Example: If you get a reward of +10, is that good?
If you usually get +1, then +10 is amazing.
If you usually get +100, then +10 is terrible.
The Value Function provides that baseline "usual" expectation.
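In code, the idea is just a subtraction; the numbers below mirror the cases above, plus the +8-vs-+5 case from the RLHF walkthrough:

```python
# Advantage = what you actually got minus what the Critic said was "usual".
def advantage(actual_reward, expected_reward):
    return actual_reward - expected_reward

print(advantage(10, 1))    # +9  -> +10 is amazing if you usually get +1
print(advantage(10, 100))  # -90 -> +10 is terrible if you usually get +100
print(advantage(8, 5))     # +3  -> the "witty answer" case from the RLHF example above
```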
How the Critic Learns
While the Actor learns to maximize reward, the Critic learns to be a better predictor.
It uses a simple regression loss function (Mean Squared Error):
Loss = (V(s) - Actual Return)²
At every step, the Critic looks at the reward actually received and updates itself: "I predicted this state was worth 5 points, but we actually got 8. I should update my weights to predict higher next time."
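A minimal sketch of one Critic update in PyTorch, assuming a 4-number state and using the "we actually got 8" scenario as the regression target:

```python
import torch
import torch.nn as nn

# A tiny Critic: maps a state to a single number, its predicted value V(s).
value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

state = torch.randn(4)               # made-up state
actual_return = torch.tensor([8.0])  # "we actually got 8"

predicted = value_net(state)         # the Critic's current guess ("worth 5 points")
loss = nn.functional.mse_loss(predicted, actual_return)   # (V(s) - actual return)^2

optimizer.zero_grad()
loss.backward()
optimizer.step()                     # nudge the prediction toward 8 for next time
```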
The PPO Team:
Actor (Policy π): "I think we should move Left."
Environment: (Agent moves Left, gets +5 reward, lands in new state).
Critic (Value V): "I expected a reward of +2. You got +5. That was a great move!"
PPO Update: "Since the Critic said that was 'great' (Positive Advantage), let's adjust the Actor's weights to make 'Move Left' more likely next time—but clip it so we don't go crazy."