Ataraxia through Epoché
Symphilosophein: "The bold press forward; the cautious have things they will not do."
Feb 3, 2026
Jan 14, 2026
[C++] Cache-friendly
Just default to int whenever the usage is covered by the contract.
Reference:
Object Lifetimes
alignas(int) unsigned char buffer[sizeof(int)]; // used for provenance contract.
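A minimal sketch of how such a buffer is typically used (assumed usage, not taken from the post): placement-new starts the lifetime of an int inside the raw storage, and std::launder re-derives a pointer with valid provenance.
#include <new>   // placement new, std::launder (C++17)
int demo() {
    alignas(int) unsigned char buffer[sizeof(int)];        // raw, suitably aligned storage
    ::new (static_cast<void*>(buffer)) int(42);            // placement-new starts the int's lifetime
    return *std::launder(reinterpret_cast<int*>(buffer));  // pointer with correct provenance
}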
One byte types:
char8_t is often faster because the optimizer can make stronger aliasing assumptions.
The reason is that a pointer to char, unsigned char, or std::byte may point to (alias) any object, so the compiler has to assume data.size() could be modified inside the loop and must re-read it on every iteration.
char8_t has no such aliasing privilege, so data.size() is loop-invariant and can be kept in a register (see the sketch below).
Also, in metaprogramming, types deduced from such bit-fields have to be cast explicitly.
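A minimal sketch of the aliasing difference (the function names are made up for illustration): with unsigned char the compiler must assume the writes through p might alias the vector's internal bookkeeping, so it tends to reload data.size() each iteration; with char8_t that aliasing is impossible, so size() can be hoisted into a register.
#include <cstddef>
#include <vector>
void fill_bytes(std::vector<unsigned char>& data, unsigned char* p) {
    for (std::size_t i = 0; i < data.size(); ++i)
        p[i] = 0;   // may alias data's size/pointer members -> size() reloaded each time
}
void fill_utf8(std::vector<char8_t>& data, char8_t* p) {
    for (std::size_t i = 0; i < data.size(); ++i)
        p[i] = 0;   // cannot alias them -> size() kept in a register
}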
The L2 cache is typically unified: it holds both data and instructions.
e.g. a long branch can be slower: if the branch is taken, new instructions are pulled in and replace what is already in the instruction cache.
Always start with DoD (data-oriented design); focus on the algorithm.
Align to cache-line boundaries to avoid false sharing.
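A minimal sketch of the alignment trick (Counters is a made-up type; std::hardware_destructive_interference_size from <new> is assumed to be available, with 64 as a common fallback): giving each hot member its own cache line keeps two writer threads from invalidating each other's line.
#include <atomic>
#include <new>   // std::hardware_destructive_interference_size (C++17)
struct Counters {
    alignas(std::hardware_destructive_interference_size) std::atomic<long> hits{0};    // written by thread A
    alignas(std::hardware_destructive_interference_size) std::atomic<long> misses{0};  // written by thread B
};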
All-in-all summary
Avoid long branches:
// BAD
void process_data(Data* d) {
    if (!d) {
        // LONG BRANCH / COLD CODE
        // Imagine 100 lines of logging, stack tracing,
        // and complex error recovery logic here.
        log_error("Null pointer detected...");
        cleanup_subsystems();
        notify_admin_via_snmp();
        throw std::runtime_error("Fatal Error");
    }
    // This "hot" code is now physically far away from the 'if'
    // check in the compiled binary.
    d->value += 42;
    d->status = Ready;
}
// Better: avoid I-cache misses
// Move cold logic to a non-inline function
[[noreturn]] void handle_fatal_error() {
    log_error("Null pointer detected...");
    cleanup_subsystems();
    notify_admin_via_snmp();
    throw std::runtime_error("Fatal Error");
}
void process_data_optimized(Data* d) {
    // C++20 [[likely]]/[[unlikely]] attributes guide the compiler
    if (!d) [[unlikely]] {
        handle_fatal_error(); // A jump to a far-away location happens ONLY on error
    }
    // This code is now physically adjacent to the 'if' check
    d->value += 42;
    d->status = Ready;
}
Jan 5, 2026
[6.S191][note] Introduction to Deep Learning
Why activation functions?
They introduce non-linearities into the network.
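One way to see why (a standard observation, not from the lecture notes): without an activation, stacking linear layers collapses into a single linear map, e.g. W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), so the extra layer adds no expressive power.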
The loss of a network measures the cost incurred from incorrect predictions.
Cross entropy loss can be used with models that output a probability between 0 and 1.
- the log of a probability in (0, 1) is negative, so a minus sign is applied to flip it to a positive loss value.
Mean squared error loss can be used with regression models that output continuous real numbers.
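Sketched concretely for a single prediction ŷ with target y (binary classification for cross entropy, a real-valued target for MSE; in practice both are averaged over the training examples):
Cross entropy: L = -( y * log(ŷ) + (1 - y) * log(1 - ŷ) )
Mean squared error: L = (y - ŷ)^2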
Training the network
Focus on Loss Optimization
We want to find the network weights that achieve the lowest loss.
Here's where gradient descent comes in (the gradients are computed by backpropagation).
Thus, by using the derivative (i.e. m, the slope of the loss with respect to a weight), we can see whether m > 0 or m < 0. If m > 0, we lower the weight; if m < 0, we increase it.
So how big a step should we take? (the learning rate)
Set it too small and training is slow; too large and it becomes unstable.
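A toy sketch of gradient descent on a made-up one-dimensional loss L(w) = (w - 3)^2, whose derivative is dL/dw = 2(w - 3); the sign of the slope decides which way the weight moves, and the learning rate decides how far.
#include <cstdio>
int main() {
    double w = 0.0;                     // initial weight
    const double lr = 0.1;              // learning rate: too small -> slow, too large -> unstable
    for (int step = 0; step < 100; ++step) {
        double grad = 2.0 * (w - 3.0);  // derivative (slope m)
        w -= lr * grad;                 // step against the slope
    }
    std::printf("w ~= %f\n", w);        // converges toward 3
}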
Methods:
Jan 4, 2026
[AI math] Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO, 2017)
It tries to solve the destructive-update problem.
In Reinforcement Learning, if you take a step that is too large, the agent "falls off a cliff." It adopts a policy that causes it to fail immediately, which generates bad data, which leads to even worse updates. It never recovers.
The Deep Dive: The Mechanics of "Clipping"
The genius of PPO lies in its pessimistic view of the world. It doesn't trust its own recent success.
When the agent collects data, it calculates the Ratio between the new policy and the old policy for a specific action:
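The standard form of that ratio (reconstructed here; θ are the new policy parameters, θ_old the parameters that collected the data):
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)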
The "Pessimistic" Update
Standard algorithms simply push this ratio as high as possible for good actions. PPO says: "Stop."
- It looks at the Advantage (how good the action was) and applies a Clip:
- If the action was good (Positive Advantage): PPO increases the probability of doing it again, BUT it caps the Ratio at roughly 1.2 (increasing the probability by at most 20%). Even if the math says "this action was amazing, boost it by 500%," PPO clips the update at 20%. It refuses to overcommit to a single piece of evidence.
- If the action was bad (Negative Advantage): PPO decreases the probability, BUT it caps the Ratio at roughly 0.8. It refuses to completely destroy the possibility of taking that action again based on one failure.
This results in a "Trust Region"—a safe zone around the current behavior where the agent is free to learn, but beyond which it is physically prevented from moving.
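A minimal sketch of the per-sample clipped surrogate objective described above (function name and inputs are assumptions; a real implementation averages this over a batch and maximizes it):
#include <algorithm>   // std::clamp, std::min
#include <cmath>       // std::exp
double ppo_clipped_objective(double new_log_prob, double old_log_prob,
                             double advantage, double epsilon = 0.2) {
    double ratio   = std::exp(new_log_prob - old_log_prob);           // pi_new / pi_old
    double clipped = std::clamp(ratio, 1.0 - epsilon, 1.0 + epsilon); // cap at roughly 0.8 / 1.2
    // Pessimistic: keep whichever candidate gives the smaller objective.
    return std::min(ratio * advantage, clipped * advantage);
}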
Usage
Reinforcement Learning from Human Feedback (RLHF).
The Setup
- The Agent (Policy): The LLM itself (e.g., Llama-2-70b). Its "Action" is choosing the next token (word) in a sentence.
- The Environment: The conversation window.
- The Reward Model: A separate, smaller AI that has been trained to mimic a human grader. It looks at a full sentence and outputs a score (e.g., 7/10 for helpfulness).
The PPO Training Loop
- Rollout (Data Collection): The LLM generates a response to a prompt like "Explain gravity."
LLM: "Gravity is a force..." It records the probability of every token it chose (e.g., it was 99% sure about "force"). - Advantage Calculation: The Reward Model looks at the finished sentence and gives it a score (Reward). The PPO algorithm compares this score to what it expected to get.
- Scenario: The model usually writes boring answers (Expected Reward: 5). This answer was witty and accurate (Actual Reward: 8).
- Advantage: +3. This was a "better than expected" sequence of actions.
- The PPO Update (The Critical Step): We now update the LLM's neural weights to make those specific tokens more likely next time.
- Without PPO: The model might see that high reward and drastically boost the probability of those specific words, potentially overfitting and making the model speak in repetitive loops or gibberish just to chase that score.
- With PPO: The algorithm checks the Ratio.
"Did we already increase the probability of the word 'force' by 20% compared to the old model?"
Yes? -> CLIP. Stop updating. Do not push the weights further.
PPO is not just an optimization method; it is a constraint method.
It allows AI to run training loops that would otherwise be unstable.
Policy
Policy is a function that maps a State (what the agent sees) to an Action (what the agent does).
In mathematical formulas, the policy is almost always represented by the Greek letter π.
a = π(s)
s: State (input)
π: Policy (logic)
a: Action (output)
The Two Types of Policies
- Deterministic Policy
This policy has no randomness. For a specific situation, it will always output the exact same action.
Example: A chess bot. If the board is in arrangement X, it always moves the Knight to E5.
a = π(s)
- Stochastic Policy (Used in PPO)
This policy deals in probabilities. Instead of outputting a single action, it outputs a probability distribution over all possible actions. The agent then samples from this distribution.
Example: A robot learning to walk. If it is tilting left, it might be 80% likely to step right and 20% likely to wave its arm. It rolls the dice to decide.
π(a | s): read as "the probability of taking action a given state s".
In modern AI (like PPO), the "Policy" is a Neural Network.
- Input: The network receives the State (e.g., the pixels of a video game screen, or the text of a user prompt).
- Hidden Layers: The network processes this information.
- Output: The network outputs numbers representing the probability of each action.
Action 1 (Jump): 0.1
Action 2 (Run): 0.8
Action 3 (Duck): 0.1
When we say we are "training the policy," we are simply adjusting the weights of this neural network so that it assigns higher probabilities to "good" actions and lower probabilities to "bad" actions.
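A minimal sketch of such a policy head (all names are made up): softmax turns the network's raw action scores into probabilities, and the agent samples an action from that distribution.
#include <algorithm>   // std::max_element
#include <cmath>       // std::exp
#include <cstddef>
#include <random>
#include <vector>
std::vector<double> softmax(const std::vector<double>& logits) {
    double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);  // subtract max for numerical stability
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;                // normalize to probabilities
    return probs;
}
int sample_action(const std::vector<double>& probs, std::mt19937& rng) {
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);                                // e.g. 0 = Jump, 1 = Run, 2 = Duck
}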
Value Function (V)
Often called the Critic, the Value Function is the partner to the Policy (the Actor).
While the Policy answers "What should I do?", the Value Function answers: "How good is it to be in this situation?"
- The Core Concept: Prediction
Imagine you are playing a video game.
The Policy (Actor): Looks at the screen and presses the "Jump" button.
The Value Function (Critic): Looks at the screen and says, "We currently have a 70% chance of winning."
The Critic doesn't play the game; it predicts the outcome.
- The Math: V(s)
The Value Function maps a State s to a single number (a scalar).
- Why PPO Needs the Critic (The Advantage)
This is the most important part. PPO updates the Policy based on the Advantage. The Advantage is calculated using the Value Function.
The Logic: To know if an action was "good," we can't just look at the reward.
Example: If you get a reward of +10, is that good?
If you usually get +1, then +10 is amazing.
If you usually get +100, then +10 is terrible.
The Value Function provides that baseline "usual" expectation.
- How the Critic Learns
While the Actor learns to maximize reward, the Critic learns to be a better predictor.
- The PPO Team:
Actor (Policy π): "I think we should move Left."
Environment: (Agent moves Left, gets +5 reward, lands in new state).
Critic (Value V): "I expected a reward of +2. You got +5. That was a great move!"
PPO Update: "Since the Critic said that was 'great' (Positive Advantage), let's adjust the Actor's weights to make 'Move Left' more likely next time—but clip it so we don't go crazy."
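In its simplest form (ignoring discounting and generalized advantage estimation), the Advantage used above is just the gap between what happened and what the Critic predicted:
Advantage = actual reward - V(s) = 5 - 2 = +3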
Jan 3, 2026
[C++26] std::execution
Reference:
https://en.cppreference.com/w/cpp/experimen
https://www.virjacode.com/papers/p3288.htm
code:
constexpr explicit operation_state(Range&& r, Receiver rcvr)
    : /* ... */
      states_(
          std::from_range,
          std::ranges::views::transform(
              std::forward<Range>(r),
              [&]<typename U> requires
                  std::constructible_from<ender, U>(U&& u)
              {
                  return elide([&, s = std::forward<U>(u)]() mutable {
                      return state_(*this, std::move(s));
                  });
              })),
      /* ... */
{}
[C++] Equality preservation
Reference:
Equality preservation
Equality Preservation is a rule that says a function (or expression) must be "well-behaved" and "predictable."
It essentially means: If you give it the same inputs, you should get the same outputs.
- The Core Idea: "No Surprises"
If we have two objects, a and b, and they are equal (a == b), then doing the same thing to both of them should result in the same outcome.
Equal Inputs -> Equal Results: If f(a) returns 5, and a == b, then f(b) must also return 5.
Equal Inputs -> Equal Side Effects: If calling f(a) changes a to 10, then calling f(b) must change b to 10.
- The Requirement of "Stability"
The standard also requires equality-preserving expressions to be stable. This means if we haven't changed the object ourselves, calling the function twice on the same object must give the same result both times.
Example of something NOT stable:
A function that returns a random number, or a function that uses a global counter that increments every time you call it, i.e. one with no referential transparency.
int counter = 0;
int bad_function(int x) { return x + (counter++); } // NOT equality-preserving
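For contrast, a trivial equality-preserving counterpart (hypothetical name):
int good_function(int x) { return x + 1; } // equality-preserving: equal inputs always yield equal results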
By requiring concepts to be equality-preserving, C++ guarantees that:
- The algorithm can trust the values it reads.
- The algorithm can safely make copies of objects and expect the copies to behave exactly like the originals.
Unless noted otherwise, every expression used in a requires expression of the standard library concepts is required to be equality preserving, and the evaluation of the expression may modify only its non-constant operands. Operands that are constant must not be modified.
In the standard library, the following concepts are allowed to have non equality-preserving requires expressions:
- output_iterator
- indirectly_writable
- invocable
- weakly_incrementable
- range