Ataraxia through Epoché
Symphilosophein: "The ardent press forward; the cautious have things they will not do." (Analects)
Mar 3, 2026
[C++] Elaborated type specifier
Reference:
https://en.cppreference.com/w/cpp/language/elaborated_type_specifier.html
https://vsdmars.blogspot.com/2014/02/c11-extended-friend-declaration.html
class T
{
public:
    class U;
private:
    int U;
};

int main()
{
    int T;
    T t;            // error: the local variable T is found
    class T t;      // OK: finds ::T, the local variable T is ignored
    T::U* u;        // error: lookup of T::U finds the private data member
    class T::U* u;  // OK: the data member is ignored
}
template<typename T>
struct Node
{
    struct Node* Next;  // OK: lookup of Node finds the injected-class-name
    struct Data* Data;  // OK: declares type Data at global scope
                        // and also declares the data member Data
    friend class ::List;  // error: cannot introduce a qualified name
    enum Kind* kind;      // error: cannot introduce an enum
};

Data* p;  // OK: struct Data has been declared
template<typename T>
class Node
{
    friend class T;  // error: a type parameter cannot appear in an elaborated type specifier;
                     // note that the similar declaration `friend T;` is OK.
};
class A {};
enum b { f, t };

int main()
{
    class A a;    // OK: equivalent to 'A a;'
    enum b flag;  // OK: equivalent to 'b flag;'
}

enum class E { a, b };
enum E x = E::a;        // OK
enum class E y = E::b;  // error: 'enum class' cannot introduce an elaborated type specifier

struct A {};
class A a;  // OK
Feb 3, 2026
Jan 14, 2026
[C++] Cache-friendly
Just always use int as the default if the usage is under contract.
Reference:
Object Lifetimes reading minute
alignas(int) unsigned char buffer[sizeof(int)]; // used for the provenance contract.
One byte types:
char8_t is faster thanks to the optimizer: a pointer of char, unsigned char, or std::byte type can point anywhere (it may alias any object), so data.size() could potentially be modified inside the loop, and re-reading data.size() on every iteration is necessary.
char8_t has no such aliasing privilege, so data.size() is fixed and can be kept in a register.
Also, under metaprogramming, the type inferred for such bit fields has to be cast explicitly.
Reference:
L2 cache is used for data and instructions.
e.g. long branches are slower: if the branch succeeds, instructions in the cache are replaced.
Reference:
Always start with DoD (data-oriented design); focus on the algorithm.
Reference:
Align with the cache line to avoid false sharing
All-in-all summary
Avoid long branches:
// BAD
void process_data(Data* d) {
    if (!d) {
        // LONG BRANCH / COLD CODE
        // Imagine 100 lines of logging, stack tracing,
        // and complex error recovery logic here.
        log_error("Null pointer detected...");
        cleanup_subsystems();
        notify_admin_via_snmp();
        throw std::runtime_error("Fatal Error");
    }
    // This "hot" code is now physically far away from the 'if'
    // check in the compiled binary.
    d->value += 42;
    d->status = Ready;
}
// Better, avoid I-Cache miss
// Move cold logic to a non-inline function
[[noreturn]] void handle_fatal_error() {
    log_error("Null pointer detected...");
    cleanup_subsystems();
    notify_admin_via_snmp();
    throw std::runtime_error("Fatal Error");
}

void process_data_optimized(Data* d) {
    // C++20 [[likely]]/[[unlikely]] attributes guide the compiler
    if (!d) [[unlikely]] {
        handle_fatal_error();  // a jump to a far-away location happens ONLY on error
    }
    // This code is now physically adjacent to the 'if' check
    d->value += 42;
    d->status = Ready;
}
Jan 5, 2026
[6.S191][note] Introduction to Deep Learning
Why activation functions?
introduce non-linearities into the network.
The loss of the network measures the cost incurred from incorrect predictions.
Cross-entropy loss can be used with models that output a probability between 0 and 1.
- The negative sign is applied because the log of a probability is negative; negating flips it to a positive value.
Mean squared error loss can be used with regression models that output continuous real numbers.
Training the network
Focus on Loss Optimization
We want to find the network weights that achieve the lowest loss.
This is where gradient descent comes in (backpropagation).
By using the derivative (i.e. m, the slope), we can see whether m > 0 or m < 0. If m > 0, we lower the weight; if m < 0, we increase it.
How big a step do we take? (the learning rate)
Set it too small and training is slow; too large and it is unstable.
Methods:
Jan 4, 2026
[AI math] Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO, 2017)
It tries to solve the destructive-update issue.
In Reinforcement Learning, if you take a step that is too large, the agent "falls off a cliff." It adopts a policy that causes it to fail immediately, which generates bad data, which leads to even worse updates. It never recovers.
The Deep Dive: The Mechanics of "Clipping"
The genius of PPO lies in its pessimistic view of the world. It doesn't trust its own recent success.
When the agent collects data, it calculates the Ratio between the new policy and the old policy for a specific action:
r = π_new(a | s) / π_old(a | s)
The "Pessimistic" Update
Standard algorithms simply push this ratio as high as possible for good actions. PPO says: "Stop."
It looks at the Advantage (how good the action was) and applies a Clip:
- If the action was good (Positive Advantage): PPO increases the probability of doing it again, BUT it caps the Ratio at roughly 1.2 (increasing probability by 20%). Even if the math says "this action was amazing, boost it by 500%," PPO clips the update at 20%. It refuses to overcommit to a single piece of evidence.
- If the action was bad (Negative Advantage): PPO decreases the probability, BUT it caps the decrease at 0.8. It refuses to completely destroy the possibility of taking that action again based on one failure.
This results in a "Trust Region"—a safe zone around the current behavior where the agent is free to learn, but beyond which it is physically prevented from moving.
Usage
Reinforcement Learning from Human Feedback (RLHF).
The Setup
- The Agent (Policy): The LLM itself (e.g., Llama-2-70b). Its "Action" is choosing the next token (word) in a sentence.
- The Environment: The conversation window.
- The Reward Model: A separate, smaller AI that has been trained to mimic a human grader. It looks at a full sentence and outputs a score (e.g., 7/10 for helpfulness).
The PPO Training Loop
- Rollout (Data Collection): The LLM generates a response to a prompt like "Explain gravity." LLM: "Gravity is a force..." It records the probability of every token it chose (e.g., it was 99% sure about "force").
- Advantage Calculation: The Reward Model looks at the finished sentence and gives it a score (Reward). The PPO algorithm compares this score to what it expected to get.
- Scenario: The model usually writes boring answers (Expected Reward: 5). This answer was witty and accurate (Actual Reward: 8).
- Advantage: +3. This was a "better than expected" sequence of actions.
- The PPO Update (The Critical Step): We now update the LLM's neural weights to make those specific tokens more likely next time.
- Without PPO: The model might see that high reward and drastically boost the probability of those specific words, potentially overfitting and making the model speak in repetitive loops or gibberish just to chase that score.
- With PPO: The algorithm checks the Ratio.
"Did we already increase the probability of the word 'force' by 20% compared to the old model?"
Yes? -> CLIP. Stop updating. Do not push the weights further.
PPO is not just an optimization method; it is a constraint method.
It allows AI to run training loops that would otherwise be unstable.
Policy
Policy is a function that maps a State (what the agent sees) to an Action (what the agent does).
In mathematical formulas, the policy is almost always represented by the Greek letter π.
a = π(s)
s: State (input)
π: Policy (logic)
a: Action (output)
The Two Types of Policies
- Deterministic Policy
This policy has no randomness. For a specific situation, it will always output the exact same action.
Example: A chess bot. If the board is in arrangement X, it always moves the Knight to E5.
a = π(s)
- Stochastic Policy (Used in PPO)
This policy deals in probabilities. Instead of outputting a single action, it outputs a probability distribution over all possible actions. The agent then samples from this distribution.
Example: A robot learning to walk. If it is tilting left, it might be 80% likely to step right and 20% likely to wave its arm. It rolls the dice to decide.
π(a | s): read as "the probability of taking action a given state s."
In modern AI (like PPO), the "Policy" is a Neural Network.
- Input: The network receives the State (e.g., the pixels of a video game screen, or the text of a user prompt).
- Hidden Layers: The network processes this information.
- Output: The network outputs numbers representing the probability of each action.
Action 1 (Jump): 0.1
Action 2 (Run): 0.8
Action 3 (Duck): 0.1
When we say we are "training the policy," we are simply adjusting the weights of this neural network so that it assigns higher probabilities to "good" actions and lower probabilities to "bad" actions.
Value Function (V)
Often called the Critic, the Value Function is the partner to the Policy (the Actor).
While the Policy answers "What should I do?", the Value Function answers: "How good is it to be in this situation?"
- The Core Concept: Prediction
Imagine you are playing a video game.
The Policy (Actor): Looks at the screen and presses the "Jump" button.
The Value Function (Critic): Looks at the screen and says, "We currently have a 70% chance of winning."
The Critic doesn't play the game; it predicts the outcome.
- The Math V(s)
The Value function maps a State s to a single number (Scalar).
- Why PPO Needs the Critic (The Advantage)
This is the most important part. PPO updates the Policy based on the Advantage. The Advantage is calculated using the Value Function.
The Logic: To know if an action was "good," we can't just look at the reward.
Example: If you get a reward of +10, is that good?
If you usually get +1, then +10 is amazing.
If you usually get +100, then +10 is terrible.
The Value Function provides that baseline "usual" expectation.
- How the Critic Learns
While the Actor learns to maximize reward, the Critic learns to be a better predictor.
- The PPO Team:
Actor (Policy π): "I think we should move Left."
Environment: (Agent moves Left, gets +5 reward, lands in new state).
Critic (Value V): "I expected a reward of +2. You got +5. That was a great move!"
PPO Update: "Since the Critic said that was 'great' (Positive Advantage), let's adjust the Actor's weights to make 'Move Left' more likely next time—but clip it so we don't go crazy."
Jan 3, 2026