Ataraxia through Epoché

Mar 19, 2026

[C++] uintN_t guaranteed to be exactly N bits

uintN_t is guaranteed to be exactly N bits with no padding if it exists, so

sizeof(uint8_t) != sizeof(int32_t)

is guaranteed, if they both exist.

Mar 3, 2026

[C++] define function in the header works better for LTO

[C++] Elaborated type specifier

Reference:
https://en.cppreference.com/w/cpp/language/elaborated_type_specifier.html

https://vsdmars.blogspot.com/2014/02/c11-extended-friend-declaration.html


class T
{
public:
    class U;
private:
    int U;
};
 
int main()
{
    int T;
    T t; // error: the local variable T is found
    class T t; // OK: finds ::T, the local variable T is ignored
    T::U* u; // error: lookup of T::U finds the private data member
    class T::U* u; // OK: the data member is ignored
}

template<typename T>
struct Node
{
    struct Node* Next; // OK: lookup of Node finds the injected-class-name
    struct Data* Data; // OK: declares type Data at global scope
                       // and also declares the data member Data
    friend class ::List; // error: cannot introduce a qualified name
    enum Kind* kind; // error: cannot introduce an enum
};
 
Data* p; // OK: struct Data has been declared

template<typename T>
class Node
{
    friend class T; // error: type parameter cannot appear in an elaborated type specifier;
                    // note that similar declaration `friend T;` is OK.
};
 
class A {};
enum b { f, t };
 
int main()
{
    class A a; // OK: equivalent to 'A a;'
    enum b flag; // OK: equivalent to 'b flag;'
}

enum class E { a, b };
enum E x = E::a; // OK
enum class E y = E::b; // error: 'enum class' cannot introduce an elaborated type specifier
 
struct A {};
class A a; // OK

Feb 3, 2026

Jan 14, 2026

[C++] Cache-friendly

Just always having int as default if usage is under contract.

Reference:

Object Lifetimes reading minute
alignas(int) unsigned char buffer[sizeof(int)]; // used for provenance contract.

One byte types:

Benchmark shows

char8_t is faster due to optimizer.

this is that a pointer of char, unsigned char, or std::byte type can point to anywhere, thus

data.size() could potentially being modified inside the loop, thus accessing data.size() is necessary.
As for char8_t has no this privilege, thus data.size() is fixed and stored inside a register.

fix: old trick, store the data.size() inside the register ourselves.

or just:

Bitfields is good for saving memory but accessing them are slow (bitmask other values).
also, under meta-programming, type inference of those bitfields type has to be casted.

Reference:

https://muxup.com/2023q4/storing-data-in-pointers

When choosing a container type to use, consider:

L2 cache is used for data and instructions.

i.e. how we code matters the speed, even the results are the same.

e.g.
slower, due to branch and if branch succeed, instructions are replaced in the cache.

faster, due to branch rarely happen, and REPEAT64 has much more instruction than simple count+=d

Summary

Reference:

[Design][Software Engineering][C++] Using/Design type effectively - Ben Deane@Blizzard

[Pacific++ 2018][re-read] "Designing for Efficient Cache Usage" - Scott McMillan

[C++] memory model in depth

Always start with DoD; focus on algorithm.

DoD is more packed with data layout, good for caching.

SoA

Reference:

Data Storage in Entity Component Systems - Mathieu Ropert - Meeting C++ 2023

Summary

Align with the cache line to avoid false sharing

CS 101, keep false sharing affect in mind.

All-in-all summary

Avoid long branches:


// BAD
void process_data(Data* d) {
    if (!d) {
        // LONG BRANCH / COLD CODE
        // Imagine 100 lines of logging, stack tracing, 
        // and complex error recovery logic here.
        log_error("Null pointer detected...");
        cleanup_subsystems();
        notify_admin_via_snmp();
        throw std::runtime_code("Fatal Error");
    }

    // This "hot" code is now physically far away from the 'if' 
    // check in the compiled binary.
    d->value += 42; 
    d->status = Ready;
}


// Better, avoid I-Cache miss

// Move cold logic to a non-inline function
[[noreturn]] void handle_fatal_error() {
    log_error("Null pointer detected...");
    cleanup_subsystems();
    notify_admin_via_snmp();
    throw std::runtime_code("Fatal Error");
}

void process_data_optimized(Data* d) {
    // C++20 [[likely]]/[[unlikely]] attributes guide the compiler
    if (!d) [[unlikely]] {
        handle_fatal_error(); // A jump to a far-away location happens ONLY on error
    }

    // This code is now physically adjacent to the 'if' check
    d->value += 42; 
    d->status = Ready;
}

Jan 5, 2026

[6.S191][note] Introduction to Deep Learning

Why activation functions?

introduce non-linearities into the network.

The loss of network measures the cost incurred from incorrect predictions.

the empirical loss measures the total loss over entire dataset.

Cross entropy loss can be used with models that output a probability between 0 and 1

f() here is the activation function, returns [0, 1].

- used due to log with probability is negative, thus we apply - to flip to positive value.

Mean squared error loss can be used with regression models that output continuous real numbers.

Training the network

Focus on Loss Optimization

We want to find the network weights that achieve the lowest lose.

Here's the gradient decent comes in. (back propagation)

i.e. every weight.

Thus, by using derivative(i.e. m, slope), we can see if the m is >0 or < 0. If > 0, we lower the weight. < 0, we increase the weight.

So how much the steps to take? (learning rate)
set too small, slow, too large, unstable.

Methods:

https://www.ruder.io/optimizing-gradient-descent/

Jan 4, 2026

[AI math] Proximal Policy Optimization (PPO)

Proximal Policy Optimization(PPO 2017)

It tries to solve destructive updates issue.

In Reinforcement Learning, if you take a step that is too large, the agent "falls off a cliff." It adopts a policy that causes it to fail immediately, which generates bad data, which leads to even worse updates. It never recovers.

The Deep Dive: The Mechanics of "Clipping"

The genius of PPO lies in its pessimistic view of the world. It doesn't trust its own recent success.

When the agent collects data, it calculates the Ratio between the new policy and the old policy for a specific action:

The "Pessimistic" Update

Standard algorithms simply push this ratio as high as possible for good actions. PPO says: "Stop."

It looks at the Advantage (how good the action was) and applies a Clip:
If the action was good (Positive Advantage): PPO increases the probability of doing it again, BUT it caps the Ratio at roughly 1.2 (increasing probability by 20%).
Even if the math says "this action was amazing, boost it by 500%," PPO clips the update at 20%. It refuses to overcommit to a single piece of evidence.
If the action was bad (Negative Advantage):
PPO decreases the probability, BUT it caps the decrease at 0.8. It refuses to completely destroy the possibility of taking that action again based on one failure.

This results in a "Trust Region"—a safe zone around the current behavior where the agent is free to learn, but beyond which it is physically prevented from moving.

Usage

Reinforcement Learning from Human Feedback (RLHF).

The Setup

The Agent (Policy): The LLM itself (e.g., Llama-2-70b). Its "Action" is choosing the next token (word) in a sentence.
The Environment: The conversation window.
The Reward Model: A separate, smaller AI that has been trained to mimic a human grader. It looks at a full sentence and outputs a score (e.g., 7/10 for helpfulness).

The PPO Training Loop

Rollout (Data Collection): The LLM generates a response to a prompt like "Explain gravity."
LLM: "Gravity is a force..." It records the probability of every token it chose (e.g., it was 99% sure about "force").
Advantage Calculation: The Reward Model looks at the finished sentence and gives it a score (Reward). The PPO algorithm compares this score to what it expected to get.

Scenario: The model usually writes boring answers (Expected Reward: 5). This answer was witty and accurate (Actual Reward: 8).
Advantage: +3. This was a "better than expected" sequence of actions.

The PPO Update (The Critical Step): We now update the LLM's neural weights to make those specific tokens more likely next time.

Without PPO: The model might see that high reward and drastically boost the probability of those specific words, potentially overfitting and making the model speak in repetitive loops or gibberish just to chase that score.
With PPO: The algorithm checks the Ratio.
"Did we already increase the probability of the word 'force' by 20% compared to the old model?"
Yes? -> CLIP. Stop updating. Do not push the weights further.

PPO is not just an optimization method; it is a constraint method.
It allows AI to run training loops that would otherwise be unstable.

Policy

Policy is a function that maps a State (what the agent sees) to an Action (what the agent does).

In mathematical formulas, the policy is almost always represented by the Greek letter π.

a = π(s)

s: State(input)

π: Policy(logic)

a: Action(output)

The Two Types of Policies

Deterministic Policy
This policy has no randomness. For a specific situation, it will always output the exact same action.
Example: A chess bot. If the board is in arrangement X, it always moves the Knight to E5
a = π(s)
Stochastic Policy (Used in PPO)
This policy deals in probabilities. Instead of outputting a single action, it outputs a probability distribution over all possible actions. The agent then samples from this distribution.
Example: A robot learning to walk. If it is tilting left, it might be 80% likely to step right and 20% likely to wave its arm. It rolls the dice to decide.
π(a | s) : read as "the probability of taking action a given state s

Stochastic policies are essential for learning. If a policy is 100% deterministic, the agent never tries anything new; it just repeats the same mistakes. By adding randomness (probabilities), the agent explores different actions, which allows it to discover better strategies.

In modern AI (like PPO), the "Policy" is a Neural Network.

Input: The network receives the State (e.g., the pixels of a video game screen, or the text of a user prompt).
Hidden Layers: The network processes this information.
Output: The network outputs numbers representing the probability of each action.
Action 1 (Jump): 0.1
Action 2 (Run): 0.8
Action 3 (Duck): 0.1

When we say we are "training the policy," we are simply adjusting the weights of this neural network so that it assigns higher probabilities to "good" actions and lower probabilities to "bad" actions.

Value Function (V)

often called the Critic, is the partner to the Policy (the Actor).

While the Policy answers "What should I do?", the Value Function answers: "How good is it to be in this situation?"

1. The Core Concept: Prediction
Imagine you are playing a video game.
The Policy (Actor): Looks at the screen and presses the "Jump" button.
The Value Function (Critic): Looks at the screen and says, "We currently have a 70% chance of winning."
The Critic doesn't play the game; it predicts the outcome.
The Math V(s)
The Value function maps a State s to a single number (Scalar).
Why PPO Needs the Critic (The Advantage)
This is the most important part. PPO updates the Policy based on the Advantage. The Advantage is calculated using the Value Function.
The Logic: To know if an action was "good," we can't just look at the reward.
Example: If you get a reward of +10, is that good?
If you usually get +1, then +10 is amazing.
If you usually get +100, then +10 is terrible.
The Value Function provides that baseline "usual" expectation.
How the Critic Learns
While the Actor learns to maximize reward, the Critic learns to be a better predictor.
It uses a simple regression loss function (Mean Squared Error):

At every step, the Critic looks at the reward actually received and updates itself: "I predicted this state was worth 5 points, but we actually got 8. I should update my weights to predict higher next time."
The PPO Team:
Actor (Policy π): "I think we should move Left."
Environment: (Agent moves Left, gets +5 reward, lands in new state).
Critic (Value V): "I expected a reward of +2. You got +5. That was a great move!"
PPO Update: "Since the Critic said that was 'great' (Positive Advantage), let's adjust the Actor's weights to make 'Move Left' more likely next time—but clip it so we don't go crazy."

Mar 19, 2026

[C++] uintN_t guaranteed to be exactly N bits

Mar 3, 2026

[C++] define function in the header works better for LTO

[C++] Elaborated type specifier

Feb 3, 2026

[godbo.lt architecture]

Jan 14, 2026

[C++] Cache-friendly

Jan 5, 2026

[6.S191][note] Introduction to Deep Learning

Jan 4, 2026

[AI math] Proximal Policy Optimization (PPO)

Proximal Policy Optimization(PPO 2017)

The Deep Dive: The Mechanics of "Clipping"

The "Pessimistic" Update

Usage

Reinforcement Learning from Human Feedback (RLHF).

The Setup

The PPO Training Loop

Policy

The Two Types of Policies

Value Function (V)