Jul 1, 2026

[benchmark] using rdtsc counter

The RDTSC (Read Time-Stamp Counter) instruction returns the value of the processor's 64-bit time-stamp counter, which nominally counts ticks since the processor was last reset.

On modern processors, RDTSC does not count actual core clock cycles, whose rate changes with frequency scaling (P-states) and Turbo Boost and which stop entirely in idle states (C-states). Instead, it relies on a feature called Invariant TSC.

Invariant TSC: The counter increments at a constant, fixed frequency (usually the base/nominal frequency of the processor), regardless of the current operational frequency or power state.
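Whether a given CPU advertises Invariant TSC can be checked with CPUID. The sketch below assumes GCC or Clang on x86 (it uses the `__get_cpuid` helper from `<cpuid.h>`); the function names are hypothetical and chosen only for illustration. The flag is bit 8 of EDX in CPUID leaf 0x80000007.

```c
#include <cpuid.h>   /* GCC/Clang helper header for the CPUID instruction */
#include <stdint.h>

/* Decode the Invariant TSC flag: CPUID leaf 0x80000007, EDX bit 8. */
static int edx_has_invariant_tsc(uint32_t edx) {
    return (int)((edx >> 8) & 1u);
}

/* Returns 1 if the CPU reports an invariant TSC, and 0 if it does not
   or if the extended leaf is unavailable. */
static int cpu_has_invariant_tsc(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx_has_invariant_tsc(edx);
}
```

On Linux the same flag usually shows up as `constant_tsc` / `nonstop_tsc` in `/proc/cpuinfo`, which is a quicker check when no code is needed.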

Implication: RDTSC measures reference time, not the core clock cycles actually executed. If the CPU turbos up to 5GHz but its base frequency is 2.5GHz, RDTSC still increments at the 2.5GHz rate, so one TSC tick corresponds to two core cycles while the turbo lasts.
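To make the arithmetic concrete, here is a minimal sketch of converting a tick delta into nanoseconds and into an estimated core-cycle count. Both helper names are hypothetical, and both frequencies are inputs the caller has to supply; the core frequency in particular is only an assumed average over the measured interval.

```c
#include <stdint.h>

/* Convert a TSC tick delta to nanoseconds.
   tsc_hz is the TSC frequency in Hz, e.g. 2500000000 for the 2.5GHz case. */
static double tsc_ticks_to_ns(uint64_t ticks, uint64_t tsc_hz) {
    return (double)ticks * 1e9 / (double)tsc_hz;
}

/* Estimate the core cycles that elapsed during `ticks`, given an assumed
   average core frequency (core_hz) for the measured interval. */
static double tsc_ticks_to_core_cycles(uint64_t ticks, uint64_t tsc_hz,
                                       uint64_t core_hz) {
    return (double)ticks * (double)core_hz / (double)tsc_hz;
}
```

With the numbers from the text, 2500 ticks at a 2.5GHz TSC rate is 1000 ns, and at a sustained 5GHz core clock that interval covers about 5000 core cycles.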


Because modern CPUs execute instructions out-of-order, RDTSC itself can execute earlier or later than its position in the program suggests, drifting ahead of or behind the block of code we are trying to benchmark.

To get an accurate cycle count for a specific code snippet, we must fence the instruction.

RDTSCP: A partially ordered variant. It waits until all previous instructions have executed before reading the counter, but it is not a serializing instruction and does not prevent subsequent instructions from starting before the read.

LFENCE; RDTSC: A widely used pattern for benchmarking. Placing an LFENCE (Load Fence) right before RDTSC makes the CPU wait until all earlier instructions have completed before the counter is read, so the start timestamp is not taken while earlier work is still in flight.


The Fence Strategy (Start Timer)

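lfence (Load Fence): Despite its name, on Intel processors this instruction orders instruction execution, not just loads. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until it completes. It is not a fully serializing instruction in the architectural sense (as CPUID is), and on AMD processors this behavior depends on LFENCE being configured as dispatch-serializing, which current operating systems typically do.

The Goal: Putting lfence before rdtsc prevents the CPU from executing rdtsc early (out-of-order). It guarantees that everything preceding the benchmark has truly finished before the timer starts. Note that rdtsc is not itself ordered against the instructions that follow it: if the measured code must not begin until the counter has been read, Intel's documentation suggests adding a second lfence immediately after rdtsc.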

The Register Mapping

The x86 rdtsc instruction reads the 64-bit Time-Stamp Counter and splits the value across two 32-bit registers:
EDX gets the high-order 32 bits.
EAX gets the low-order 32 bits.
The inline assembly output constraints capture this:
"=a"(lo): Tells the compiler to bind the value in EAX (a) to the C variable lo.
"=d"(hi): Tells the compiler to bind the value in EDX (d) to the C variable hi.
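#include <stdint.h>

/* Start timestamp: lfence first, so earlier work has finished before the read. */
uint64_t rdtsc_start(void) {
    uint32_t lo; uint32_t hi;
    __asm__ __volatile__(
        "lfence\n\t"            /* wait for all prior instructions to complete */
        "rdtsc"                 /* EDX:EAX = time-stamp counter */
        : "=a"(lo), "=d"(hi)    /* outputs: EAX -> lo, EDX -> hi */
        :                       /* no inputs */
        : "memory");            /* compiler barrier for memory accesses */
    return ((uint64_t)hi << 32) | lo;
}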
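The return statement is worth a second look: `hi` has to be widened to 64 bits before the shift, because shifting a 32-bit value by 32 is undefined behavior in C. The helper below is a hypothetical stand-alone version of that one line, pulled out only so the recombination can be checked with known values.

```c
#include <stdint.h>

/* Rebuild the 64-bit counter from the two 32-bit halves.
   The cast comes first: (hi << 32) on a uint32_t would be undefined. */
static uint64_t combine_edx_eax(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}
```

For example, `hi = 0x1` and `lo = 0x2` combine to `0x100000002`.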

The Fence Strategy (Stop Timer)

rdtscp (Read Time-Stamp Counter and Processor ID): Unlike rdtsc, rdtscp is partially ordered by the hardware. It waits until all prior instructions have executed (and all prior loads are globally visible) before it reads the counter. This ensures that the code you are benchmarking has finished executing before the stop timestamp is taken.

lfence afterward: While rdtscp prevents earlier instructions from slipping past it downward, it does not prevent later instructions from leaking upward (executing early before the timestamp is taken). Adding lfence right after rdtscp pins the timestamp in place, ensuring that nothing belonging after your benchmark executes before rdtscp completes.

The Clobber List Change
"rcx": In addition to writing the 64-bit timestamp to EDX:EAX, the rdtscp instruction also loads ECX with the contents of the IA32_TSC_AUX MSR, a value that the operating system typically initializes with a processor ID. Because the inline assembly overwrites RCX as a side effect, you must declare "rcx" in the clobber list so the compiler knows its previous contents are ruined.
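/* Stop timestamp: rdtscp waits for the measured code, lfence holds back what follows. */
uint64_t rdtsc_end(void) {
    uint32_t lo; uint32_t hi;
    __asm__ __volatile__(
        "rdtscp\n\t"            /* EDX:EAX = time-stamp counter, ECX = IA32_TSC_AUX */
        "lfence"                /* later instructions wait until rdtscp is done */
        : "=a"(lo), "=d"(hi)    /* outputs: EAX -> lo, EDX -> hi */
        :                       /* no inputs */
        : "rcx", "memory");     /* rdtscp overwrites RCX; "memory" is a compiler barrier */
    return ((uint64_t)hi << 32) | lo;
}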
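Instead of discarding ECX as a clobber, the value can be captured as a third output. The variant below is a hypothetical alternative, not a replacement for the function above: it binds ECX with the "=c" constraint, so "rcx" must no longer appear in the clobber list. Comparing the value taken at the start and at the end of a measurement is one way to notice that the thread migrated to another core in between. On Linux the value usually encodes the CPU number, but that is an operating system convention, not a hardware guarantee.

```c
#include <stdint.h>

/* Hypothetical variant of rdtsc_end() that also returns IA32_TSC_AUX. */
static uint64_t rdtsc_end_aux(uint32_t *aux) {
    uint32_t lo, hi, cx;
    __asm__ __volatile__(
        "rdtscp\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi), "=c"(cx)  /* ECX is now an output, not a clobber */
        :
        : "memory");
    *aux = cx;
    return ((uint64_t)hi << 32) | lo;
}
```

If two readings report different `aux` values, the tick delta between them compares counters read on two different cores and is best discarded.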


Why __volatile__?

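Without __volatile__, the compiler treats an asm statement with outputs as a pure function of its inputs. Since these blocks have no inputs, it could merge two timestamp reads into one, hoist a read out of a loop, or delete a read whose result looks unused.
__volatile__ prevents the block from being optimized away or merged with an identical one. It does not by itself pin the block in place: GCC documents that even a volatile asm can be moved relative to other code, which is one reason the "memory" clobber is there as well.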


The "memory" Clobber

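The "memory" token tells the compiler that this assembly block may read or write arbitrary memory. This creates a compiler-level memory barrier: values cached in registers must be written back before the block and reloaded afterward, and loads and stores cannot be moved across it. It does not constrain computations that live purely in registers, so the measured code needs to read its inputs from memory or write its results to memory for the barrier to hold it in place. The clobber emits no instruction and has no effect on the CPU's own reordering; that is the job of lfence.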
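The compiler-side effect can be isolated with an empty asm statement that carries only the clobber. This is a minimal sketch assuming GCC or Clang; `compiler_barrier`, `sink`, and `sum_between_barriers` are hypothetical names used for illustration.

```c
#include <stdint.h>

/* Emits no machine instruction; it only constrains the compiler. */
static inline void compiler_barrier(void) {
    __asm__ __volatile__("" ::: "memory");
}

uint64_t sink;  /* hypothetical global: gives the result a home in memory */

uint64_t sum_between_barriers(uint64_t n) {
    compiler_barrier();          /* region start */
    uint64_t acc = 0;
    for (uint64_t i = 1; i <= n; i++)
        acc += i;                /* register-only work: may still be moved or folded */
    sink = acc;                  /* store: has to be issued before the next barrier */
    compiler_barrier();          /* region end */
    return sink;                 /* read back from memory after the barrier */
}
```

The store to `sink` is what ties the computation to the region. Without it, the loop would be register-only work that the compiler is free to schedule elsewhere or evaluate at compile time.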



[ Pre-benchmarking Code ]
----------------------------------- <- lfence forces completion of above
RDTSC (Start Timer)
===================================
[ Critical Code Block to Measure ] <- Cannot start before the leading lfence completes
=================================== <- Cannot leak downwards (rdtscp waits for it)
RDTSCP (Stop Timer)
----------------------------------- <- lfence holds back later code until rdtscp is done
[ Post-benchmarking Code ]
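Putting the pieces together, the sandwich in the diagram can be sketched as a small harness. The two timer functions repeat the definitions from earlier in the post so the block compiles on its own; `work`, `empty`, and `measure_min_ticks` are hypothetical names. Taking the minimum over several runs is an assumption of this sketch (a common way to filter out interrupts and cache misses), not something the fences require.

```c
#include <stdint.h>

static inline uint64_t rdtsc_start(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\trdtsc"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

static inline uint64_t rdtsc_end(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtscp\n\tlfence"
                         : "=a"(lo), "=d"(hi) : : "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical workload; volatile keeps the loop from being optimized away. */
static void work(void) {
    volatile uint64_t acc = 0;
    for (int i = 0; i < 1000; i++)
        acc += (uint64_t)i;
}

/* Empty region: timing it gives a rough estimate of the timer overhead. */
static void empty(void) {}

/* Smallest tick delta observed over `runs` repetitions (runs must be > 0). */
static uint64_t measure_min_ticks(void (*fn)(void), int runs) {
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        uint64_t t0 = rdtsc_start();
        fn();                                /* critical code block */
        uint64_t t1 = rdtsc_end();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

A typical use would be to call `measure_min_ticks(empty, 1000)` once to estimate the fixed cost of the two timer calls, then subtract that from `measure_min_ticks(work, 1000)`. The result is in TSC ticks, not core cycles.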
