Jul 1, 2026

[benchmark] using rdtsc counter

RDTSC (Read Time-Stamp Counter) instruction returns the number of clock cycles since the processor was last reset.

On modern processors, RDTSC does not count actual, variable CPU clock cycles affected by power-saving states (C-states) or Turbo Boost. Instead, it uses a feature called Invariant TSC.

Invariant TSC: The counter increments at a constant, fixed frequency (usually the base/nominal frequency of the processor), regardless of the current operational frequency or power state.

Implication: Because it measures reference time rather than actual executed core clock cycles, if CPU turbos up to 5GHz but its base frequency is 2.5GHz, RDTSC will still increment at the 2.5GHz rate.


Because modern CPUs execute instructions out-of-order, RDTSC can float ahead or behind the block of code we are trying to benchmark.

To get an accurate cycle count for a specific code snippet, we must fence the instruction.

RDTSCP: A serialized variant that waits until all previous instructions have executed before reading the counter (though it doesn't prevent subsequent instructions from moving above it).

LFENCE; RDTSC: The industry-standard way to benchmark. Placing an LFENCE (Load Fence) right before RDTSC forces the CPU to serialize execution, ensuring measure exactly what happens between fences.


The Fence Strategy

lfence (Load Fence): This instruction acts as a serializing barrier for instruction execution. It forces the CPU to wait until all previous instructions in the pipeline have completed execution before it allows any subsequent instructions to begin.

The Goal: Putting lfence before rdtsc prevents the CPU from executing rdtsc early (out-of-order). It guarantees that everything you wanted to measure before this benchmark has truly finished before the timer starts.

The Register Mapping

The x86 rdtsc instruction reads the 64-bit Time-Stamp Counter and splits the value across two 32-bit registers:
EDX gets the high-order 32 bits.
EAX gets the low-order 32 bits.
The inline assembly output constraints capture this:
"=a"(lo): Tells the compiler to bind the value in EAX (a) to the C variable lo.
"=d"(hi): Tells the compiler to bind the value in EDX (d) to the C variable hi.
uint64_t rdtsc_start() {
    uint32_t lo; uint32_t hi;
    __asm__ __volatile__(
        "lfence\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "memory");
    return ((uint64_t)hi << 32) | lo;
}

The Fence Strategy

rdtscp (Read Time-Stamp Counter and Processor ID): Unlike rdtsc, rdtscp is natively partially serialized. The hardware guarantees that rdtscp will wait for all prior instructions to complete before executing. This ensures that the code you are benchmarking has fully finished before the stop-timer runs.

lfence afterward: While rdtscp prevents earlier instructions from slipping past it downward, it does not prevent later instructions from leaking upward (executing early before the timestamp is taken). Adding lfence right after rdtscp pins the timestamp in place, ensuring that nothing belonging after your benchmark executes before rdtscp completes.

The Clobber List Change
"rcx": In addition to writing the 64-bit timestamp to EDX:EAX, the rdtscp instruction also writes the hardware processor ID (core ID) into the ECX register. Because the inline assembly overwrites RCX as a side effect, you must declare "rcx" in the clobber list so the compiler knows its previous contents are ruined.
uint64_t rdtsc_end() {
    uint32_t lo; uint32_t hi;
    __asm__ __volatile__(
        "rdtscp\n\t"
        "lfence"
        : "=a"(lo), "=d"(hi)
        :
        : "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}


Why __volatile__?

Without __volatile__, the compiler's optimization passes might conclude that reading a hardware counter is a pure function or that its order doesn't matter relative to adjacent C statements. 
__volatile__ forces the compiler to leave the assembly block exactly where you put it and prevents it from being optimized away.


The "memory" Clobber

The "memory" token tells the compiler that this assembly block read or wrote to arbitrary locations in RAM. This creates a compiler-level memory fence, forcing the compiler to flush registers back to memory before the block and reload them afterward. This stops the compiler from scheduling code movements across the boundary.



[ Pre-benchmarking Code ]
----------------------------------- <- lfence forces completion of above
RDTSC (Start Timer)
===================================
[ Critical Code Block to Measure ] <- Cannot leak upwards (lfence blocks it)
=================================== <- Cannot leak downwards (rdtscp blocks it)
RDTSCP (Stop Timer)
----------------------------------- <- lfence forces completion of rdtscp
[ Post-benchmarking Code ]

Jun 30, 2026

[Design] Cache friendly design, CPU/Memory level

Reference:
What do you mean by "Cache Friendly"? - Björn Fahller
What Every Programmer Should Know About Memory

Simplistic model of cache behaviour

Includes

  • The cache is small
  • and consists of fixed size lines
  • and is very very fast when hit
  • and missing is very slow

Excludes

  • Multiple levels of caches
  • Associativity
  • Threading

All models are wrong, but some are useful


$ valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes --branch-sim=yes
$ perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-Load-misses --call-graph=dwarf
$ perf record -e cycles,instructions,L1-dcache-loads,L1-dcache-Load-misses --call-graph=dwarf
$ pert report
$ hotspot perf.data

macos:
$ samply


Rule of thumb

follow/chasing pointer -> cache miss.

So,
Get rid of pointers.
Use std::vector (contiguous memory) vs. prev/next pointers.
std::vector be aware of memmove/memcpy.
memmove cause cache miss; if possible, memcpy.
Thus try to move less data.

binary tree/search still chasing pointer, 
can we get Log(n) lookup/insert without chasing pointers?
Heap. Use contiguous memory(e.g. std::vector) to implement.

But the heap is not searchable. Need extra contiguous memory to store the elements in order;
and the element holds the index to the contigous memory. Thus when the smallest heap
is processed, use the index to the contiguous memory and process that chunk.
Trade off: extra memory usage for speed and less cache miss.

Can we make it better?
B-Heap(binary heap implemented to keep subtrees in a single page); smaller heap fit into cache-line.
(not really better, depends on use case. i.e. if within a loop, the cache is hot, good.
otherwise, not worth the extra instructions due to cache is flushed anyways.)













Jun 16, 2026

[low latency] What is low latency (definition)

Reference:
https://github.com/crill-dev/crill

Low latency <-> High throughput

Server side is in the middle.

HFT, Audio, Game is more lean to the left.

2 Categories

  • Efficient programming
  • Programming for deterministic execution time
Most crucial thing: measuring!







Efficient programming

Microbenchmarks are tricky

  • Warm the cache
  • Randomise the heap
  • Measure release build with same compiler flags
  • But optimisations change what code you are measuring!
    • A lot of stuff can "constexpr away" in microbenchmark but not in production code


Writing efficient code requires...

  • Knowledge of the programming language
  • Knowledge of the libraries used
  • Knowledge of the compiler
  • Optimiser
    • ABI (Itanium, Microsoft, ...)
  • Knowledge of the hardware architecture
    • CPU architecture (Instruction set, pipeline, SIMD...)
    • Cache hierarchy (Registers, L1/2/3 cache)
    • Prefetcher, translation look-aside buffer
    • Branch predictor, branch target buffer


Avoid unnecessary work

  • Avoid unnecessary copies
  • Avoid unnecessary function calls / indirections
  • inline functions
  • prefer std::variant / CRTP / "deducing this" (since C++23)
    over virtual functions
  • Make as many decisions as possible at compile time
    • constexpr all the things
    • Template meta-programming
    • Generate lookup tables at compile time
  • Use efficient mathematical operations
    • Fast approximations
    • Use powers of two for sizes (compiler can replace division/mod with bit shifts)
    • Lookup tables
    • Many other techniques


Low-level bit manipulation in C++

  • Object lifetime rules
  • Aliasing rules
  • Alignment rules
  • Object representations
  • Value representations
  • std::bit_cast (since C++20)
  • Implicit-lifetime types (since C++20)
  • std::start_lifetime_as (since C++23)


The optimiser & undefined behaviour

  • memory-related UB
  • type system violation
  • out-of-bounds
  • lifetime violation (dangling pointers, use-after-free)
  • uninitialised variables
  • data races
  • signed integer overflow
  • infinite loops with no side effects


The optimiser & undefined behaviour


Sub-properties of reproducible & unsequenced

  • stateless: function that does not define mutable static or thread-local objects (nor do functions that are called by it)
  • effect-less: function that does not have observable side effects
  • idempotent: repeated evaluation gives the same result (hence may read global state)
  • independent: does not depend on other state than the arguments or constants (hence may write to globals)
  • reproducible: effect-less and idempotent
  • un-sequenced: stateless, effect-less, idempotent, and independent

Assume explained:

  int f(int i) {
  	[[assume(++i == 43)]]
    return i;
  }
  // function f can be optimized to '42'
  


C++ currently does not have a way to tell the compiler that the pointer are not alias:



CPU pipeline hazards

  • Branch hazard
  • Data hazard
  • Hardware hazard
    • Limited amount of adders/shifters
    • Limited amount of load/store units
    • (latest Intel CPUs: 3 loads + 2 stores per cycle)
    • → sometimes you can replace loads by shifts → increase bandwidth of every load using SIMD



[likely]] and [unlikely]]
https://vsdmars.blogspot.com/2016/01/likely-or-unlikely-easy-misleading.html

  • Does not affect branch predictor!
  • Can affect code layout
  • Have various of pitfalls
    • Aaron Ballman: "Don't use the [likely]] or [unlikely]] attributes"
    • Amir Kirsh & Tomer Vromen:"C++20 Likely and Unlikely:
    • A Journey Through Branch Prediction and Compiler Optimizations" (2022)

The 3 Types of Data Hazards

  • RAW (Read After Write): The most common type. Instruction B tries to read data before Instruction A finishes writing it.
  • WAR (Write After Read): Instruction B tries to write to a location before Instruction A has a chance to read it.
  • WAW (Write After Write): Instruction B tries to write to a destination before Instruction A writes to it, potentially leaving the wrong final value.

Rarely have to worry about this breaking your program because the system handles it automatically:

  • The Hardware (CPU): Modern CPUs use a trick called Data Forwarding (passing the result directly from the math unit to the next instruction before it's even written to memory) or they just stall the processor for a cycle to let the data catch up.
  • The Compiler: Optimization flags (like -O2 or -O3 in GCC/Clang) allow the C++ compiler to smartly rearrange your instructions so that independent calculations are placed in between dependent ones, keeping the pipeline moving smoothly without stalls.


SIMD

  • "Single instruction, multiple data"
  • CPU-specific: MMX, SSE 1/2/3/4, AVX, AVX2, AVX-512, AMX, NEON, SVE...
  • How to use?
  • Auto-vectorisation
  • Explicit vectorisation using SIMD libraries
    • Google Highway, xsimd, vectorclass, eve, std::simd proposal for C++26 (P1928)
  • Jeff Garland: "SIMD Libraries in C++" (CppNow 2023)
  • Writing intrinsics
  • Writing assembly
  • SWAR ("SIMD Within A Register")

Other SIMD considerations

  • If you know the exact target CPU:
    • use arch compiler flags
  • If you don't:
    • dynamic dispatch
    • function multiversioning (GCC/Clang only)

function multiversioning:

#include <iostream>

// 1. Version optimized for modern CPUs with AVX2
__attribute__((target("avx2"))) 
void process_data() {
    std::cout << "Running high-performance AVX2 vectorized version!\n";
    // Fast vector math goes here
}

// 2. Version optimized for older SSE4.2 capable CPUs
__attribute__((target("sse4.2"))) 
void process_data() {
    std::cout << "Running mid-tier SSE4.2 version!\n";
}

// 3. The mandatory default fallback version
__attribute__((target("default"))) 
void process_data() {
    std::cout << "Running generic baseline version.\n";
}

int main() {
    // You call it like a regular function. 
    // The resolution happens completely behind the scenes.
    process_data(); 
    return 0;
}  

Autovectorisation

  • Highly dependent on compiler
  • Only works with vectors/arrays of int, char, double etc. - no structs
  • Workaround: struct of arrays instead of array of structs
  • Traverse data linearly
  • for loop, not while loop
  • Number of iterations predetermined (ideally, known at compile time)
  • No data-dependent break, goto, etc.
  • No conditionals
  • No data dependencies between array elements
  • No aliasing




Cache miss

any optimization is pointless if there's a cache miss.
Minimise data cache misses
  • Data locality
  • Align data on cachelines
  • Concurrency aspect (true/false sharing)
  • Contiguous data traversal (also good for prefetcher)
  • "Almost always vector"
  • Cache-friendly associative containers
  • std::flat_set/std::flat_map (C++23), Abseil containers
  • Cache-friendly algorithms
    • Cache-friendly binary search
Minimise instruction cache misses
  • Consider generated code layout & alignment
  • Avoid branches
  • Avoid virtual functions
    • std::variant
    • Compile-time polymorphism
    • CRTP, mixins, "deducing this" (since C++23)

Keep the cache warm
  • Data cache
    • Periodically poke data on a timer
  • Instruction cache
    • Periodically run the hot path with dummy input/output


Other fun hardware problems
  • CPU throttling
    • Due to overheating
      • spread thermal dissipation
    • Due to low activity
      • keep CPU busy with pause instructions




Programming for deterministic execution time

What not to do in the hot path:

  • Dynamic memory allocations/deallocations
  • blocking the thread
  • I/O
  • exceptions
  • context switches / mode switches (user/kernel space)
  • syscalls
  • calling into unknown code
  • loops without definite bounds
  • algorithms > O(1), or with no statically known upper bound on N

Dynamic memory allocations/deallocations

Avoiding allocations
  • Do not use data types that allocate
  • Do not use algorithms that allocate
  • Do not use data structures that allocate
Don't use(in real time):
  • std::stable_sort 
  • std::stable_partition 
  • std::inplace_merge

Do use(in real time):
std::array
std::pair
std::tuple
std::optional 
std::variant

Don't use(in real time):
std::any
std::function
std::vector, std::deque, std::list etc.


Customer allocators(also for coroutine):
tcmalloc, rpmalloc...
are not good for ultra low latency and real-time:
  • minimising average cost, not worst case
  • not constant time
  • multithreaded (locks)
  • eventually go to OS to request dynamic memory

Custom Allocators
Preallocate everything
  • Monotonic allocators
  • std::pmr::monotonic_buffer_resource
  • Pool allocators
  • std::pmr::unsynchronised_pool_resource
  • Frame allocators
  • Arena allocators
  • Double-ended allocators for big/small buffers
  • Lock-free allocator (require helper thread)


Lambda does not malloc.

Coroutine might.
rely on the optimiser?
→ Eyal Zedaka: "Using Coroutines to Implement C++
Exceptions for Freestanding Environments" (CppCon 2021)
  • create and suspend coroutine frame upfront
  • write your own promise type, defining its own custom operator new and operator delete
  • Don't use coroutines in a low-latency/real-time scenario


Blocking the thread

Don't use mutex. (and spanner violates all these rules in every aspects lol)


Concurrency
  • atomic == indivisible, race-free
  • lock-free == at least one thread is guaranteed to make progress
  • wait-free == all threads are guaranteed to make progress

Wait-free concurrency
  • Can't block hot path (no mutexes, no spinning)
  • Can't do any syscalls
  • Can't do anything with unbounded/non-deterministic runtime
  • Only available synchronisation mechanism:
  • always:
    • static_assert(std::atomic<T>::is_always_lock_free);
How:
  • Passing data to/from "hot path" thread
    • wait-free queue
  • Sharing data with "hot path" thread
    • Hot path reads
      • spinlock try_lock (involve syscall.) Solution: spinlock with try_lock.
      • "spin-on-write"
        • Easy to use
        • Reader always wait-free
        • Tradeoffs:
          • Single reader
          •  read is 2x faster then std::atomic::exchange; because .load read from
            local CPU cache(since the value is MESI shared)
          •  reader blocks writer(s) who need to spin
          • one writer (or multiple writers who block each other)
          • writer needs to heap-allocate + copy
        std::unique_ptr‹biquad_coefficients> storage;
        std::atomic<biquad_coefficients*> coeffs;
        void process(audio_buffer& buffer) { auto* current_coeffs = coeffs.exchange(nullptr); process_biquad(buffer, *current_coffs) ; coeffs.store(current_coeffs) ; } void update_coeffs (biquad_coefficients new_coeffs) { auto new_coeffs = std::make_unique‹biquad_coefficients>(new_coeffs); for (auto* expected = storage.get(); !coeffs.compare_exchange_weak(expected, new_coeffs.get()); expected = storage.get() /* spin */;) storage = std::move(new_coffs); } // Or just use crill lib crill::spin_on_write_object<biquad_coefficients> coeffs; void process (audio_buffer& buffer) { auto read_ptr = coeffs.lock_read(); process_biquad (buffer, *read_ptr); } void update_coeffs(biquad_coefficients new_coeffs) { coeffs.update(new_coeffs); }
      • RCU
        • read is now a single atomic load (= no overhead on modern platforms)
        • Multiple concurrent readers & writers
        • Readers don't block writers
        • But we need to solve deferred reclamation problem!
          • hard...
          • https://www.youtube.com/watch?v=7fKxIZOyBCE
            crill::defer_reclaim_object<biquad_coefficients> coeffs;
            void process(audio_buffer& buffer) {
              uto read_ptr = coeffs.Lock_read() ;
              process_biquad (buffer, *read_ptr);
            }
            
            void update_coeffs (biquad_coefficients new_coeffs) {
              coeffs.update(new_coeffs);
            }
            
            void timer_callback(){
              coeffs.reclaim();
            }
          • Reading always wait-free
          • Multiple readers
          • algorithm simplifies greatly if single reader
          • read is single atomic load (= no overhead on modern platforms)
          • Readers do not block writers
          • Writer needs to do heap allocation
          • User needs to manage reclamation
        • Variation: reclaim_on_write_object
          • Reading always wait-free
          • Multiple readers
          • algorithm simplifies greatly if single reader
          • read is single atomic load (= no overhead on modern platforms)
          • Writer waits for active reader(s) to finish
          • Writing is fast and does not require heap allocation
          • No need to manage reclamation
    • Hot path writes
      • Double-buffering (just atomic SWAP to update the data.)
        • Always be aware of ABA problem. Thus always having state in the single
          atomic variable.
          std::array<frequency_spectrum, 2> slots;
          std::atomic<int> idx = {0};
          
          void process (audio_buffer& audio_in) {
          	auto spectrum = calculate_spectrum(audio_in);
          	int write_id = id.load();
          	slots [write_idx] = spectrum;
          }
          
          void update_spectrum() {
          	int read_id = idx.fetch_xor(1);
          	draw_spectrum(slots[read_idx]);
          }
          
          // Solve the ABA problem
          std::array<frequency_spectrum, 2> slots;
          std::atomic<int> idx = {0};
          enum {
          	BIT_IDX = (1 << 0),
          	BIT_NEWDATA = (1 << 1), 
              BIT_BUSY = (1 << 2),
          };
          void process (audio_buffer& audio_in) {
          	auto spectrum = calculate_spectrum(audio_in);
          	int write_idx = idx.fetch_or(BIT_BUSY) & BIT_IDX;
          	slots [write_idx] = spectrum;
          	idx.store ((write_idx & BIT_IDX) | BIT_NEWDATA);
          }
          // ...
          
      • SeqLock (HFT technique)
        • Writing always wait-free
        • Good solution if writing happens more rarely, data not too large
        • Tradeoffs:
          • Single writer, multiple readers
          • Readers lock-free but not wait-free (might have to retry unbounded nr of times)
          • Data must be trivially copyable
          • Overhead of copying data atomically on both reader & writer
      • std::atomic<std::size_t> seq = 0;
        // single thread doing the write.
        void store(T t) noexcept {
        	std::size_t old_seq = seq.fetch_add(1);
            /*
            // Faster
            auto old_seq = seq.load(std::memory_order_relaxed);
            seq.store(old_seq + 1, std::memory_order_relaxed);
        
            std::atomic_thread_fence(std::memory_order_release);
            */
        	// write data...
        	seq.store(old_seq + 2);
        }
        
        bool try_load(T& t) const noexcept {
        	std::size_t seq1 = seq.load(std::memory_order_acquired);
        	if (seq1 % 2 != 0) return false;
        	
            // read data...
            /*
            
            Use Byte-wise atomic memcpy to read the data;
            or chunk it; DO NOT USE memcpy
            
            for (size_t i = 0; i < count; ++i) {
        		reinterpret_cast<char*>(dest)[i] =
        			atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order_relaxed);
            }
        	
            atomic_thread_fence(order);
            */
            
            std::atomic_thread_fence(std::memory_order_acquired);
        	std: :size_t seq2 = seq.load(std::memory_order_relaxed);
        	return seq1 == seq2;
        }
Ring-Buffer, as always



I/O

  • With other threads: Push message into wait-free specific queue
  • With other processes: Shared memory
  • With hardware: Direct Memory Access (DMA)

Reading data from disk on the hot path
  • Pre-load into RAM and lock address range (to prevent swap-out) / MMAP
    • mlock (POSIX)
      mlock(), mlock2(), and mlockall() lock part or all of the calling process's virtual address space into RAM, preventing that memory from being paged to the swap area. munlock() and munlockall() perform the converse operation, unlocking part or all of the calling process's virtual address space, so that pages in the specified virtual address range can be swapped out again if required by the kernel memory manager. Memory locking and unlocking are performed in units of whole pages.
    • VirtualLock (Windows)
  • Disk streaming
    • Pre-load into RAM first ~ 100 ms of every possible sound
    • Once it starts playing, start filling in the rest from disk (on a background thread)

No exceptions.

Avoiding context switches/mode switches
  • Mainstream operating systems: Thread priority
  • Real-time operating systems: deterministic thread scheduler
  • If you control the hardware:
    • Kernel bypass
  • If your hot path is in a single thread,
  • and you don't care about efficiency of other threads:
    • Turn off hyperthreading
    • Pin hot path thread to one CPU core




[Optimization] Peephole Optimizations

Reference:
https://blog.regehr.org/archives/2485

Local Peephole Optimization and Compositional Refinement

Let f be a function targeted for peephole optimization. 
We can structurally decompose f into a target redex o (the optimizable instruction sequence) and a continuation context r (the residual program), such that:
f = r ∘ o

By definition, a correct optimization yields a refined sequence o′ that refines the original sequence o, denoted as:
o ⊑ o′

To lift this local optimization to the function level, we rely on the compositionality of the refinement relation. For any valid refinement definition, the context r must act as a monotonic operator:
If x ⊑ y, then r(x) ⊑ r(y)

Applying compositionality to our decomposition:
r(o) ⊑ r(o′)

Because f = r(o), it transitively follows that:
f ⊑ r(o′)

This proves that the locally optimized function refines the original function. By inductively applying this compositionality principle, we can scale this local refinement guarantee across the entire program.

Jun 13, 2026

[RACI]

The 4 RACI Roles

  • Responsible (The "Doer"): The person (or team) who actually performs the work to complete the task. They are hands-on and do the heavy lifting.
    Rule of thumb: There should be at least one 'R', but multiple contributors can share this responsibility.
  • Accountable (The "Owner"): The individual ultimately answerable for the correct and thorough completion of the deliverable. They delegate the work and sign off on the final result.
    Rule of thumb: There must be exactly one Accountable person per task to guarantee decision-making authority.
  • Consulted (The "Expert"): People who provide subject-matter expertise or input before a decision or action is finalized.
    Rule of thumb: Communication is two-way.
  • Informed (The "Loop"): People who need to be kept up-to-date on project progress, usually stakeholders or leadership.
    Rule of thumb: Communication is one-way—they are simply updated, not required to make decisions.

May 27, 2026

[C++] pointer arithmetic

Refernece:
[C++] Object Lifetimes reading minute
class Vec {
  public:
    double* data() { return &x; }

  private:
    double x,y,z;
};

Eigen::Map<...>(absl::Span<Vec>);

*(Vect::data() + 1) // Does not give us y

Take away: 

* Even though x, y, and z are allocated sequentially in memory without padding,
physical layout does not supersede semantic rules.
* The layout guarantees mean you can safely memcpy the data, or cast a Vec* to a double* to access the first element (x). It does not grant permission to use pointer arithmetic on double* to slide across the members.
* The pointer arithmetic is only guaranteed within the type of array.
* Pointer to variable only is considered as pointer to array of size 1.
* Thus any pointer arithmetic on single variable is considered out-of-bound; compiler is free to assume anything.

Explain:

Only char*, unsigned char*, and std::byte* are explicitly granted an exception in the standard to 
examine the raw object representation. double* enjoys no such privilege.

Fix:

class Vec {
 public:
  double* data() { return data_; } // Legal: returns pointer to element 0 of a 3-element array
 private:
  double data_[3]; // x=data_[0], y=data_[1], z=data_[2]
};