Jun 5, 2019

[C++] C++ Concurrency In Action, Second edition, recap [Ch.5]

Anthony Williams' C++ Concurrency In Action has hit its second edition; here I am jotting down reading notes starting from Chapter 5, which covers C++'s memory model and concurrent program design.
See also: The Art of Multiprocessor Programming (types of synchronization).

The system languages I currently use (Go, Java) present an effectively sequentially consistent memory model (for data-race-free programs), which is quite straightforward to work with. C++ gives us a bit more, so an understanding of MESI, store buffers, and memory/compiler barriers is a plus for reading through this chapter.



C++ abstract machine memory model:

https://en.cppreference.com/w/cpp/language/memory_model

All data in a C++ program is made up of objects.
It's a statement about the building blocks of data in C++.
The C++ Standard defines an object as "a region of storage" although it goes on to assign properties to these objects, such as their type and lifetime.



Object memory layout:

Reference:
struct padding http://vsdmars.blogspot.com/search/label/c_struct
  • Every variable is an object, including those that are members of other objects.
  • Every object occupies at least one memory location.
  • Variables of fundamental types such as int or char occupy exactly one memory location, whatever their size, even if they’re adjacent or part of an array.
  • Adjacent bit fields are part of the same memory location (see the sketch below).
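
A minimal sketch of how these rules map onto a struct (the field names are made up; exact sizes and padding are implementation-defined):
e.g
#include <string>

struct my_data {
    int i;              // one memory location
    double d;           // another memory location
    unsigned bf1 : 10;  // adjacent bit fields bf1 and bf2
    int bf2 : 25;       //   share a single memory location
    int : 0;            // zero-length unnamed bit field: starts a new memory location
    int bf3 : 9;        // so bf3 is in a separate memory location from bf2
    char c1, c2;        // distinct memory locations, even though adjacent
    std::string s;      // internally made of several memory locations
};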


Every object in a C++ program has a modification order composed of all the writes to that object from all threads in the program, starting with the object's initialization. 




compare_exchange_weak() can fail spuriously: the exchange may fail even when the stored value equals expected (e.g. on LL/SC architectures), which is why it is typically used in a loop:
e.g
bool expected=false;
extern std::atomic<bool> b; // set somewhere else
while(!b.compare_exchange_weak(expected,true) && !expected);



std::atomic<UDT>
  • UDT must have a trivial copy-assignment operator.
  • Every base class and non-static data member of the user-defined type must also have a trivial copy-assignment operator.
  • This permits the compiler to use memcpy() or an equivalent operation for assignment, because there's no user-written code to run (see the sketch below).
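
A minimal sketch of the requirement, using a hypothetical Point UDT and the standard type trait that captures it:
e.g
#include <atomic>
#include <type_traits>

struct Point {      // hypothetical UDT: no user-written copy-assignment,
    int x;          // so the implicitly-generated one is trivial
    int y;
};

static_assert(std::is_trivially_copyable<Point>::value,
              "std::atomic<T> requires a trivially copyable T");

std::atomic<Point> ap; // OK: assignment can be done with a memcpy()-like operation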



The compare-exchange operations do bitwise comparison as if using memcmp(), rather than using any comparison operator that may be defined for UDT.
  • If the type provides comparison operations that have different semantics,
  • or the type has padding bits that do not participate in normal comparisons,
then this can lead to a compare-exchange operation failing, even though the values compare equally.
    
Keep the UDT no larger than one word (or two words where double-word compare-and-swap, DWCAS, is available) so the atomic operations can be lock-free.

std::atomic<float> / std::atomic<double>:
compare_exchange_strong() can fail when two equal values have
different binary representations (e.g. +0.0 vs -0.0), as shown below.
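
A minimal sketch of this pitfall using signed zeros: 0.0 and -0.0 compare equal as values, but their bit patterns differ, so the bitwise compare-exchange fails:
e.g
#include <atomic>
#include <cstdio>

int main() {
    std::atomic<double> d{-0.0};
    double expected = 0.0;  // 0.0 == -0.0, but the representations differ bitwise
    bool ok = d.compare_exchange_strong(expected, 1.0);
    // ok is false: the memcmp()-style comparison failed even though the values
    // compare equal; expected now holds -0.0, the value actually stored in d.
    std::printf("exchanged=%d expected=%g\n", ok, expected);
}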



C++20:

template<class T>
struct atomic<std::shared_ptr<T>>;
template<class T>
struct atomic<std::weak_ptr<T>>;
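
A minimal usage sketch of the C++20 specialization (assuming a C++20 compiler; pre-C++20 code uses the free functions below instead):
e.g
#include <atomic>
#include <memory>

std::atomic<std::shared_ptr<int>> ptr;

void writer() {
    ptr.store(std::make_shared<int>(42)); // atomic replacement of the shared_ptr
}

void reader() {
    if (auto local = ptr.load()) {        // atomic copy of the shared_ptr
        int value = *local;               // safe: local keeps the object alive
        (void)value;
    }
}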



Free functions: 

Free functions are designed to be C-compatible, so they use pointers rather than references in all cases.
std::atomic_load()
std::atomic_store()
std::atomic_is_lock_free()

There are also free functions (full template specializations) to atomically load/store a std::shared_ptr.

e.g
#include <atomic>
#include <memory>

struct my_data {}; // stand-in definition; the real type lives elsewhere
void process_data(std::shared_ptr<my_data>); // defined elsewhere

std::shared_ptr<my_data> p;

void process_global_data()
{
    std::shared_ptr<my_data> local = std::atomic_load(&p);
    process_data(local);
}

void update_global_data()
{
    std::shared_ptr<my_data> local(new my_data);
    std::atomic_store(&p, local);
}



synchronizes-with:

  • Only between operations on atomic types. 
  • Operations on a data structure (such as locking a mutex) might provide this relationship if the data structure contains atomic types and the operations on that data structure perform the appropriate atomic operations internally, but fundamentally it comes only from operations on atomic types.



happens-before relationship

  • happens-before
  • strongly-happens-before
    (the same as happens-before, except it excludes relationships established through the memory_order_consume tag)
These relationships are the basic building blocks of operation ordering in a program; they specify which operations see the effects of which other operations.
  • single thread
    If one operation is sequenced before another, then it also happens-before it, and strongly-happens-before it.
  • single statement
    If two operations occur in the same statement, in general there's no happens-before relationship between them, because they're unsequenced (see the sketch after this list).
  • between threads
    If operation A on one thread inter-thread happens-before operation B on another thread, then A happens-before B.
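
A minimal sketch of the single-statement case; the evaluation order of the two calls is unspecified, so either output is possible:
e.g
#include <cstdio>

int get_num() {
    static int i = 0;
    return ++i;
}

int main() {
    // The two calls to get_num() are unsequenced relative to each other,
    // so this may print "1,2" or "2,1".
    std::printf("%d,%d\n", get_num(), get_num());
}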



Sequentially consistent ordering

  • It's C++'s default tag (std::memory_order_seq_cst); all sequentially consistent loads/stores on all atomic variables form a single total order that every thread agrees on.
  • The trade-off: on a weakly-ordered machine with many cores, heavyweight synchronization is needed.
e.g
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
    x.store(true,std::memory_order_seq_cst);
}

void write_y()
{
    y.store(true,std::memory_order_seq_cst);
}

void read_x_then_y()
{
    while(!x.load(std::memory_order_seq_cst));
    if(y.load(std::memory_order_seq_cst))
        ++z;
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_seq_cst));
    if(x.load(std::memory_order_seq_cst))
        ++z;
}

int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load()!=0); // Never happens
}



Non-sequentially consistent ordering

Threads don't have to agree on the order of events.
  • Relaxed ordering
    • Does not participate in synchronizes-with relationships between threads, even on the same variable.
    • Within a single thread, accesses to the same variable still honor the happens-before relationship.
    • Beware: a relaxed store/load issues no read/write memory barrier (RMB/WMB); it only guarantees that the access itself is atomic.
    • Relaxed operations on separate variables can usually be freely reordered by the compiler or the hardware (see the relaxed-ordering sketch after this list).
  • Acquire-release ordering
    • No total order of operations.
    • Synchronization is pairwise, between the thread that does the release and the thread that does the acquire.
    • A release operation synchronizes-with an acquire operation that reads the value written.
    • A relaxed variable can be combined with an acquire-release variable, taking advantage of the latter's barriers (WMB/RMB) to make the relaxed variable's state predictable.
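
First, a sketch of pure relaxed ordering, modeled on the book's relaxed-ordering example: with no release/acquire pair the reader may see y as true while still reading the old value of x.
e.g
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;

void write_x_then_y()
{
    x.store(true,std::memory_order_relaxed);
    y.store(true,std::memory_order_relaxed); // no synchronizes-with relationship
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_relaxed));
    if(x.load(std::memory_order_relaxed)) // may still read false
        ++z;
}

int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load()!=0); // could fire: the stores may be observed out of order
}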
e.g. acquire-release with two independent writers (no total order, so the assert can fire):
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;
void write_x()
{
    x.store(true,std::memory_order_release);
}

void write_y()
{
    y.store(true,std::memory_order_release);
}

void read_x_then_y()
{
    while(!x.load(std::memory_order_acquire));
    if(y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));
    if(x.load(std::memory_order_acquire))
        ++z;
}

int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load()!=0);  // could fire
}

Alternatively, store one variable with relaxed ordering and another with release ordering, so the release/acquire pair forces a synchronizes-with relationship:

e.g
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;

void write_x_then_y()
{
    x.store(true,std::memory_order_relaxed);
    y.store(true,std::memory_order_release);
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));
    if(x.load(std::memory_order_relaxed))
        ++z;
}

int main() {
    x=false;
    y=false;
    z=0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load()!=0); // won't fire: the release/acquire pair orders the stores
}



std::memory_order_consume is all about enabling compiler optimization.
Compared with std::memory_order_acquire, which does not allow any later loads (dependent or non-dependent) to be reordered across the barrier, std::memory_order_consume allows non-dependent loads to cross the barrier; only data-dependent loads are ordered.
e.g
#include <atomic>
#include <thread>
#include <chrono>
#include <string>
#include <assert.h>

struct X
{
    int i;
    std::string s;
};

std::atomic<X*> p;
std::atomic<int> a;

void create_x()
{
    X* x=new X;
    x->i=42;
    x->s="hello";
    a.store(99,std::memory_order_relaxed); // 'a' can not pass p.store
    p.store(x,std::memory_order_release);
}

void use_x()
{
    X* x;
    while(!(x=p.load(std::memory_order_consume)))
        std::this_thread::sleep_for(std::chrono::microseconds(1));
    assert(x->i==42); // won't fire
    assert(x->s=="hello"); // won't fire
    
    // could fire; while 'a' is non-dependent.
    assert(a.load(std::memory_order_relaxed)==99); 
}
 
int main()
{
    std::thread t1(create_x);
    std::thread t2(use_x);
    t1.join();
    t2.join();
}



std::kill_dependency()
What does `std::kill_dependency` do, and why would I want to use it?
It's also about compiler optimization; 
e.g
r1 = x.load(memory_order_consume);
r2 = r1->index;
do_something_with(a[std::kill_dependency(r2)]);
becomes:

predicted_r2 = x->index; // unordered load
r1 = x; // ordered load
r2 = r1->index;
// may be faster than waiting for r2's value to be available
do_something_with(a[predicted_r2]); 
or this:
predicted_r2 = x->index; // unordered load
predicted_a  = a[predicted_r2]; // get the CPU loading it early on
r1 = x; // ordered load
r2 = r1->index; // ordered load
do_something_with(predicted_a);
Or, more aggressively, if the compiler knows that do_something_with won't change the result of the loads for r1 or r2, it can hoist the call all the way up:
do_something_with(a[x->index]); // completely unordered
r1 = x; // ordered
r2 = r1->index; // ordered
Another example:
int global_data[]={ ... };
std::atomic<int> index;

void f()
{
    int i=index.load(std::memory_order_consume);
    do_something_with(global_data[std::kill_dependency(i)]);
}

Don't use std::memory_order_consume or std::kill_dependency unless you know your hardware.



Acquire-Release

e.g
#include <atomic>
#include <thread>
#include <vector>

std::vector<int> queue_data;
std::atomic<int> count;

void wait_for_more_items(); // stub: blocks until the producer publishes more items
void process(int item);     // stub: consumes one item

void populate_queue()
{
    unsigned const number_of_items=20;
    queue_data.clear();
    
    for(unsigned i=0;i<number_of_items;++i)
    {
        queue_data.push_back(i);
    }
    
    count.store(number_of_items,std::memory_order_release);
}

void consume_queue_items()
{
    while(true) { // note: the consumers in this sketch loop forever
        int item_index;
        if((item_index=count.fetch_sub(1,std::memory_order_acquire))<=0) {
            wait_for_more_items();
            continue;
        }
        process(queue_data[item_index-1]); // process the item just claimed
    }
}

int main() {
    std::thread a(populate_queue);
    std::thread b(consume_queue_items);
    std::thread c(consume_queue_items);
    a.join();
    b.join();
    c.join();
}   



Fences

  • An atomic operations library wouldn't be complete without a set of fences.
  • These are operations that enforce memory-ordering constraints without modifying any data and are typically combined with atomic operations that use the memory_order_relaxed ordering constraints.
  • Fences are global operations and affect the ordering of other atomic operations in the thread that executed the fence.
  • Fences restrict relaxed variables' freedom and introduce happens-before and synchronizes-with relationships that weren't present before.

e.g
#include <atomic>
#include <thread>
#include <assert.h>
std::atomic<bool> x,y;
std::atomic<int> z;

void write_x_then_y() {
    x.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true,std::memory_order_relaxed);
}

void read_y_then_x() {
    while(!y.load(std::memory_order_relaxed));
    std::atomic_thread_fence(std::memory_order_acquire);
    if(x.load(std::memory_order_relaxed))
        ++z;
}

int main() {
    x=false;
    y=false;
    z=0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load()!=0); // won't fire
}


But the following doesn't provide a happens-before relationship between the stores to x and y, because the fence is not placed between them:
void write_x_then_y()
{
    std::atomic_thread_fence(std::memory_order_release);
    x.store(true,std::memory_order_relaxed);
    y.store(true,std::memory_order_relaxed);
}


Question: does x need to be atomic?
Not really; the fence pair still orders the plain store and load of x:
e.g
#include <atomic>
#include <thread>
#include <assert.h>

bool x=false;
std::atomic<bool> y;
std::atomic<int> z;

void write_x_then_y() {
    x=true;
    std::atomic_thread_fence(std::memory_order_release);
    y.store(true,std::memory_order_relaxed);
}

void read_y_then_x() {
    while(!y.load(std::memory_order_relaxed));
    std::atomic_thread_fence(std::memory_order_acquire);
    if(x)
        ++z;
}

int main() {
    x=false;
    y=false;
    z=0;
    std::thread a(write_x_then_y);
    std::thread b(read_y_then_x);
    a.join();
    b.join();
    assert(z.load()!=0); // won't fire
}



synchronization relationships

  • std::thread 
    • The completion of the std::thread constructor synchronizes with the invocation of the supplied function or callable object on the new thread.
    • The completion of a thread synchronizes with the return from a successful call to join on the std::thread object that owns that thread.
  • std::mutex, std::timed_mutex, std::recursive_mutex, std::recursive_timed_mutex
    • All calls to lock and unlock, and successful calls to try_lock, try_lock_for, or try_lock_until, on a given mutex object form a single total order: the lock order of the mutex.
    • A call to unlock on a given mutex object synchronizes with a subsequent call to lock, or a subsequent successful call to try_lock, try_lock_for, or try_lock_until, on that object in the lock order of the mutex.
    • Failed calls to try_lock, try_lock_for, or try_lock_until do not participate in any synchronization relationships.
  • std::shared_mutex, std::shared_timed_mutex
    • All calls to lock, unlock, lock_shared, and unlock_shared, and successful calls to try_lock, try_lock_for, try_lock_until, try_lock_shared, try_lock_shared_for, or try_lock_shared_until, on a given mutex object form a single total order: the lock order of the mutex.
    • A call to unlock on a given mutex object synchronizes with a subsequent call to lock or lock_shared, or a subsequent successful call to try_lock, try_lock_for, try_lock_until, try_lock_shared, try_lock_shared_for, or try_lock_shared_until, on that object in the lock order of the mutex.
    • Failed calls to try_lock, try_lock_for, try_lock_until, try_lock_shared, try_lock_shared_for, or try_lock_shared_until do not participate in any synchronization relationships.
  • std::promise, std::future, std::shared_future
    • The successful completion of a call to set_value or set_exception on a given std::promise object synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the promise.
    • The destructor of a given std::promise object that stores an std::future_error exception in the shared asynchronous state associated with the promise synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the promise.
  • std::packaged_task, std::future, std::shared_future
    • The successful completion of a call to the function call operator of a given std::packaged_task object synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the packaged task.
    • The destructor of a given std::packaged_task object that stores an std::future_error exception in the shared asynchronous state associated with the packaged task synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the packaged task.
  • std::async, std::future, std::shared_future
    • The completion of the thread running a task launched via a call to std::async with a policy of std::launch::async synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the spawned task.
    • The completion of a task launched via a call to std::async with a policy of std::launch::deferred synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the promise.
  • std::experimental::future, std::experimental::shared_future and continuations
    • The event that causes an asynchronous shared state to become ready synchronizes with the invocation of a continuation function scheduled on that shared state.
    • The completion of a continuation function synchronizes with a successful return from a call to wait or get, or a call to wait_for or wait_until that returns std::future_status::ready on a future that shares the same asynchronous state as the future returned from the call to then that scheduled the continuation, or the invocation of any continuation scheduled on that future.
  • std::experimental::latch
    • The invocation of each call to count_down or count_down_and_wait on a given instance of std::experimental::latch synchronizes with the completion of each successful call to wait or count_down_and_wait on that latch.
  • std::experimental::barrier
    • The invocation of each call to arrive_and_wait or arrive_and_drop on a given instance of std::experimental::barrier synchronizes with the completion of each subsequent successful call to arrive_and_wait on that barrier.
  • std::experimental::flex_barrier
    • The invocation of each call to arrive_and_wait or arrive_and_drop on a given instance of std::experimental::flex_barrier synchronizes with the completion of each subsequent successful call to arrive_and_wait on that barrier.
    • The invocation of each call to arrive_and_wait or arrive_and_drop on a given instance of std::experimental::flex_barrier synchronizes with the subsequent invocation of the completion function on that barrier.
    • The return from the completion function on a given instance of std::experimental::flex_barrier synchronizes with the completion of each call to arrive_and_wait on that barrier that was blocked waiting for that barrier when the completion function was invoked.
  • std::condition_variable, std::condition_variable_any
    • Condition variables do not provide any synchronization relationships. They are optimizations over busy-wait loops; all the synchronization is provided by the operations on the associated mutex (see the sketch below).
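
A minimal sketch of that last point: the visibility of ready is guaranteed by the mutex's unlock/lock pair; the condition variable only removes the busy-wait.
e.g
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false; // plain bool: protected by m, which provides all the ordering

void producer() {
    {
        std::lock_guard<std::mutex> lk(m);
        ready = true;
    }                // this unlock synchronizes-with the consumer's lock below
    cv.notify_one(); // wakes the waiter, but adds no ordering of its own
}

void consumer() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, []{ return ready; }); // re-checks the predicate under the lock
    // ready==true is guaranteed visible here because of the mutex, not the CV
}

int main() {
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
}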
