Showing posts with label cpp17. Show all posts

Sep 28, 2025

[C++] Aggregate initialization selects the base type's constructor even without `using Base::Base`

Bitten by this constructor-resolution rule:

1) D is an aggregate
From [dcl.init.aggr] §11.6.2/1 (C++23 draft N4950):
"An aggregate is an array or a class with no user-declared or inherited constructors… (…) A class is an aggregate even if it has base classes."
So `class D : public Base {};` is an aggregate.

2) Aggregate initialization rule
From [dcl.init.aggr] §11.6.2/4:
"If the aggregate has a base class, each base class is initialized in the order of declaration using the corresponding elements of the initializer list."
So the single element {42.3} is forwarded to the base class Base.

3) [class.base.init] §12.6.2/2:
In a non-delegating constructor, if a given potentially constructed subobject is not designated by a mem-initializer, then it is initialized as follows:
— if the entity is a base class, the base class’s default constructor is called,
unless otherwise specified by the rules of aggregate initialization.

And the aggregate initialization rules (step 2) say the base is initialized from that element. The base subobject is copy-initialized from the element, which is why marking a constructor `explicit` takes it out of consideration here.

4) Constructor overload resolution
Now, how does Base itself get constructed from 42.3?
That’s governed by [dcl.init] §11.6.1/17:
If the initializer is a parenthesized expression-list or a braced-init-list,
constructors are considered. 
The applicable constructors are selected by overload resolution ([over.match.ctor]).
So the compiler now does overload resolution among Base(int) and Base(double). The best match for 42.3 is Base(double).

#include <iostream>


class Base {
  public:
  // marking this constructor `explicit` prevents a derived aggregate
  // from picking it up in aggregate initialization
  Base(int) {
    std::cout << "base int\n";
  }

  // same here: `explicit` would take this constructor out of consideration
  Base(double) {
    std::cout << "base double\n";
  }
};


class D : public Base {

};


int main() {

  D d = {42.3}; // OK: aggregate initialization, calls Base(double)
  // D d = 42.3; // error: no conversion, unless `using Base::Base;` is added to D
}


#include <iostream>


class Base {
  public:

   Base(double, int) {
    std::cout << "base double, int\n";
  }
};

class Base2 {
  public:

   Base2(double, int) {
    std::cout << "base2 double, int\n";
  }
};


class D : public Base, public Base2 {

};


int main() {

  D d = {{42.3, 42}, {42.4, 42}};
}

Nov 26, 2023

[C++] C++17 : C++20 Diff

Reference:
Changes between C++17 and C++20 DIS: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2131r0.html


P0692R1 Access checking on specializations

This change fixes a long-standing, somewhat obscure situation, where it was not possible to declare a template specialization for a template argument that is a private (or protected) member type. For example, given class Foo { class Bar {}; };,
the access Foo::Bar is now allowed in
template<class> struct X;
template<> struct X<Foo::Bar>;

Jul 6, 2022

[C++] count bool 'true' snippet

template <bool B = false> constexpr auto count() -> size_t { return B ? 1 : 0; }

template <bool B1, bool B2, bool... Tail> constexpr auto count() -> size_t {
  return (B1 ? 1 : 0) + count<B2, Tail...>();
}

Apr 19, 2022

[C++] structured binding minute

Reference:
https://en.cppreference.com/w/cpp/language/structured_binding

Quick reference of structured binding rules.

Rules

  • struct MyStruct {
         int i = 0;
         std::string s;
         const int i2 = 42;
    };
    
    MyStruct ms;
    // auto [u,v,w] = ms; behaves as if a hidden copy e were made and each
    // name were an alias for one data member, honoring that member's type:
    auto e = ms;        // e is of type MyStruct
    aliasname u = e.i;  // u is int
    aliasname v = e.s;  // v is std::string
    aliasname w = e.i2; // i2 is const, thus w is const int
  • Qualifiers on the declaration apply to the anonymous entity, and
    decltype of a bound name yields the member's type with those qualifiers added:
    // const ref to the anonymous entity; the decltype of each binding
    // honors the qualifiers:
    const auto& [u,v,w] = ms;
    decltype(u); // const int
    decltype(v); // const std::string
    decltype(w); // const int
  • alignas(16) auto [u,v] = ms; // aligns the anonymous entity, not u or v
  • structured bindings do not decay.
    struct S {
         const char x[6];
         const char y[3];
    };
    
    S s1{};
    auto [a, b] = s1; // a and b have types "const char [6]" and "const char [3]"
  • Structured bindings are used in: 
    • For structures and classes where all non-static data members are public, you can bind each non-static data member to exactly one name. 
    • For raw arrays, you can bind a name to each element of the array. 
    • For any type, you can use a tuple-like API to bind names to whatever the API defines as "elements."
      The API roughly requires the following components for a type `type`:
      • std::tuple_size<type>::value has to return the number of elements.
      • std::tuple_element<idx,type>::type has to return the type of the idxth element.
      • A global or member get<idx>() has to yield the value of the idxth element. 
      • The standard library types std::pair<>, std::tuple<>, and std::array<> are examples of types that provide this API.
      • If structures or classes provide the tuple-like API, the API is used.
  • skip binding variable trick:
    // works only once within the same scope; this also applies to the global namespace:
    auto [_,val1] = getStruct(); // OK
    auto [_,val2] = getStruct(); // ERROR: name _ already used

Apr 12, 2022

[C++] noexcept constructor

Reference:

Using noexcept as an operator to make a constructor conditionally noexcept.

e.g.
template<typename T>
struct Holder
{
    T value;

    template<typename... Args>
    Holder(Args&&... args)
        noexcept(noexcept(T(std::forward<Args>(args)...))) :
        value(std::forward<Args>(args)...) {}
};

Mar 16, 2022

[C++] memory model as for high performance concerns

Tips

The downside with sequential consistency is that it can hurt performance.
Use atomic with a relaxed memory model instead.




std::shared_ptr
To satisfy thread safety requirements, the reference counters are typically incremented using an equivalent of std::atomic::fetch_add with std::memory_order_relaxed (decrementing requires stronger ordering to safely destroy the control block).



Performance guidelines

  • Correctness over performance
  • Avoid contention
  • Minimize the time spent in critical sections.
  • Avoid blocking operations (posix system call / sync call)
  • Be aware of number of threads/CPU cores
  • Thread priorities
    • Important for lowering the latency of tasks
  • Avoid priority inversion (cf. Go's sync.Mutex, which switches to a starvation mode once a waiter has been blocked for more than 1 ms)
    i.e. a thread with high priority is waiting to acquire a lock that is currently held by a low-priority thread.
  • For real-time applications, we cannot use locks to protect any shared resources that need to be accessed by real-time threads.
    A thread that produces real-time audio, for example, runs with the highest possible priority, and in order to avoid priority inversion, it is not possible for the audio thread to call any functions (including std::malloc() ) that might block and cause a context switch.
  • Thread affinity;  a request to the scheduler that some threads should be executed on a particular core if possible, to minimize cache misses.
  • False sharing
    Pad each element in the array so that two adjacent elements cannot reside on the same cache line.
    Since C++17, there is a portable way of doing this using the std::hardware_destructive_interference_size constant defined in <new> in combination with the alignas specifier.

// Pad each element so that two adjacent elements never share a cache line,
// thus avoiding false sharing: each vector element owns a cache line
struct alignas(std::hardware_destructive_interference_size) Element {
	int counter_{};
};
auto elements = std::vector<Element>(num_threads);

Mar 3, 2022

[POSIX][C][C++] snprintf / sprintf can be thread safe thus slow

Reference:
https://aras-p.info/blog/2022/02/25/Curious-lack-of-sprintf-scaling
https://developer.arm.com/documentation/dui0492/i/the-c-and-c---libraries/thread-safe-c-library-functions
https://twitter.com/aras_p/status/1496489672373063682

snprintf / sprintf use the global locale; the functions themselves are thread-safe (they take a mutex), and therefore slow and unable to scale.

take away:

  1. for converting int to string, use std::to_chars instead of snprintf
  2. {fmt} comes to the rescue (at the cost of compile time)
  3. for double-checked locking, beware: use an atomic with an explicit memory order, not a plain bool/int, to be correct across platforms.
  4. don't use iostream for high-performance code
  5. always profile.

Feb 10, 2022

[C++] ref data member is not const inside a const member function due to C++'s type system

Data member with reference type is not const inside a const member function due to C++'s type system.

i.e.

cv-qualifiers compose from right to left, so a const member function would have to produce

int& const data_member;

which is not valid: const would apply to the reference itself (already immutable), never to the referent, so the referent stays modifiable.

struct Fun {
  int &a;
  void run() const { a += 1; } // OK: const never reaches the referent
};

int main() {

  int a = 42;
  [&a] { a += 1; }();

  Fun f{a};
  f.run();
}

Feb 9, 2022

[C++] what is allowed/not allowed in a signal handler in C++17 (based on C11)

ISO reference

[C++/Rust] use of thread_local in code.

not allowed in signal handlers

  • Calling any standard library function (unless explicitly specified as signal-safe)
  • Calling new or delete (unless a safe memory allocator is used)
  • Using objects that are thread_local
  • Using dynamic_cast
  • Throwing an exception or entering a try block
  • Performing or waiting for the first initialization of a variable with static storage duration

allowed in signal handlers

  • abort() and _Exit()
  • quick_exit(), if the functions registered with at_quick_exit() are signal-safe
  • memcpy() and memmove()
  • All member functions of std::numeric_limits<>
  • All functions for std::initializer_lists
  • All type traits

[C++] CTAD tips

Reference:

A deduction guide is reasonable for any class template that has a constructor taking an object of its template parameter type by reference.

e.g.
template<typename T>
struct C {
     C(const T&) {
     }
};

C x{"hello"}; // T deduced as char[6]

// with deduction guide
template<typename T> C(T) -> C<T>;

C x{"hello"}; // T deduced as const char*


Non-Template Deduction Guides

template<typename T>
struct S {
	T val;
};

S(const char*) -> S<std::string>; // map S<> for string literals to S<std::string>

S s1{"hello"}; // OK, same as: S<std::string> s1{"hello"};
S s2 = {"hello"}; // OK, same as: S<std::string> s2 = {"hello"};
S s3 = S{"hello"}; // OK, both S deduced to be S<std::string>

S s4 = "hello"; // ERROR: can’t initialize aggregates without braces
S s5("hello"); // ERROR: can’t initialize aggregates without braces


Deduction Guides versus Constructors

Deduction guides compete with the constructors of a class. 
Class template argument deduction uses the constructor/guide that has the highest priority according to overload resolution. 
If a constructor and a deduction guide match equally well, the deduction guide is preferred.

template<typename T>
struct C1 {
     C1(const T&) {}
};

C1(int) -> C1<long>;

// T deduced as long; the deduction guide is used because it is
// preferred by overload resolution.
C1 x1{42}; 

// T deduced as char; the constructor is a better match (because
// no type conversion is necessary)
C1 x3{'x'};


Explicit Deduction Guides

template<typename T>
struct S {
	T val;
};

explicit S(const char*) -> S<std::string>;

S s1 = {"hello"}; // ERROR (deduction guide ignored and otherwise invalid)

S s2{"hello"}; // OK, same as: S<std::string> s2{"hello"};
S s3 = S{"hello"}; // OK
S s4 = {S{"hello"}}; // OK

template<typename T>
struct Ptr {
     Ptr(T) { std::cout << "Ptr(T)\n"; }
     template<typename U>
     Ptr(U) { std::cout << "Ptr(U)\n"; }
};


template<typename T>
explicit Ptr(T) -> Ptr<T*>;

Ptr p1{42}; // deduces Ptr<int*> via the explicit deduction guide
Ptr p2 = 42; // deduces Ptr<int> via the constructor (the explicit guide is ignored in copy-initialization)
int i = 42;
Ptr p3{&i}; // deduces Ptr<int**> via the explicit deduction guide
Ptr p4 = &i; // deduces Ptr<int*> via the constructor


Deduction Guides for Aggregates

template<typename T>
struct A {
	T val;
};

A i1{42}; // ok
A s1("hi"); // ok
A s2{"hi"}; // ok
A s4 = {"hi"}; // ok
A s3 = "hi"; // ERROR: no viable conversion from 'const char[3]' to 'A'

// deduction guide
A(const char*) -> A<std::string>;

A s2{"hi"}; // OK
A s4 = {"hi"}; // OK

Note that (as usual for aggregate initialization) you still need curly braces.
Otherwise, type T is successfully deduced but the initialization is an error:

A s1("hi"); // ERROR: T is std::string, but no aggregate initialization
A s3 = "hi"; // ERROR: T is std::string, but no aggregate initialization

In any case, for a type with complicated constructors, such as std::vector<> and the other STL containers, it is highly recommended not to use class template argument deduction and instead to specify the element type(s) explicitly.

std::vector v3{"hi", "world"}; // OK, deduces std::vector<const char*>
std::vector v4("hi", "world"); // OOPS: deduces std::vector<char> via the iterator-range constructor; fatal runtime error



[C++] structured binding with auto&&

Reference:
https://devblogs.microsoft.com/oldnewthing/20201014-00/?p=104367

auto&& is a universal (forwarding) reference; when bound to an l-value, reference collapsing turns & && into &.


#include <iostream>
#include <string>
using namespace std;

struct Customer {
  string a;
  string b;
  int c;
};

int main() {
  Customer c{"Tim", "Starr", 42};
  auto [f, l, v] = c;

  std::cout << "f/l/v: " << f << ' ' << l << ' ' << v << '\n';

  // modify structured bindings via references:
  // auto&& is a universal reference; for an l-value it collapses
  // & && to &; auto = Customer&
  auto &&[f2, l2, v2] = c; 
  
  // auto&& is a universal reference; here it binds to an r-value; auto = Customer
  auto &&[f3, l3, v3] = Customer{"a", "b", 42}; 

  cout << f2 << endl;
}

Dec 28, 2021

[C++] some move semantics wrap-up (interesting there's a thing i didn't realize :-) )

Reference:
The Hidden Secrets of Move Semantics - Nicolai Josuttis - CppCon 2020

https://www.youtube.com/watch?v=TFMKjL38xAI


  • A const variable disables move semantics; std::move() on a const object yields:
    const Type&& t;
    
    void take(Type&&);
    
    take(std::move(t)); // invalid: const Type&& cannot bind to Type&&
  • Don't return a const value from a function. It prevents the caller from moving from the result:
    const Type getValue();
    void take(Type&&);
    
    take(getValue()); // invalid
    
  • const&& is possible, but a semantic contradiction; i.e no move constructor will take const&& type.
  • To always adopt values you can take by value and move
    If move is cheap (e.g., to init string members)
  • T&& and auto&& are universal/forwarding references,
    unless T is not a template parameter of the function itself,
    unless the argument is const (const T&& is a plain r-value reference),
    unless T is part of a type expression (e.g., T::value_type&&)
  • std::string&& can be a universal reference in full specializations; i.e.
    template<typename T>
    void Fun(T&&);
    
    template<>
    void Fun(std::string&&); // specializes the universal reference for std::string; NOT an r-value reference.
    
  • Universal reference does not always forward
  • Getters should return by value
    or should be overloaded on reference qualifiers.
    Do NOT bind a range-based for loop to a temporary returned by a getter.
  • Use auto&& in generic range-based for loops; e.g., looping over a std::vector<bool> creates a temporary proxy object, and auto& CANNOT bind to it.
  • The range-based for loop is broken
    when iterating over references to temporaries;
    Refer to: [C++] Object Lifetimes reading minute
  • Use std::move() inside member functions that are &&-qualified (i.e., only invocable on r-value objects)

Oct 23, 2021

[C++] memory model in depth

Reference:
https://www.codeproject.com/Articles/1183423/We-Make-a-std-shared-mutex-10-Times-Faster

https://en.wikipedia.org/wiki/Register_allocation#Spilling

https://en.wikipedia.org/wiki/MESIF_protocol

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

Concurrent Data Structures library:
https://github.com/khizmax/libcds


compound operations

i.e RMW(read-modify-write)

The operation a = a+1; consists of at least three mini-operations:

  1. Load the value of the variable "a" into the CPU register
  2. Add 1 to the value in the register
  3. Write the value of the register back into the variable "a"


3 ways to avoid data race

  1. Use atomic instructions with atomic variables; however, it's difficult to realize complex logic.
  2. Complex lock-free algorithms for each new container.
  3. Use locks. Admit 1 thread, one by one, to the locked code, so the problem of data-races does not arise and we can use arbitrary complex logic by using any normal not thread-safe objects.

Difference between std::atomic and volatile in C++11

1. Optimizations
For
std::atomic<T> a;
two optimizations are possible, which are impossible for volatile T a; // always spilled to memory.
Optimization of the fusion:
a = 10; a = 20;
can be replaced by compiler with
a = 20;
Optimization of substitution by a constant:
a = 1; local = a;
can be replaced by the compiler:
a = 1; local = 1;

2. Reordering
std::atomic<T> a;
operations can limit reordering around themselves for operations with the ordinary variables and operations with other atomic variables in accordance with the used memory barrier std::memory_order_...

volatile T a;
does not affect the order of regular variables (non-atomic/non-volatile), but accesses to volatile variables always preserve a strict mutual order: the order of execution of any two volatile operations cannot be changed by the compiler, but can be changed by the CPU.

The compiler cannot reorder operations on volatile variables at compile-time, but the compiler allows the CPU to do this reordering at run-time.

3. Spilling
The std::memory_order_release, std::memory_order_acq_rel and std::memory_order_seq_cst memory barriers, when specified for
std::atomic<T> a;
flush the regular variables from the CPU registers into the main memory/cache, except when the compiler can guarantee that a given local variable cannot be used by other threads.

4. Atomicity / alignment
For 
std::atomic<T> a;
other threads see that operation has been performed entirely or not performed at all.
For Integral types T, this is achieved by alignment of the atomic variable location in memory by compiler - at least, the variable is in a single cache line, so that the variable can be changed or loaded by one operation of the CPU.
Conversely, the compiler does not guarantee the alignment of the volatile variables.
Volatile variables are commonly used to access the memory of devices (or in other cases), so an API of the device driver returns a pointer to volatile variables, and this API ensures alignment if necessary.

5. Atomicity of RMW operations (read-modify-write)
For 
std::atomic<T> a;
operations ( ++, --, += , -= , *=, /=, CAS, exchange) are performed atomically,
i.e., if two threads do operation ++a; then the a-variable will always be increased by 2.
This is achieved by locking cache-line (x86_64) or by marking the cache line on CPUs that support LL/SC(Load-link/store-conditional) (ARM, PowerPC) for the duration of the RMW-operation.
Volatile variables do not ensure atomicity of compound RMW-operations.

There is one general rule for the variables std::atomic and volatile:
each read or write operation always calls the memory/cache, i.e. the values are never cached in the CPU registers.

Any optimizations and any reordering of independent instructions relative to each other done by the compiler or CPU are possible for ordinary variables and objects (non-atomic/non-volatile).

Recall that operations of writing to memory with atomic variables with std::memory_order_release, std::memory_order_acq_rel and std::memory_order_seq_cst memory barriers guarantee spilling (writing to the memory from the registers) of all non-atomic/non-volatile variables, which are in the CPU registers at the moment, at once: https://en.wikipedia.org/wiki/Register_allocation#Spilling


Changing the Order of Instructions Execution

The compiler and processor change the order of instructions to optimize the program and to improve its performance.
  1. compiler reordering
  2. x86_64 CPU reordering

Detail depict

Upon initiating the writing to the memory through
mov b[rip], 5
instruction, the following occurs:
First, the value 5 and the address b[rip] are placed in the store-buffer (SB) queue; the cache lines containing the address b[rip] in all other CPU cores are invalidated, and their responses are awaited.
Then CPU-Core-0 sets the “eXclusive” status for the cache line containing b[rip].
Only after that, the actual writing of the value of 5 from the Store-buffer is carried out into this cache line at b[rip].

To avoid waiting all this time, immediately after "5" is placed in the store-buffer and without waiting for the actual cache write, the CPU can start executing the following instructions: reads from memory or register operations (i.e., it can read the value directly from the store-buffer instead of from the cache).

Weaker memory barriers allow reordering instructions in the permitted directions, which lets the compiler and the CPU optimize the code better and increase performance.

Barriers of Reordering of Memory Operations

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};
In practice, the memory_order_consume barrier is not used, because the standard itself doubts the practicability of its usage:
(1.3) — memory_order_consume: a load operation performs a consume operation on the affected memory location. [ Note: Prefer memory_order_acquire, which provides stronger guarantees than memory_order_consume. Implementations have found it infeasible to provide performance better than that of memory_order_acquire. Specification revisions are under consideration. — end note ]

Note that the memory_order_acq_rel barrier is used only for atomic compound RMW (Read-Modify-Write) operations, such as: compare_exchange_weak()/_strong(), exchange(), fetch_(add, sub, and, or, xor) or their corresponding operators.

The remaining four memory barriers can be used for any operations, except for the following: 
"acquire" is not used for store(), and "release" is not used for load() (i.e., acquire applies to loads, release applies to stores).

Requirements

  1. What do memory barriers give us?
  2. What kind of lock do we want to achieve?
    First, a spinlock, i.e., a std::mutex-like lock.
    A spinlock allows only one thread to proceed at a time.
    Such a code area is called the critical section. Inside it, you can use any normal code, including code without std::atomic<>.
  3. Memory barriers prevent the compiler from optimizing the program in a way that would move any operation out of the critical section.

e.g.
The compiler optimizer is not allowed to move instructions from the critical section to the outside:
  • No instruction placed after memory_order_acquire can be executed before it.
  • No instruction placed before memory_order_release can be executed after it.
Any other changes in the order of execution of independent instructions can be performed by the compiler
(compile-time) or by the CPU (run-time) in order to optimize the performance.

Thread-local dependencies are always preserved, just as in single-threaded execution.
i.e.
int a = 0;
a = 1 + 42; // - 1
int b = a;  // - 2
1 and 2 cannot be reordered.

To realize locks (mutex, spinlock ...), we should use Acquire-Release semantics.
§ 1.10.1 (3)
… For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A.

Acquire-Release Semantic



The main point of the Acquire-Release semantics is that: Thread-2 after performing the flag.load(std::memory_order_acquire) operation should see all the changes to any variables/structures/classes (not even atomic ones) that have been made by Thread-1 before it executed the flag.store(0, std::memory_order_release) operation.

What exactly is the compiler doing in std::memory_order:

1,6: The compiler generates the assembler instructions acquire-barrier for the load operation and the release-barrier for the store operation, if these barriers are necessary for the given CPU architecture
2: The compiler cancels the previous caching of variables in the CPU registers in order to reload the values of these variables changed by another thread - after the load(acquire) operation
5: The compiler saves the values of all variables from the CPU registers to the memory so that they can be seen by other threads, i.e., it performs spilling - up to store(release)
3,4: The compiler prevents the optimizer from changing the order of the instructions in the forbidden directions - indicated by red arrows

With the above knowledge, let's write the spinlock class:
class spinlock_t {
    std::atomic_flag lock_flag;
public:
    spinlock_t() { lock_flag.clear(); }

    bool try_lock() { return !lock_flag.test_and_set(std::memory_order_acquire); }
    void lock() { for (size_t i = 0; !try_lock(); ++i)
                  if (i % 100 == 0) std::this_thread::yield(); }
    void unlock() { lock_flag.clear(std::memory_order_release); }
};


The following details concern x86_64 assembler, where the compiler cannot interchange instructions for optimization:
  • seq_cst. The main difference (Clang and MSVC) from GCC is when you use the Store operation for the Sequential Consistency semantics, namely:
    a.store (val, memory_order_seq_cst);
    in this case, Clang and MSVC generate the
    [LOCK] XCHG reg, [addr]
    instruction, which cleans the CPU store-buffer in the same way as the MFENCE barrier does.
    And GCC in this case uses two instructions
    MOV [addr], reg and MFENCE
  • RMW (CAS, ADD…) always seq_cst.
    As all atomic RMW (Read-Modify-Write) instructions on x86_64 have the LOCK prefix, which cleans the store-buffer, they all correspond to the Sequential-Consistency semantics at the assembler code level.
    Any memory_order for an RMW generates identical code, including memory_order_acq_rel.
  • LOAD(acquire), STORE(release).
    As you can see, on x86_64, the first 4 memory barriers (relaxed, consume, acquire, release) generate an identical assembler code - i.e., x86_64 architecture provides the acquire-release semantics automatically. Besides, it is provided by the MESIF (Intel) / MOESI (AMD) cache coherency protocols.
    This is only true for the memory allocated by the usual compiler tools, which is marked by default as Write Back (but it’s not true for the memory marked as Un-cacheable or Write Combined, which is used for work with the Mapped Memory Area from Device - only Acquire- Semantic is automatically provided in it).

Dependent operations cannot ever be reordered anywhere

Read-X – Read-Y
Read-X – Write-Y
Write-X – Write-Y



Acquire-Release vs. Sequential-Consistent total order



There is one more feature of the data exchange between the threads, which is manifested upon interaction of four threads or more. If at least one of the following operations does not use the most stringent barrier memory_order_seq_cst, then different threads can see the same changes in different order. For example:
  1. If thread-1 changed some value first
  2. And thread-2 changed some value second
  3. Then thread-3 can first see the changes made by thread-2, and only after that it will see the changes made by thread-1
  4. And thread-4, on the contrary, can first see the changes made by thread-1, and only after that, it will see the changes made by thread-2
This is possible because of the hardware features of the cache-coherent protocol and the topology of location of the cores in the CPUs. In this case, some two cores can see the changes made by each other before they see the changes made by other cores. In order that all threads could see the changes in the same order, i.e., they would have a single total order (C++ 11 § 29.3 (3)), it is necessary that all operations (LOAD, STORE, RMW) would be performed with the memory barrier memory_order_seq_cst

Acquire-Release Ordering


Acquire-Release vs. Sequential-Consistency reordering


Active Spin-Locks and Recursive-Spin-Lock

SC is SLOW.

Sequential consistency generates store-buffer flushes (MFENCE on x86_64); at the assembly level it actually corresponds to the slowest semantics.

There is a type of algorithm that is classified as write contention-free - when there is not a single memory cell in which it would be possible to write more than one thread.

In a more general case, there is not a single cache line in which it would be possible to write more than one thread. In order to have our shared-mutex be classified as write contention-free only in the presence of readers, it is necessary that readers do not interfere with each other - i.e., each reader should write a flag (which is read by it) to its own cell and remove the flag in the same cell at the end of reading - without RMW operations.

Write contention-free is the most productive guarantee, which is more productive than wait-free or lock-free.

It is possible that each cell is located in a separate cache line to exclude false-sharing, and it is possible that cells lie tightly - 16 in one cache line - the performance loss will depend on the CPU and the number of threads.

Before setting the flag, the reader checks if there is a writer - i.e., if there is an exclusive lock. And since shared-mutex is used in cases where there are very few writers, then all the used cores can have a copy of this value in their cache-L1 in shared-state (S), where they will receive the value of the writer’s flag for 3 cycles, until it changes.

For all writers, as usually, there is the same flag want_x_lock - it means that there is a writer at the moment. The writer threads set and remove it by using RMW-operations.


Oct 13, 2021

[C++] std::shared_mutex

Reference:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2406.html#shared_mutex_imp
https://stackoverflow.com/a/57709957


shared_mutex can certainly be implemented on top of an OS supplied read-write mutex. However a portable subset of the implementation is shown here for the purpose of motivating the existence of cond_var: the razor-thin layer over the OS condition variable.

A secondary motivation is to explain the lack of reader-writer priority policies in shared_mutex.

This is due to an algorithm credited to Alexander Terekhov which lets the OS decide which thread is the next to get the lock without caring whether a unique lock or shared lock is being sought. This results in a complete lack of reader or writer starvation. It is simply fair.


On Linux, GCC implements std::shared_mutex on top of pthread_rwlock_t (and thus uses the kernel's scheduler); the header macro
_GLIBCXX_USE_PTHREAD_RWLOCK_T
controls whether pthread_rwlock_t is used.

By default, pthread_rwlock_t prefers readers over writers.

This can be configured through:
PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP

Aug 3, 2021

[C++] note about std::shared_mutex and pthread_rwlock_t

Reference:
std::shared_mutex
pthread_rwlock_init
stackoverflow response:
https://stackoverflow.com/a/57709957
https://stackoverflow.com/a/2190271


Takeaway:

C++17's std::shared_mutex on Linux might use pthread_rwlock_t underneath; thus, in order to avoid writer starvation, setting PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP on the pthread_rwlockattr_t passed to pthread_rwlock_init is necessary.

For implementations not built on pthread_rwlock_t, std::shared_mutex should rely on the Linux kernel's scheduler, which is fair and avoids both writer and reader starvation.
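A hedged sketch of that tweak (glibc-specific, hence the _NP suffix; the helper name is illustrative and error handling is mostly elided):

```cpp
#include <pthread.h>

// Initialize a rwlock whose attribute requests writer preference, so a
// steady stream of readers cannot starve a waiting writer.
int init_writer_preferring_rwlock(pthread_rwlock_t* rw) {
  pthread_rwlockattr_t attr;
  pthread_rwlockattr_init(&attr);
  pthread_rwlockattr_setkind_np(&attr,
      PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
  int rc = pthread_rwlock_init(rw, &attr);
  pthread_rwlockattr_destroy(&attr);  // safe once init has copied the attr
  return rc;
}
```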


Here's how Go handles starvation:

http://vsdmars.blogspot.com/2021/03/go-methods-for-lock-starvation-barging.html

Jul 23, 2021

[C++] consteval in C++20

Reference:
https://andreasfertig.blog/2021/07/cpp20-a-neat-trick-with-consteval/

What consteval does:
As the name of the keyword tries to imply, it forces a constant evaluation.
In the standard, a function that is marked as consteval is called an immediate function.
The keyword can be applied only to functions/function templates.
Immediate here means that the function is evaluated at the front-end, yielding only a value, which the back-end uses.
Such a function never goes into your binary.
A consteval-function must be evaluated at compile-time or compilation fails.
With that, a consteval-function is a stronger version of constexpr-functions.



template <auto value>
inline constexpr auto as_constant = value;  // using `value` as a template argument forces constant evaluation

constexpr int Calc(int x) { return 4 * x; }

int main() {
    auto res = as_constant<Calc(2)>;  // Calc(2) must be evaluated at compile time here
    ++res;                            // res itself remains an ordinary runtime variable
}

Jan 26, 2021

[C++17] quick note about evaluation order

Reference: https://en.cppreference.com/w/cpp/language/eval_order

C++17 has settled long-standing evaluation order issues in C/C++,

the modified list:

  • For
    e1[ e2 ]
    e1.e2
    e1.*e2 
    e1->*e2 
    e1 << e2
    e1 >> e2
    
    e1 is now guaranteed to be evaluated before e2, so the evaluation order is left to right. However, note that the evaluation order of different arguments of the same function call is still unspecified.
    That is, in
    e1.f(a1,a2,a3)
    
    e1.f is now guaranteed to be evaluated before a1, a2, and a3.
    However, the relative evaluation order of a1, a2, and a3 is still unspecified (though in C++17 their evaluations can no longer interleave).
  • In all assignment operators
    e2 = e1
    e2 += e1
    e2 *= e1
    
    the right-hand side e1 is now guaranteed to be evaluated before the left-hand side e2.
  • Finally, in new expressions like
    new Type(e)
    
    the allocation is now guaranteed to be performed before the evaluation of e, and the initialization of the new value is guaranteed to happen before any usage of the allocated and initialized value.
  • Preprocessor Condition __has_include
    e.g
    #if __has_include(<filesystem>)
    #  include <filesystem>
    #  define HAS_FILESYSTEM 1
    #elif __has_include(<experimental/filesystem>)
    #  include <experimental/filesystem>
    #  define HAS_FILESYSTEM 1
    #  define FILESYSTEM_IS_EXPERIMENTAL 1
    #elif __has_include("filesystem.hpp")
    #  include "filesystem.hpp"
    #  define HAS_FILESYSTEM 1
    #  define FILESYSTEM_IS_EXPERIMENTAL 1
    #else
    #  define HAS_FILESYSTEM 0
    #endif
    
    This can be handy as a replacement for cmake's library header existence check.
    However, be aware that the conditions inside __has_include(...) evaluate to 1 (true) if a corresponding #include command would be valid.
    Nothing else matters (e.g., the answer does not depend on whether the file was already included). Furthermore, the fact that the file exists does not prove that it has the expected contents.
    It might be empty or invalid.
    __has_include is a pure preprocessor feature. Using __has_include as condition in source code is not possible:
    if (__has_include(<filesystem>)) { }  // ERROR
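Back to the expression rules above: the member-call ordering is what makes this classic (formerly non-portable) Stroustrup example well-defined in C++17, because each `.replace`'s object expression is fully evaluated before its `s.find` argument runs:

```cpp
#include <string>

std::string chained_replace() {
  std::string s = "but I have heard it works even if you don't believe in it";
  s.replace(0, 4, "")                       // runs first: strips "but "
   .replace(s.find("even"), 4, "only")      // find() sees the updated string
   .replace(s.find(" don't"), 6, "");
  return s;  // "I have heard it works only if you believe in it"
}
```

Before C++17 the `s.find` calls could run before the earlier replaces, producing garbage on some compilers.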
    

[C++] trivial_abi and borrow ownership

An interesting concept comes from Rust's ownership-and-borrowing model.

In C++, a unique_ptr can only be moved. Rather than reaching for the heavyweight shared_ptr
(i.e., 2 words in size, a heap-allocated control block, atomic ref-counting, weak_ptr support, etc.), the only way to borrow ownership from a unique_ptr is to pass out the raw pointer it holds.

Here's the tweet from Prof. John Regehr talking about this:
https://twitter.com/johnregehr/status/1351333748277592065
Follow up from s.o:
https://stackoverflow.com/a/30013891
i.e., in C++ the borrow concept can only be enforced at runtime, not at compile time, unlike in Rust.
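A sketch of what that runtime borrow looks like (Widget and the function names are illustrative):

```cpp
#include <memory>

struct Widget { int value = 0; };

// The "borrower": receives a raw, non-owning pointer and must never delete it.
// Nothing but convention stops it from stashing the pointer past the owner's
// lifetime, which is exactly the check Rust performs at compile time.
void borrower(Widget* w) { w->value += 1; }

int use_after_borrow() {
  auto owner = std::make_unique<Widget>();  // sole owner
  borrower(owner.get());                    // lend access, keep ownership
  return owner->value;                      // owner still valid here: 1
}
```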

Another thing worth mentioning is trivial_abi.
AMD64 ABI for C++: https://www.uclibc.org/docs/psABI-x86_64.pdf

When passing by value,

Quote from ABI:
If a C++ object has either a non-trivial copy constructor or a non-trivial destructor, it is passed by invisible reference (the object is replaced in the parameter list by a pointer […]).

And in Itanium C++ ABI document, quote:

non-trivial for the purposes of calls
A type is considered non-trivial for the purposes of calls if: it has a non-trivial copy constructor, move constructor, or destructor, or all of its copy and move constructors are deleted.
This definition, as applied to class types, is intended to be the complement of the definition in [class.temporary]p3 of types for which an extra temporary is allowed when passing or returning a type. A type which is trivial for the purposes of the ABI will be passed and returned according to the rules of the base C ABI, e.g. in registers; often this has the effect of performing a trivial copy of the type.

That is, passing a non-trivial type by value introduces double indirection during the call: the argument becomes a reference to a temporary r-value created on the caller's stack. Besides, the type could potentially not fit into registers for the callee anyway
(e.g., a virtual pointer to the v-table increases the size of the type).


Reference: https://www.raywenderlich.com/615-assembly-register-calling-convention-tutorial

Assembly code can be found in godbolt:
https://godbolt.org/z/s1TPea6sx


#include <cstdio>

#define TRIVIAL_ABI __attribute__((trivial_abi))

template <class T> T incr(T obj) {
  obj.value += 1;
  puts("before exit incr func");
  return obj;
}

struct Up1 {
  int value;
  Up1() = default;
  Up1(const Up1& u) : value(u.value) { puts("Up1 copy constructor"); }
  ~Up1() { printf("destroyed Up1 value: %d\n", value); }
};

struct TRIVIAL_ABI Up2 {
  int value;
  Up2() = default;
  Up2(const Up2& u) : value(u.value) { puts("Up2 copy constructor"); }
  ~Up2() { printf("destroyed Up2 value: %d\n", value); }
};

template Up1 incr(Up1);
template Up2 incr(Up2);

auto main() -> int {
  auto u1 = Up1{};
  puts("before call incr func for u1");
  incr(u1);

  printf("\n\n");

  auto u2 = Up2{};
  puts("before call incr func for u2");
  incr(u2);
}

Output (annotated):
before call incr func for u1
Up1 copy constructor        // temporary object created on the caller's stack
before exit incr func
Up1 copy constructor        // another temporary for the return value
destroyed Up1 value: 1
destroyed Up1 value: 1


before call incr func for u2
Up2 copy constructor        // the copy still runs, but the object travels in registers
before exit incr func
Up2 copy constructor
destroyed Up2 value: 1      // with trivial_abi the callee destroys the parameter
destroyed Up2 value: 1
destroyed Up2 value: 0
destroyed Up1 value: 0