reading note for Anthony Williams' talk:
Safety: Off - How not to shoot yourself in the foot with C++ atomics
ABA:
https://en.wikipedia.org/wiki/Compare-and-swap#ABA_problem
https://en.wikipedia.org/wiki/ABA_problem
ABA problem
Some CAS-based algorithms are affected by and must handle the problem of a false positive match, or the ABA problem.
It's possible that between the time the old value is read and the time CAS is attempted, some other processors or threads change the memory location two or more times such that it acquires a bit pattern which matches the old value.
The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address, or a wrapped version counter.
A general solution to this is to use a double-length CAS (e.g. on a 32-bit system, a 64-bit CAS). The second half is used to hold a counter. The compare part of the operation compares the previously read value of the pointer *and* the counter, to the current pointer and counter.
If they match, the swap occurs - the new value is written - but with an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same: for a 32-bit counter, a multiple of 2^32 operations would have to have occurred, causing the counter to wrap, and at that exact moment the pointer value would also have to be the same by chance.
An alternative form of this (useful on CPUs which lack DCAS) is to use an index into a freelist rather than a full pointer, e.g. with a 32-bit CAS, use a 16-bit index and a 16-bit counter. However, the reduced counter length begins to make ABA likely, especially at modern CPU speeds.
One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.
A more complicated but more effective solution is to implement safe memory reclamation (SMR). This is in effect lock-free garbage collection. The advantage of using SMR is the assurance a given pointer will exist only once at any one time in the data structure, thus the ABA problem is completely solved. (Without SMR, something like a freelist will be in use, to ensure that all data elements can be accessed safely (no memory access violations) even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed).
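The index-plus-counter variant described above can be sketched as follows. This is an illustrative sketch, not code from the talk: a 16-bit freelist index and a 16-bit ABA counter packed into one 32-bit word, so an ordinary 32-bit CAS compares both at once. The names (`Packed`, `pop_index`, `next`) are my own.

```cpp
#include <atomic>
#include <cstdint>

// 16-bit freelist index + 16-bit ABA counter packed into one 32-bit word.
struct Packed {
    static std::uint32_t make(std::uint16_t index, std::uint16_t count) {
        return (std::uint32_t(count) << 16) | index;
    }
    static std::uint16_t index(std::uint32_t v) { return v & 0xFFFFu; }
    static std::uint16_t count(std::uint32_t v) { return v >> 16; }
};

// Pop the head index from a freelist whose "next" links live in next[].
// The CAS fails if either the index or the counter has changed, so a
// recycled index with a different counter no longer matches.
std::uint16_t pop_index(std::atomic<std::uint32_t>& head,
                        const std::uint16_t* next) {
    std::uint32_t old_head = head.load();
    std::uint32_t new_head;
    do {
        std::uint16_t idx = Packed::index(old_head);
        new_head = Packed::make(next[idx],
                                Packed::count(old_head) + 1); // bump counter
    } while (!head.compare_exchange_weak(old_head, new_head));
    return Packed::index(old_head);
}
```

The counter is bumped on every successful pop, so even if the same index later returns to the head, the packed word differs.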
------------------------------------------------------------------------
std::atomic<T> provides an atomic type that can store
objects of type T.
- T can be a built-in type, or a class type of any size
- T must be trivially copyable
- compare_exchange_xxx operations require that you can
compare T objects with memcmp
- std::atomic<T> may not be lock free — especially for
large types
- std::atomic_flag provides a guaranteed-lock-free flag
type.
- The Concurrency TS provides atomic_shared_ptr and
atomic_weak_ptr.
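The compare_exchange semantics are worth a minimal sketch: on failure, the "expected" argument is updated to the value actually found, which is what makes the classic CAS retry loop work. Illustrative example, not from the talk:

```cpp
#include <atomic>

// On failure, compare_exchange_strong writes the value it actually
// found back into "expected", so the next attempt uses fresh data.
int cas_demo() {
    std::atomic<int> x{5};
    int expected = 3;                                  // wrong guess
    bool ok = x.compare_exchange_strong(expected, 99); // fails
    // ok == false, and expected has been reloaded as 5
    if (!ok) x.compare_exchange_strong(expected, 99);  // now succeeds
    return x.load();                                   // 99
}
```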
--
Atomic Operations:
General ops: <MAY NOT BE LOCK FREE>
load(), store(), exchange(),
compare_exchange_weak(),
compare_exchange_strong()
Arithmetic ops: <MAY NOT BE LOCK FREE>
for atomic<Integral> and atomic<T*>
- fetch_add(), fetch_sub()
++, --, +=, -=
Bitwise ops: <MAY NOT BE LOCK FREE>
for atomic<Integral>
- fetch_and(), fetch_or(), fetch_xor()
&=, |=, ^=
Flag ops: <GUARANTEED LOCK FREE>
for atomic_flag
- test_and_set(), clear()
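Since std::atomic_flag is the only guaranteed-lock-free type, test_and_set() and clear() are enough to build a minimal spinlock. A sketch (the class name is mine, not from the talk):

```cpp
#include <atomic>

// Minimal spinlock built on std::atomic_flag, the only type the
// standard guarantees to be lock free.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    // Spin until the flag was previously clear (i.e. we set it).
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Usage: guard a plain counter with the spinlock.
int guarded_increment(SpinLock& s, int& counter) {
    s.lock();
    int v = ++counter;
    s.unlock();
    return v;
}
```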
--
Memory Ordering Constraints:
6 values for the ordering on an operation:
- memory_order_seq_cst (the default)
- memory_order_acquire
- memory_order_release
- memory_order_acq_rel (RMW ops only)
- memory_order_relaxed (experts only)
- memory_order_consume (an optimized form of memory_order_acquire
  for special circumstances; experts only)
--
Fences:
C++ has two kinds of fences:
- std::atomic_thread_fence
Used for synchronizing between threads
- std::atomic_signal_fence
Used for synchronizing between a thread and a signal handler in that thread
Fences in C++ effectively modify the ordering constraints on
neighbouring atomic operations rather than providing any direct
ordering constraints themselves.
x.load(memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
    => behaves like x.load(memory_order_acquire);

atomic_thread_fence(memory_order_release);
x.store(value, memory_order_relaxed);
    => behaves like x.store(value, memory_order_release);
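That equivalence in a two-thread sketch: a release fence before a relaxed store pairs with an acquire fence after a relaxed load, so the consumer is guaranteed to see the plain write once the flag is true. Illustrative code, not from the talk:

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                       // plain, non-atomic write
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);    // fence makes this a release
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {} // fence makes this an acquire
    std::atomic_thread_fence(std::memory_order_acquire);
    return data;                                      // guaranteed to be 42
}

int run() {
    std::thread t(producer);
    int v = consumer();
    t.join();
    return v;
}
```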
--
memory_order_acq_rel fences behave as both
memory_order_acquire and memory_order_release fences.
memory_order_seq_cst fences are special: they form part
of the total order of memory_order_seq_cst operations, and
can therefore enforce orderings beyond the direct pairwise
acquire-release orderings.
If you’re relying on this, you’ve probably done something wrong.
--
Lock-free terminology:
Obstruction free (Weakest guarantee):
If all other threads are paused then any given
thread will complete its operation in a bounded
number of steps.
Lock free (Most common guarantee):
If multiple threads are operating on a data
structure then after a bounded number of steps
one of them will complete its operation.
Wait free (Strongest guarantee):
(Ask: is there a bound on the number of steps? If not,
it is not wait free. Consider every code path.)
Every thread operating on a data structure will
complete its operation in a bounded number of
steps, even if other threads are also operating on
the data structure.
Whether an algorithm is lock-free or merely obstruction-free
can hinge on the compare_exchange_weak implementation, since
compare_exchange_weak is allowed to fail spuriously.
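The distinction shows up in the classic CAS retry loop. The push below is lock-free but not wait-free: an individual thread may retry forever, but each failed CAS means some other thread's push succeeded, so the system as a whole makes progress. Illustrative sketch:

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

// Lock-free (not wait-free) push onto an intrusive stack.
void push(std::atomic<Node*>& head, Node* n) {
    n->next = head.load();
    // On failure, compare_exchange_weak reloads n->next with the
    // current head, so each retry works from fresh data.
    while (!head.compare_exchange_weak(n->next, n)) {}
}

int top_value(std::atomic<Node*>& head) {
    Node* n = head.load();
    return n ? n->value : -1;
}
```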
----------
Performance: Cache Ping-Pong:
Cache Ping-Pong is where a cacheline is continuously
shuttled back and forth between two processors.
This occurs when two threads (on different CPU cores) are accessing either:
- the same atomic variable
- different variables on the same cache line
This can have a big performance impact, because transferring
cache lines is **slow**
So:
The solution to cache ping-pong is to put data on different
cache lines by adding padding. This trades memory space
for performance.
e.g. (with padding_size at least the cache line size, e.g.
std::hardware_destructive_interference_size in C++17):
std::atomic<unsigned> push_hint{0};
char padding1[padding_size];
entry* head{nullptr};
char padding2[padding_size];
std::atomic<entry*> tail{nullptr};
char padding3[padding_size];
entry buffer[buffer_size];
---------
**
You should use memory_order_seq_cst unless you have
a strong reason not to.
**
---------
For x86, only store is affected by the memory order, but for
architectures like POWER and ARM with weaker default
synchronization, all operations can be affected.
(x86 CPUs handle store ordering automatically, so the smp_wmb()
primitive does nothing, but loads may be reordered, so smp_rmb()
and smp_mb() expand to lock; addl.)
You must test on a weakly-ordered system like POWER or
ARM if you’re using anything other than
memory_order_seq_cst.
---------
A stack is a simpler data structure than a queue. It's great for
examples, !!but bad for real use!!, because every thread contends
for the same top-of-stack.
---------
the A-B-A problem:
The setup:
1. A value changes from A to B and back to A,
2. Other aspects of the data structure have changed, and
3. A thread makes a change based on the first time the value
was A that is inconsistent with the new state of the data
structure.
This most commonly happens where the value is a pointer.
(Because a freed object's memory can be recycled, so a new object
may end up at the same virtual memory address.)
--
A-B-A: Solutions:
Do not allow a variable to return to its previous value while a
thread can do something based on the old value.
Use a change count as part of the variable:
- struct Value{ T* ptr; unsigned count;};
- std::atomic<Value> v;
Ensure that objects are not recycled when still accessible,
so A-B-A never happens.
=> Reference count the objects, e.g. with
std::shared_ptr and atomic_shared_ptr or use
hazard pointers, or something similar.
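The reference-counting route can be sketched with the C++11 free functions std::atomic_load / std::atomic_compare_exchange_weak on shared_ptr (the Concurrency TS atomic_shared_ptr packages the same idea; these free functions are deprecated in C++20 in favour of std::atomic<std::shared_ptr<T>>). Because nodes are kept alive by their reference counts while any thread still holds them, a node's address cannot be recycled mid-operation, so A-B-A cannot occur. Illustrative sketch, not the talk's code:

```cpp
#include <atomic>
#include <memory>

struct Node {
    int value;
    std::shared_ptr<Node> next;
};

// Push using the shared_ptr atomic free functions: the reference
// count keeps a node alive as long as any thread can see it.
void push(std::shared_ptr<Node>& head, int v) {
    auto n = std::make_shared<Node>();
    n->value = v;
    n->next = std::atomic_load(&head);
    // On failure, n->next is reloaded with the current head.
    while (!std::atomic_compare_exchange_weak(&head, &n->next, n)) {}
}

int top(const std::shared_ptr<Node>& head) {
    auto n = std::atomic_load(&head);
    return n ? n->value : -1;
}
```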
---------
Guidelines
1. Don’t use atomics unless you have to
2. Profile before and after
3. Test on a weakly-ordered architecture such as POWER or
ARM
4. Don’t use atomics unless you really have to
5. Think in transactions
Do work off to the side and commit with a single
atomic operation.
6. Split big operations
If the operation is too big to do in one step, split it
into smaller steps that retain the data structure
invariants.
7. Limit use cases
Restrict the permitted concurrency levels where
possible to reduce implementation complexity.
8. Watch out for ABA problems
These require the circumstances to align just so,
but will destroy your data structure when they
happen. They can be easily missed in testing.
9. Avoid cache ping pong
Add padding between variables that are accessed
from different threads. Try to avoid having many
threads access the same variable.
10. Stick to memory_order_seq_cst
Unless you really know what you’re doing, and
really need the performance gain, stick to the
default memory_order_seq_cst. Anything else
can be a nightmare to prove correct.
11. Package things up
Wrap atomic operations with types that only
expose the desired functionality, to clarify the user
code and hide the complexity.
12. Aim for lock-free
Aim for your code to be at least obstruction-free,
and preferably lock-free. Leave wait-free for those
rare circumstances where you need it.
13. Avoid busy waits
If you’re actually waiting (as opposed to spinning
on a compare_exchange_weak operation), use
a proper wait mechanism.
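Guideline 13 in a sketch: instead of spinning on a flag, block on a std::condition_variable until another thread signals. The function names here are illustrative:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool done = false;

void finish_work() {
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_one();   // wake the waiter without busy-waiting
}

bool wait_for_work() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return done; });  // sleeps instead of burning CPU
    return done;
}

bool run() {
    std::thread t(finish_work);
    bool r = wait_for_work();
    t.join();
    return r;
}
```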