reading note for Anthony Williams' talk:
Safety: Off - How not to shoot yourself in the foot with C++ atomics
ABA:
https://en.wikipedia.org/wiki/Compare-and-swap#ABA_problem
https://en.wikipedia.org/wiki/ABA_problem
ABA problem
Some CAS-based algorithms are affected by and must handle the problem of a false positive match, or the ABA problem.
It's possible that between the time the old value is read and the time CAS is attempted, some other processors or threads change the memory location two or more times such that it acquires a bit pattern which matches the old value.
The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address, or a wrapped version counter.
A general solution to this is to use a double-length CAS (e.g. on a 32-bit system, a 64-bit CAS). The second half is used to hold a counter. The compare part of the operation compares the previously read value of the pointer *and* the counter, to the current pointer and counter.
If they match, the swap occurs - the new value is written - but with an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same: for a 32-bit counter, a multiple of 2^32 operations would have to have occurred, causing the counter to wrap, and at that exact moment the pointer value would also have to be the same by chance.
An alternative form of this (useful on CPUs which lack DCAS) is to use an index into a freelist rather than a full pointer, e.g. with a 32-bit CAS, use a 16-bit index and a 16-bit counter. However, the reduced counter length begins to make ABA likely, especially at modern CPU speeds.
One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.
A more complicated but more effective solution is to implement safe memory reclamation (SMR). This is in effect lock-free garbage collection. The advantage of using SMR is the assurance a given pointer will exist only once at any one time in the data structure, thus the ABA problem is completely solved. (Without SMR, something like a freelist will be in use, to ensure that all data elements can be accessed safely (no memory access violations) even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed).
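The index-plus-counter variant described above can be sketched as follows. This is an illustrative sketch, not code from the talk: a 16-bit freelist index and a 16-bit ABA counter packed into one 32-bit word, so an ordinary 32-bit CAS compares both at once. The names (`Packed`, `pop_index`, `next`) are my own.

```cpp
#include <atomic>
#include <cstdint>

// 16-bit freelist index + 16-bit ABA counter packed into one 32-bit word.
struct Packed {
    static std::uint32_t make(std::uint16_t index, std::uint16_t count) {
        return (std::uint32_t(count) << 16) | index;
    }
    static std::uint16_t index(std::uint32_t v) { return v & 0xFFFFu; }
    static std::uint16_t count(std::uint32_t v) { return v >> 16; }
};

// Pop the head index from a freelist whose "next" links live in next[].
// The CAS fails if either the index or the counter has changed, so a
// recycled index with a different counter no longer matches.
std::uint16_t pop_index(std::atomic<std::uint32_t>& head,
                        const std::uint16_t* next) {
    std::uint32_t old_head = head.load();
    std::uint32_t new_head;
    do {
        std::uint16_t idx = Packed::index(old_head);
        new_head = Packed::make(next[idx],
                                Packed::count(old_head) + 1); // bump counter
    } while (!head.compare_exchange_weak(old_head, new_head));
    return Packed::index(old_head);
}
```

The counter is bumped on every successful pop, so even if the same index later returns to the head, the packed word differs.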
------------------------------------------------------------------------
std::atomic<T> provides an atomic type that can store
objects of type T.
- T can be a built-in type, or a class type of any size
- T must be trivially copyable
- compare_exchange_xxx operations require that you can
compare T objects with memcmp
- std::atomic<T> may not be lock free — especially for
large types
- std::atomic_flag provides a guaranteed-lock-free flag
type.
- The Concurrency TS provides atomic_shared_ptr and
atomic_weak_ptr.
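The compare_exchange semantics are worth a minimal sketch: on failure, the "expected" argument is updated to the value actually found, which is what makes the classic CAS retry loop work. Illustrative example, not from the talk:

```cpp
#include <atomic>

// On failure, compare_exchange_strong writes the value it actually
// found back into "expected", so the next attempt uses fresh data.
int cas_demo() {
    std::atomic<int> x{5};
    int expected = 3;                                  // wrong guess
    bool ok = x.compare_exchange_strong(expected, 99); // fails
    // ok == false, and expected has been reloaded as 5
    if (!ok) x.compare_exchange_strong(expected, 99);  // now succeeds
    return x.load();                                   // 99
}
```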
--
Atomic Operations:
General ops: <MAY NOT BE LOCK FREE>
load(), store(), exchange(),
compare_exchange_weak(),
compare_exchange_strong()
Arithmetic ops: <MAY NOT BE LOCK FREE>
for atomic<Integral> and atomic<T*>
- fetch_add(), fetch_sub()
++, --, +=, -=
Bitwise ops: <MAY NOT BE LOCK FREE>
for atomic<Integral>
- fetch_and(), fetch_or(), fetch_xor()
&=, |=, ^=
Flag ops: <GUARANTEED LOCK FREE>
for atomic_flag
- test_and_set(), clear()
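Since std::atomic_flag is the only guaranteed-lock-free type, test_and_set() and clear() are enough to build a minimal spinlock. A sketch (the class name is mine, not from the talk):

```cpp
#include <atomic>

// Minimal spinlock built on std::atomic_flag, the only type the
// standard guarantees to be lock free.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    // Spin until the flag was previously clear (i.e. we set it).
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Usage: guard a plain counter with the spinlock.
int guarded_increment(SpinLock& s, int& counter) {
    s.lock();
    int v = ++counter;
    s.unlock();
    return v;
}
```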
--
Memory Ordering Constraints:
6 values for the ordering on an operation:
- memory_order_seq_cst (the default)
- memory_order_acquire
- memory_order_release
- memory_order_acq_rel (RMW ops only)
- memory_order_relaxed (experts only)
- memory_order_consume (an optimized form of memory_order_acquire
  for special circumstances; experts only)
--
Fences:
C++ has two kinds of fences:
- std::atomic_thread_fence
Used for synchronizing between threads
- std::atomic_signal_fence
Used for synchronizing between a thread and a signal handler in that thread
Fences in C++ effectively modify the ordering constraints on
neighbouring atomic operations rather than providing any direct
ordering constraints themselves.
x.load(memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
    => behaves like x.load(memory_order_acquire);

atomic_thread_fence(memory_order_release);
x.store(value, memory_order_relaxed);
    => behaves like x.store(value, memory_order_release);
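That equivalence in a two-thread sketch: a release fence before a relaxed store pairs with an acquire fence after a relaxed load, so the consumer is guaranteed to see the plain write once the flag is true. Illustrative code, not from the talk:

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                       // plain, non-atomic write
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);    // fence makes this a release
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {} // fence makes this an acquire
    std::atomic_thread_fence(std::memory_order_acquire);
    return data;                                      // guaranteed to be 42
}

int run() {
    std::thread t(producer);
    int v = consumer();
    t.join();
    return v;
}
```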
--
memory_order_acq_rel fences behave as both
memory_order_acquire and memory_order_release fences.
memory_order_seq_cst fences are special: they form part
of the total order of memory_order_seq_cst operations, and
can therefore enforce orderings beyond the direct pairwise
acquire-release orderings.
If you’re relying on this, you’ve probably done something wrong.
--
Lock-free terminology:
Obstruction free (Weakest guarantee):
If all other threads are paused then any given
thread will complete its operation in a bounded
number of steps.
Lock free (Most common guarantee):
If multiple threads are operating on a data
structure then after a bounded number of steps
one of them will complete its operation.
Wait free (Strongest guarantee):
(Ask: is there a bound on the number of steps? If not,
it is not wait free. Consider every code path.)
Every thread operating on a data structure will
complete its operation in a bounded number of
steps, even if other threads are also operating on
the data structure.
Whether an algorithm is lock-free or merely obstruction-free
can hinge on the compare_exchange_weak implementation, since
compare_exchange_weak is allowed to fail spuriously.
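The distinction shows up in the classic CAS retry loop. The push below is lock-free but not wait-free: an individual thread may retry forever, but each failed CAS means some other thread's push succeeded, so the system as a whole makes progress. Illustrative sketch:

```cpp
#include <atomic>

struct Node {
    int value;
    Node* next;
};

// Lock-free (not wait-free) push onto an intrusive stack.
void push(std::atomic<Node*>& head, Node* n) {
    n->next = head.load();
    // On failure, compare_exchange_weak reloads n->next with the
    // current head, so each retry works from fresh data.
    while (!head.compare_exchange_weak(n->next, n)) {}
}

int top_value(std::atomic<Node*>& head) {
    Node* n = head.load();
    return n ? n->value : -1;
}
```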
----------
Performance: Cache Ping-Pong:
Cache Ping-Pong is where a cacheline is continuously
shuttled back and forth between two processors.
This occurs when two threads (on different CPU cores) are accessing either:
- the same atomic variable
- different variables on the same cache line
This can have a big performance impact, because transferring
cache lines is **slow**
So:
The solution to cache ping-pong is to put data on different
cache lines by adding padding. This trades memory space
for performance.
e.g. (with padding_size at least the cache line size, e.g.
std::hardware_destructive_interference_size in C++17):
std::atomic<unsigned> push_hint{0};
char padding1[padding_size];
entry* head{nullptr};
char padding2[padding_size];
std::atomic<entry*> tail{nullptr};
char padding3[padding_size];
entry buffer[buffer_size];
---------
**
You should use memory_order_seq_cst unless you have
a strong reason not to.
**
---------
For x86, only store is affected by the memory order, but for
architectures like POWER and ARM with weaker default
synchronization, all operations can be affected.
(x86 CPUs handle store ordering automatically, so the smp_wmb()
primitive does nothing, but loads may be reordered, so smp_rmb()
and smp_mb() expand to lock; addl.)
You must test on a weakly-ordered system like POWER or
ARM if you’re using anything other than
memory_order_seq_cst.
---------
A stack is a simpler data structure than a queue. It's great for
examples, !!but bad for real use!!, because every thread contends
for the same top-of-stack.
---------
the A-B-A problem:
The setup:
1. A value changes from A to B and back to A,
2. Other aspects of the data structure have changed, and
3. A thread makes a change based on the first time the value
was A that is inconsistent with the new state of the data
structure.
This most commonly happens where the value is a pointer.
(Because a freed object's memory can be recycled, so a new object
may end up at the same virtual memory address.)
--
A-B-A: Solutions:
Do not allow a variable to return to its previous value while a
thread can do something based on the old value.
Use a change count as part of the variable:
- struct Value{ T* ptr; unsigned count;};
- std::atomic<Value> v;
Ensure that objects are not recycled when still accessible,
so A-B-A never happens.
=> Reference count the objects, e.g. with
std::shared_ptr and atomic_shared_ptr or use
hazard pointers, or something similar.
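The reference-counting route can be sketched with the C++11 free functions std::atomic_load / std::atomic_compare_exchange_weak on shared_ptr (the Concurrency TS atomic_shared_ptr packages the same idea; these free functions are deprecated in C++20 in favour of std::atomic<std::shared_ptr<T>>). Because nodes are kept alive by their reference counts while any thread still holds them, a node's address cannot be recycled mid-operation, so A-B-A cannot occur. Illustrative sketch, not the talk's code:

```cpp
#include <atomic>
#include <memory>

struct Node {
    int value;
    std::shared_ptr<Node> next;
};

// Push using the shared_ptr atomic free functions: the reference
// count keeps a node alive as long as any thread can see it.
void push(std::shared_ptr<Node>& head, int v) {
    auto n = std::make_shared<Node>();
    n->value = v;
    n->next = std::atomic_load(&head);
    // On failure, n->next is reloaded with the current head.
    while (!std::atomic_compare_exchange_weak(&head, &n->next, n)) {}
}

int top(const std::shared_ptr<Node>& head) {
    auto n = std::atomic_load(&head);
    return n ? n->value : -1;
}
```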
---------
Guidelines
1. Don’t use atomics unless you have to
2. Profile before and after
3. Test on a weakly-ordered architecture such as POWER or
ARM
4. Don’t use atomics unless you really have to
5. Think in transactions
Do work off to the side and commit with a single
atomic operation.
6. Split big operations
If the operation is too big to do in one step, split it
into smaller steps that retain the data structure
invariants.
7. Limit use cases
Restrict the permitted concurrency levels where
possible to reduce implementation complexity.
8. Watch out for ABA problems
These require the circumstances to align just so,
but will destroy your data structure when they
happen. They can be easily missed in testing.
9. Avoid cache ping pong
Add padding between variables that are accessed
from different threads. Try to avoid having many
threads access the same variable.
10. Stick to memory_order_seq_cst
Unless you really know what you’re doing, and
really need the performance gain, stick to the
default memory_order_seq_cst. Anything else
can be a nightmare to prove correct.
11. Package things up
Wrap atomic operations with types that only
expose the desired functionality, to clarify the user
code and hide the complexity.
12. Aim for lock-free
Aim for your code to be at least obstruction-free,
and preferably lock-free. Leave wait-free for those
rare circumstances where you need it.
13. Avoid busy waits
If you’re actually waiting (as opposed to spinning
on a compare_exchange_weak operation), use
a proper wait mechanism.
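Guideline 13 in a sketch: instead of spinning on a flag, block on a std::condition_variable until another thread signals. The function names here are illustrative:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool done = false;

void finish_work() {
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_one();   // wake the waiter without busy-waiting
}

bool wait_for_work() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return done; });  // sleeps instead of burning CPU
    return done;
}

bool run() {
    std::thread t(finish_work);
    bool r = wait_for_work();
    t.join();
    return r;
}
```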