Oct 10, 2015

[C++][CppCon2015] [Note] Live lock-free or deadlock

CppCon 2015: Live Lock-Free or Deadlock

(figure from: An Introduction to Lock-Free Programming)

  • Lock-free programming is hard, be sure you need it
  • Never guess about performance
  • Lock-free programming is mostly about memory order
  • Know potential dangers of locks and how to avoid them
  • Design lock-free data structures, not algorithms
  • Avoid writing generic lock-free code
  • Minimize data sharing, know exactly what is shared
  • Design in terms of atomic transactions
  • Lock-free data structures usually need special interfaces
  • Take every advantage of practical limitations
  • Use DCLP (the double-checked locking pattern) to handle special cases (sketch below)
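A minimal sketch of DCLP in C++11, with a hypothetical lazily-initialized Widget; the names are mine, not from the talk. The acquire/release pair on the atomic pointer is what makes the lock-free fast path safe:

    #include <atomic>
    #include <mutex>

    struct Widget { /* expensive to construct */ };

    std::atomic<Widget*> g_instance{nullptr};
    std::mutex g_mutex;

    Widget* get_instance() {
        // Fast path: one acquire load, no lock once initialized.
        Widget* p = g_instance.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(g_mutex);
            // Re-check under the lock: another thread may have won the race.
            p = g_instance.load(std::memory_order_relaxed);
            if (p == nullptr) {
                p = new Widget;
                g_instance.store(p, std::memory_order_release);
            }
        }
        return p;
    }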


(from the book The Art of Multiprocessor Programming)
  • Wait-free programs: each thread completes its task in a finite number of steps (a finite amount of time) no matter what other threads are doing
    — At all times, all threads make progress toward the final result
  • Lock-free programs: programs without locks (duh!); at least one thread makes progress no matter what other threads are doing (see the counter sketch after this list)
  • Obstruction-free programs: a thread makes progress as long as no other thread accesses the shared data at the same time
  • Lock-based programs are not guaranteed to make progress toward the result in all cases
    — In particular, consider what happens if the thread holding the lock is suspended by the OS
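To make the lock-free guarantee concrete, a minimal counter sketch (my illustration, not code from the talk): when the CAS fails, it is because some other thread's CAS succeeded, so the system as a whole always makes progress even though an individual thread may retry indefinitely:

    #include <atomic>

    std::atomic<long> counter{0};

    // Lock-free increment: a failed compare_exchange means another
    // thread succeeded, so system-wide progress is guaranteed.
    void increment() {
        long old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed)) {
            // 'old' now holds the freshly observed value; retry.
        }
    }

By contrast, counter.fetch_add(1) maps to a single atomic instruction on x86 (LOCK XADD), so each thread finishes in a bounded number of steps, which is the wait-free guarantee.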

Problems With Locks
  • (Not a problem) Locks are simpler than any alternative
    — Most programs are not lock-free
    — Know the problems with locks not just to avoid locks, but also to avoid those problems when using locks
  • Problems when using locks:
    •  Deadlock (prevented by always acquiring locks in the same order)
    •  Livelock
    •  Priority inversion
    •  Convoying
    •  Reentrance and signal safety
    •  Shared memory between processes
    •  Preemption tolerance
    •  Performance
      • Performance of a well-written lock-based program may be no different from that of a lock-free program
      • Lock-based program ≠ mutex-based program!
      • Overhead of data sharing depends on the amount of data
      • Mutexes have high overhead – a mutex is itself a lot of shared data
      • Spinlocks have one shared word (a minimal spinlock sketch follows this list)
        — Plus the shared data they protect
      • If a lock-free program needs to share several pieces of data to arrange lock-free synchronization, it may end up being slower than a program with spinlocks
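For reference, a minimal test-and-set spinlock sketch (my illustration, not code from the talk): the lock itself is a single shared word, a std::atomic_flag, which is where the "one shared word" claim comes from:

    #include <atomic>

    class Spinlock {
        std::atomic_flag flag = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Spin until we observe the flag clear and atomically set it.
            while (flag.test_and_set(std::memory_order_acquire)) {
                // Busy-wait; a real implementation might pause or yield here.
            }
        }
        void unlock() {
            flag.clear(std::memory_order_release);
        }
    };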

Practical problems with locks
  • In practice a lot of these problems may not matter
    — I write HPC code; my threads don't have different priorities, I just want the answer as fast as possible.
    — Sure, the thread holding the lock can get suspended, but how often does that really happen?
    — If a lock protects an atomic transaction with no user code inside, this cannot deadlock or livelock.
    — I usually know what can and cannot be in shared memory
  •  In practice, the main motivation to get rid of locks is performance
    — Locks can make your program slow
    — Lock-free does not always make it faster
  •  In some special cases lock-free program may be simpler!

In practice, the main motivation to get rid of locks is performance
  •  Getting rid of locks does not automatically make the program faster
  •  How to atomically increment two 64-bit integers?
    — Some CPUs have 128-bit CAS (DCAS; see the sketch after this list)
    — There is a really clever algorithm using CAS and a 3rd integer
    — Do it under a spinlock, or use atomic exchange and busy-wait
    — On a wide range of x86 systems, even under heavy congestion, a spinlock is faster than CAS for up to 40-64 cores, and always faster than DCAS
  • The fastest is to not atomically increment two 64-bit integers
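A sketch of the wide-CAS option (my illustration; it assumes a target where a 16-byte trivially copyable struct gets a lock-free std::atomic, e.g. x86-64 with cmpxchg16b; check is_lock_free() before relying on it):

    #include <atomic>
    #include <cstdint>

    struct Pair {
        std::uint64_t a;
        std::uint64_t b;
    };

    std::atomic<Pair> counters{Pair{0, 0}};

    // Increment both integers as one transaction via a 128-bit CAS loop.
    // If the platform lacks a 16-byte CAS, std::atomic silently falls back
    // to an internal lock; counters.is_lock_free() reveals which one you got.
    void increment_both() {
        Pair old = counters.load(std::memory_order_relaxed);
        Pair next;
        do {
            next = Pair{old.a + 1, old.b + 1};
        } while (!counters.compare_exchange_weak(old, next,
                                                 std::memory_order_relaxed));
    }
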
Practical Problems with Lock-Free Programs
(and practical solutions)
  • Wait-Free and Lock-Free programs do not suffer from deadlocks, livelocks, priority-inversion or convoying; they can be reentrant and signal-safe; they can support shared memory.
  •  Lock-Free and Wait-Free algorithms are usually very hard to write, understand, and maintain.
    • minimize data sharing at the algorithm level.
    • use spinlock (or equivalent) for accessing shared data.
    • avoid designing lock-free algorithms; design lock-free data structures instead, it's easier and usually what you want.
    • design only what you need, avoid generic designs
      — Does not work if you are writing STL

Generic lock-free queue is very hard to write
  •  Problems:
    — Another atomically incremented counter for “front”
    — What to do when queue is empty
    — What to do when queue runs out of memory
  • Generic lock-free queue is not always faster than a well-written non-thread-safe queue with a spinlock

Data Sharing is Expensive
  • Non-thread-safe data access is faster than synchronized data access.
    — Simplest case: non-atomic vs atomic access
    — Comparing these avoids conflating the cost of sharing with the unknown overhead of locking
  • Why is synchronized access more expensive?
    • Cache coherency must be maintained in hardware.
    • Side note: circuits for caches and their maintenance are the largest part of modern CPUs and the limiting factor in increasing the number of cores.
      — Some CPU architectures give up on cache coherency altogether for that reason (Cell processor)

Performance myth: comparing a general lock-free queue with a spinlocked wrapper around std::queue
  • Results are inconsistent, varies a lot between CPU architectures (Haswell, Nehalem, Sandy Bridge) and machine configurations (number of CPUs, number of sockets)
  • Results depend strongly on access patterns and congestion
  • If there is a trend, it’s slightly in favor of std::queue (on the set of machines I have access to and set of access patterns tested)
  • Subtle implementation details matter, such as keeping the top and bottom pointers on separate cache lines (see the sketch below)
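As an illustration of that detail (my sketch, assuming C++17's std::hardware_destructive_interference_size, with a 64-byte fallback): aligning each index to its own cache line keeps the producer and consumer from falsely sharing a line:

    #include <atomic>
    #include <cstddef>
    #include <new>  // std::hardware_destructive_interference_size

    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t cache_line = 64;  // common x86 cache-line size
    #endif

    struct QueueIndices {
        // Written by the producer, read by the consumer.
        alignas(cache_line) std::atomic<std::size_t> tail{0};
        // On a separate cache line: written by the consumer.
        alignas(cache_line) std::atomic<std::size_t> head{0};
    };

Without the alignment, both indices would typically land on one cache line, and every push would invalidate the consumer's cached copy of that line (and vice versa).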

Special-case queues can be much faster:
  • Maximum number of elements known in advance
  • One producer thread, many consumer threads : 1P:NC
  • One consumer thread, many producer threads: NP:1C
  • Very different implementations, but well worth it (a 1P:1C sketch follows)
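To show how much the special case buys, here is a minimal bounded 1P:1C (single producer, single consumer) ring buffer sketch, an even narrower case than the 1P:NC and NP:1C queues above, and my illustration rather than code from the talk. It needs no CAS at all, only acquire/release loads and stores, because each index has exactly one writer:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>   // N fixed in advance; holds N-1 elements
    class SpscQueue {
        T buffer[N];
        std::atomic<std::size_t> head{0};  // next slot to pop; written only by consumer
        std::atomic<std::size_t> tail{0};  // next slot to push; written only by producer
    public:
        bool try_push(const T& value) {    // producer thread only
            std::size_t t = tail.load(std::memory_order_relaxed);
            std::size_t next = (t + 1) % N;
            if (next == head.load(std::memory_order_acquire))
                return false;              // full (one slot kept empty)
            buffer[t] = value;
            tail.store(next, std::memory_order_release);  // publish the element
            return true;
        }
        bool try_pop(T& out) {             // consumer thread only
            std::size_t h = head.load(std::memory_order_relaxed);
            if (h == tail.load(std::memory_order_acquire))
                return false;              // empty
            out = buffer[h];
            head.store((h + 1) % N, std::memory_order_release);
            return true;
        }
    };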


Generic lock-free queue is very hard to write (recap)
  • As noted above, a generic lock-free queue is not always faster than a well-written non-thread-safe queue with a spinlock
  • But a specialized lock-free queue often is!


What exactly is a thread-safe queue, anyway?
(Examine the interface!)
  • Each member function is thread-safe
  • A thread-safe data structure can, at most, guarantee that each member function call is a transaction
    — Whether lock-free or guarded by a lock
  • That transaction must be useful to the caller
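To make this concrete, a sketch (my example, in the spirit of the talk's point): with an STL-style interface, front() and pop() may each be thread-safe, yet the pair of calls is not one transaction, since another consumer can pop between them. A useful thread-safe queue merges them into a single member function:

    #include <mutex>
    #include <queue>

    template <typename T>
    class ThreadSafeQueue {
        std::queue<T> q;
        std::mutex m;
    public:
        void push(const T& value) {
            std::lock_guard<std::mutex> lock(m);
            q.push(value);
        }
        // One useful transaction: check-empty, read, and remove together.
        bool try_pop(T& out) {
            std::lock_guard<std::mutex> lock(m);
            if (q.empty())
                return false;
            out = q.front();
            q.pop();
            return true;
        }
    };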
