Ataraxia through Epoché: [C++11][NOTE] About memory barrier

Reference:

Memory Barriers Are Like Source Control Operations
Why Memory Barrier？
CPU Cache Flushing Fallacy
Memory Barriers/Fences
The SC- Memory Model for Java

synchronizes-with implies happens-before

Each cache-line can be in 1 of the 5 following states:
Modified:
Indicates the cache-line is dirty and must be written back to memory at a later stage.
When written back to main-memory the state transitions to Exclusive.

Exclusive:
Indicates the cache-line is held exclusively and that it matches main-memory.
When written to, the state then transitions to Modified.
To achieve this state a Read-For-Ownership (RFO) message is sent which
involves a read plus an invalidate broadcast to all other copies.

Shared:
Indicates a clean copy of a cache-line that matches main-memory.

Invalid:
Indicates an unused cache-line.

Forward:
Indicates a specialised version of the shared state
i.e. this is the designated cache which should respond to other caches in a NUMA system.

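Modified state:
The cache-line holds modified data. That data does not appear in any other CPU's cache,
and the copy in this CPU's cache is the most recent one.

Exclusive state:
Similar to Modified, but the data has not been changed, so the copy in memory is up to date.
If the line is evicted from the cache, it does not need to be written back to memory.

Shared state:
The line may be duplicated in other CPUs' caches. The CPU must consult the other CPUs before it can write to this cache-line.

Invalid state: the cache-line is empty.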

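read:
Requests the data at a given address. The cache-line in the CPU that answers the read is marked Shared.

read response:
Carries the data requested by a read. It is supplied either by memory or by another cache.

invalidate:
Sent before storing to a value that is held in the Shared state, to every CPU whose cache holds that line.
Those CPUs mark the line Invalid and queue the request.

invalidate ack:
A CPU that receives an invalidate message replies to the sender with an ack
and places the invalidation in its invalidate queue.

read invalidate:
When a CPU is about to modify data that also sits in other CPUs' cache-lines,
it first sends a read invalidate message to those CPUs so that they
stop reading their (soon to be stale) copies.

writeback:
Carries the address and the data to be written back. It writes lines in the Modified state back to memory to make room for other data.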

--------
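#StoreStore
A StoreStore barrier prevents stores issued before the barrier from being reordered after it.
The store is written into the store buffer and an invalidate request is sent out.
At that point the invalidate ack has not arrived yet, so the value has not been flushed to the cache,
and stores after the barrier cannot be executed either! The store buffer is small, so it has to wait
for the invalidate ack before it can flush and free up space.

When another core wants to read the value, it has already received the invalidate request in its invalidate queue,
so it will not read the old value once that queue has been processed.
The barrier does not guarantee that stores before it have already been flushed to cache/memory.
Once the invalidate ack arrives, the value is flushed from the store buffer to the cache.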

--------
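#LoadLoad
A LoadLoad barrier prevents loads after the barrier from being performed before it.
Because the invalidate queue is processed, loads after the barrier do not read a stale cached value.
They do not necessarily read the newest value, though: the invalidate queue tells the core to fetch the line
from the writing core's cache, but the new value may still be in that core's store buffer
and not yet flushed to its cache-line. So the newest value is not guaranteed.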

--------
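#LoadStore:
When the CPU wants to load a value and the next store is unrelated to that load,
the store may be reordered ahead of the load.
The store may be a cache hit while the load is a cache miss, so this reordering improves efficiency.
A LoadStore barrier prevents this reordering.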

--------
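#StoreLoad
A StoreLoad barrier makes all stores before it visible to other cores.
A StoreLoad barrier makes all loads after it see the latest values at that point.
A StoreLoad barrier prevents every store before it and every load after it from being reordered across the barrier.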
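The StoreLoad case is the one the classic store-buffering litmus test exercises. Below is a minimal sketch, assuming default (seq_cst) C++11 atomics stand in for the StoreLoad barrier; the function name `count_both_zero` is made up for this note. Each thread stores to its own flag and then loads the other one. With seq_cst, at least one thread must see the other's store, so the function should always return 0. With release/acquire or relaxed orderings the "both loaded 0" outcome is allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering litmus test (hypothetical helper for this note).
// Returns how many of `iterations` runs ended with both loads seeing 0.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1);      // seq_cst store
            r1 = y.load();   // seq_cst load, may not be performed before the store above
        });
        std::thread t2([&] {
            y.store(1);
            r2 = x.load();
        });
        t1.join();
        t2.join();
        assert(r1 != -1 && r2 != -1);  // both threads ran
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```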

--------
--------
--------

Don't write a race condition or use non-default atomics, and your code will do what you think.
Sequential consistency for data-race-free programs (SC-DRF) is the C++11/C11 default (unless you use relaxed atomics), as long as you don't write a race condition.
Race condition: a memory location is accessed by two threads at the same time and at least one of them is a writer.
The compiler must treat an opaque function call as a _FULL_ barrier: no
code before or after it may be moved across it.
Acquire and release come as a package deal.

atomic: acquire
atomic: release
--------
thread 1 => release: write from the store buffer to the cache-line after the
invalidate acks have arrived (write memory barrier, WMB)
thread 2 => acquire: process the invalidate queue, invalidate the stale value, then read the value from the other CPU core (read memory barrier, RMB)
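The two-thread picture above is the usual message-passing pattern. Here is a minimal sketch, assuming a plain `int` payload published through an atomic flag; the names `payload`, `ready`, `producer`, `consumer` and `run_message_passing` are made up for this note.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // release: the write above cannot sink below
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {  // acquire: the read below cannot hoist above
        std::this_thread::yield();
    }
    return payload;  // the release/acquire pair makes the write of 42 visible here
}

int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread t2([&] { seen = consumer(); });
    std::thread t1(producer);
    t1.join();
    t2.join();
    assert(ready.load());
    return seen;
}
```

If only one side of the pair is used (a release store read by a relaxed load, or the reverse), the guarantee is gone, which is why the two are a package deal.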

A special operation, compare-and-swap (CAS), is written in C++11 as:

compare_exchange_*
Am I the only one who gets to change val from expected to desired?
Often written in a loop: the CAS loop
Prefer _weak when writing a CAS loop
Almost always want _strong when doing a single test
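A minimal sketch of both uses, assuming a "store the maximum" update for the loop and a "claim a slot" check for the single test. The helper names (`atomic_store_max`, `try_claim`, `max_of_threads`) and the convention that 0 means "unowned" are made up for this note.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// CAS loop: atomically replace `target` with max(target, candidate).
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int expected = target.load();
    // On failure (including a spurious one), compare_exchange_weak reloads
    // `expected` with the current value, so the loop simply retries.
    while (expected < candidate &&
           !target.compare_exchange_weak(expected, candidate)) {
    }
}

// Single test: claim the slot only if nobody owns it yet.
// _strong never fails spuriously, so no retry loop is needed.
bool try_claim(std::atomic<int>& owner, int my_id) {
    assert(my_id != 0);  // 0 is reserved for "unowned" in this sketch
    int expected = 0;
    return owner.compare_exchange_strong(expected, my_id);
}

int max_of_threads(int n_threads) {
    std::atomic<int> best{0};
    std::vector<std::thread> workers;
    for (int id = 1; id <= n_threads; ++id)
        workers.emplace_back([&best, id] { atomic_store_max(best, id * 10); });
    for (auto& w : workers) w.join();
    return best.load();
}
```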

Full MB (fence): nothing can be reordered across it
Acquire-Release: code outside the region can move in, but code inside cannot move out
Release barrier: nothing above it may move down across the barrier
Acquire barrier: nothing below it may move up across the barrier
! A standalone MB also prevents the optimization of moving local variables across it
volatile has nothing to do with MB (http://en.cppreference.com/w/cpp/language/cv)
(http://stackoverflow.com/a/4437555)

volatile object - an object whose type is volatile-qualified, or a subobject of a volatile object, or a mutable subobject of a const-volatile object. Every access (read or write operation, member function call, etc.) on the volatile object is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be reordered or optimized out. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to refer to a volatile object through a non-volatile glvalue (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.
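The signal-handler use mentioned in the quote can be sketched as follows. This is a minimal illustration, assuming SIGINT as the signal and raising it from the same thread; the names `got_signal`, `on_signal` and `raise_and_check` are made up for this note. For communication between threads, std::atomic is the tool, not volatile.

```cpp
#include <cassert>
#include <csignal>

// volatile std::sig_atomic_t is the type intended for handler/main communication.
volatile std::sig_atomic_t got_signal = 0;

extern "C" void on_signal(int) { got_signal = 1; }

// Installs the handler, raises SIGINT in this thread, and returns the flag.
int raise_and_check() {
    got_signal = 0;
    auto previous = std::signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    std::raise(SIGINT);            // the handler runs before raise returns
    std::signal(SIGINT, previous); // restore whatever was installed before
    return got_signal;
}
```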

if global var is atomic, moving across MB is possible.
Each global variable should be protected by a mutex {M}; however, that is not enough. Make sure that every write to that variable's memory location is also protected by the same mutex {M}.
Beware of padding.

struct {int c:9;int d:7;};
adjacent bit-fields are _one_ object (one memory location).

prevent deadlock by breaking at least one of the four conditions it needs:

Mutual Exclusion
Hold and Wait or Resource Holding
No Preemption
Circular Wait
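One common way to break these conditions in C++11 is to let std::lock acquire all the mutexes a function needs in one call, since it uses a deadlock-avoidance algorithm regardless of argument order. A minimal sketch, assuming a hypothetical `Account` type and two threads that transfer in opposite directions:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {           // hypothetical type for this note
    std::mutex m;
    int balance;
    explicit Account(int b) : balance(b) {}
};

// Locks both accounts together, then adopts the locks into guards.
void transfer(Account& from, Account& to, int amount) {
    assert(amount > 0);
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads lock the same pair in opposite argument order.
int total_after_opposing_transfers() {
    Account a(1000), b(1000);
    std::thread t1([&] { for (int i = 0; i < 10000; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 10000; ++i) transfer(b, a, 2); });
    t1.join();
    t2.join();
    return a.balance + b.balance;  // money is conserved
}
```

Taking `from.m` and then `to.m` with two separate lock() calls would allow the circular wait: t1 holds a and waits for b while t2 holds b and waits for a.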

conditional write:

if(xxx) x=10;

Conditional lock:

your code conditionally takes a lock, but your system has a _bug_ that changes a conditional write to be unconditional

replace one function having a doOptionalWork flag with 2 functions

one function always takes a lock and does the x-related work
one function _never_ takes a lock _or_ touches x.

take a lock for any variables you mention anywhere in a region of code

even if updates are conditional, and by SC reasoning you could believe you won't reach that code on some paths and so won't need the lock
Well, this is pretty useless if you have nested library calls.

**
A function call can also work as a barrier that prevents the code sequence from being changed.
**

The role of a release fence, as defined by the C++11 standard,
is to prevent previous memory operations from moving past subsequent stores.
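A minimal sketch of a standalone release fence paired with an acquire fence, assuming relaxed operations on the flag so that all ordering comes from the fences. The names `shared_data`, `flag`, `writer`, `reader` and `run_fence_demo` are made up for this note.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int shared_data = 0;              // plain, non-atomic data
std::atomic<bool> flag{false};

void writer() {
    shared_data = 7;
    // Release fence: the write above may not move past the store below.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(true, std::memory_order_relaxed);
}

int reader() {
    while (!flag.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: the read below may not move before the load above.
    std::atomic_thread_fence(std::memory_order_acquire);
    return shared_data;
}

int run_fence_demo() {
    shared_data = 0;
    flag.store(false);
    int seen = -1;
    std::thread t2([&] { seen = reader(); });
    std::thread t1(writer);
    t1.join();
    t2.join();
    assert(flag.load());
    return seen;
}
```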

C11/C++11:
Adjacent bitfields are ONE object (one memory location).

struct Test
{
int a : 9;
int b : 7;
} stest;

stest.a = 1;
stest.b = 2;

Doing these two assignments in different threads without protection is a data race.
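Two ways to make those writes safe, sketched under the assumption that either a single mutex guards the whole run of bit-fields, or a zero-width bit-field is used to split the run into separate memory locations. `TestSplit`, `sum_with_mutex` and `sum_with_split` are names made up for this note.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Test {
    int a : 9;
    int b : 7;   // shares one memory location with a
};

struct TestSplit {
    int a : 9;
    int : 0;     // zero-width bit-field: b starts a new memory location
    int b : 7;
};

// Option A: one mutex protects every bit-field in the run.
int sum_with_mutex() {
    Test stest{};
    std::mutex m;
    std::thread t1([&] { std::lock_guard<std::mutex> g(m); stest.a = 1; });
    std::thread t2([&] { std::lock_guard<std::mutex> g(m); stest.b = 2; });
    t1.join();
    t2.join();
    return stest.a + stest.b;
}

// Option B: a and b are distinct memory locations, so no lock is needed.
int sum_with_split() {
    assert(sizeof(TestSplit) >= sizeof(Test));  // the split costs space
    TestSplit s{};
    std::thread t1([&] { s.a = 1; });
    std::thread t2([&] { s.b = 2; });
    t1.join();
    t2.join();
    return s.a + s.b;
}
```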

A struct's size is a multiple of the alignment of its most strictly aligned member (usually its largest scalar type).
Reference:
The Lost Art of C Structure Packing
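A small sketch of the padding effect, assuming two hypothetical structs with the same members in a different order. Exact sizes are implementation-defined; on a typical x86-64 ABI `Loose` is 24 bytes and `Tight` is 16.

```cpp
#include <cassert>

struct Loose {   // char, padding up to the double, double, char, tail padding
    char c1;
    double d;
    char c2;
};

struct Tight {   // most strictly aligned member first: the chars share the tail
    double d;
    char c1;
    char c2;
};

// The size of a struct is always a multiple of its alignment.
static_assert(sizeof(Loose) % alignof(Loose) == 0, "size is a multiple of alignment");
static_assert(sizeof(Tight) % alignof(Tight) == 0, "size is a multiple of alignment");

// Reordering members never makes this pair larger.
bool tight_is_no_larger() {
    assert(alignof(Tight) == alignof(Loose));  // same members, same alignment
    return sizeof(Tight) <= sizeof(Loose);
}
```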

SC-DRF tag:
Reference:
C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 1 of 2
C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2

--------
Conditional locks:

Problem:
Your code conditionally takes a lock, but your system has a
bug that changes a conditional write to be unconditional.

Option 1:
In code like we’ve seen, replace one function having a
doOptionalWork flag with two functions (possibly overloaded):

One function always takes the lock and does the x-related work.

One function never takes the lock or touches x.

Option 2:
Pessimistically take a lock for any variables you mention
anywhere in a region of code.

Even if updates are conditional, and by SC reasoning you could believe
you won’t reach that code on some paths and so won’t need the lock.

This option is pretty useless if you have nested library calls.
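Option 1 can be sketched as below, assuming a single shared `x` guarded by one mutex. The function names (`doWorkWithX`, `doWorkWithoutX`, `read_x`) and the placeholder computation are made up for this note.

```cpp
#include <cassert>
#include <mutex>

std::mutex x_mutex;
int x = 0;   // shared; every access goes through x_mutex

// Always takes the lock and does the x-related work.
void doWorkWithX(int value) {
    std::lock_guard<std::mutex> guard(x_mutex);
    x = value;
}

// Never takes the lock and never touches x.
int doWorkWithoutX(int input) {
    assert(input >= 0);   // arbitrary precondition for this sketch
    return input * 2;     // placeholder for the work that does not involve x
}

// Reads x under the same lock.
int read_x() {
    std::lock_guard<std::mutex> guard(x_mutex);
    return x;
}
```

Because neither function contains a conditional write to x, there is no conditional write left that could be turned into an unconditional one outside the lock.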

! ld.acq and st.rel are a package deal ! => TEST!!

Reference:
memory_order

--------
X86_64:
load: an SC atomic load is a plain mov

• X_ Reads are not reordered with any reads.
• X_ Writes are not reordered with any writes [some exceptions]
• X_ Writes are not reordered with older reads.
• **Reads may be reordered with older writes [different locations].
• X_ Reads & writes not reordered with locked instructions [like xchg; ...].
-----------------------------------------------------------------------------------------
• **Reads cannot pass earlier LFENCE and MFENCE.
• **Writes cannot pass earlier LFENCE, SFENCE, and MFENCE.
• LFENCE cannot pass earlier reads.
• SFENCE cannot pass earlier writes.
• **MFENCE cannot pass earlier reads or writes.

store: an SC atomic store is an xchg, which is a full barrier (stronger than needed: we only need an SC release)
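In source form the mapping looks like the sketch below. The instruction names in the comments are what mainstream x86-64 compilers typically emit, not something the C++ standard promises, and the names `value`, `load_sc`, `store_sc`, `store_release` and `roundtrip` are made up for this note.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> value{0};

int load_sc() { return value.load(); }    // typically a plain mov

void store_sc(int v) { value.store(v); }  // typically xchg (or mov followed by mfence)

void store_release(int v) {               // typically a plain mov
    value.store(v, std::memory_order_release);
}

// Single-threaded round trip, just to show the calls.
int roundtrip(int v) {
    store_sc(v);
    assert(load_sc() == v);
    store_release(v + 1);
    return load_sc();
}
```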

Reference:
What does `std::kill_dependency` do, and why would I want to use it?

----------
GotW #95 Solution: Thread Safety and Synchronization

Dec 31, 2013
