Sep 16, 2018

[Concurrency] [C++][Go] Wrap up - 2018

Modified
  • This Core's cache line has the modified data.
  • The data in memory is stale, and no other Core's cache line holds a copy.
Exclusive
  • The data is not modified.
  • The data in memory is up to date.
  • If the Core has to evict the data inside the cache line,
    nothing needs to be written back to memory.
Shared
  • The data is shared between multiple Cores' cache lines.
  • If this Core wants to modify the data, it must first ask the other Cores to invalidate their copies.
Invalid
  • The cache line holds no valid data.

Operations:

Read
  • Read data from a cache line; ask the other Cores for the data if needed.
Read response
  • The data for a read request, either from memory or from another Core's cache.
Invalidate
  • Ask the other Cores to invalidate the cache lines holding this data.
Invalidate ack
  • The response to an invalidate request, acknowledging that the request has been queued.
Read invalidate
  • Read + Invalidate.
  • The sender gets both a read response and an invalidate ack.
Writeback
  • Write the data from a cache line back to memory.

Store buffer:

  • When a Core writes data that lives in another Core's cache line, waiting for the read-invalidate response is unnecessary: the data response isn't needed, since this Core is going to overwrite the data anyway.
  • Thus, this Core writes the new value into its store buffer first and moves on.
  • Thus, on a read the Core checks the store buffer with the highest priority: if the data exists in both the store buffer and the cache line, the store-buffer value wins.
  • However, the read-invalidate request to other Cores introduces an ordering problem.
    Consider data A in Core 1's cache and data B in Core 2's cache,
    where the updates to A and B are meant to have a happens-before relationship,
    i.e.
    Core 1: update data B (held by Core 2), then update data A (held by Core 1).
    Core 2: read data A, and if it is the new value, verify data B.
 
You can see the problem: the two updates become visible to Core 2 out of order.

Core 1 updates B; since B lives in Core 2's cache, the new value goes into Core 1's store buffer and a read-invalidate for B is sent to Core 2. Core 1 then updates A directly in its own cache line. Meanwhile, Core 2 reads A, receives the new value from Core 1, and goes on to verify B; at that moment the invalidate for B hasn't been processed yet, so Core 2 still sees the stale B.
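In C++ terms, the same failure can be sketched with relaxed atomics (a minimal sketch; the names and values are mine, and the stale read is possible, not guaranteed):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}; // the flag, held in Core 1's cache
std::atomic<int> B{0}; // the payload, held in Core 2's cache

void core1()
{
    B.store(38, std::memory_order_relaxed); // sits in Core 1's store buffer
    A.store(42, std::memory_order_relaxed); // hits Core 1's cache directly
}

void core2()
{
    while (A.load(std::memory_order_relaxed) != 42)
        ;                                            // sees the new A...
    assert(B.load(std::memory_order_relaxed) == 38); // ...but may see the old B
}

int main()
{
    std::thread T1{core1};
    std::thread T2{core2};
    T1.join();
    T2.join();
}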

Thus we need some wait mechanism, i.e. a memory barrier.

In the above case, we need
  • a WMB (write memory barrier) on Core 1,
  • an RMB (read memory barrier) on Core 2.
For Core 1, we do:
update data B, then WMB, then update data A.

For Core 2, we do:
read data A. If A is still the old value, the verification of B doesn't happen. If A is the new value, then Core 1 has already updated B, so Core 2 SHOULD use the new B from Core 1; the invalidate for B is sitting in Core 2's invalidate queue.
But how do we force Core 2 to process its invalidate queue? Before verifying data B on Core 2, call RMB.
So the sequence for Core 2 becomes:
read data A, RMB, verify data B.
     
 
WMB:
  • After sending the read invalidate for B to Core 2, Core 1 waits: not until Core 2 sends back the invalidate ack will Core 1 flush data B from its store buffer to its cache line, and only then does the later update to A become visible.
RMB:
  • Process the invalidate queue: mark the data sitting in the invalidate queue invalid in Core 2's cache lines.
  • When Core 2 then tries to read B, the cache line is invalid, so it sends a 'read' request to Core 1 for the new B.

Take away:
  • WMB should be issued AFTER the writes to the shared data.
  • RMB should be issued BEFORE the reads of the shared data.
     
Reference:
  1. Memory Barriers: a Hardware View for Software Hackers
  2. https://en.cppreference.com/w/cpp/atomic/memory_order
  3. https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence

Code:

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

using namespace std;
atomic<int> A{0};
atomic<int> B{0};

void t1()
{
    this_thread::sleep_for(1s);
    B.store(38, memory_order_relaxed);

    int a = 42;
    // Acts as the WMB: B.store and the write to 'a' above
    // cannot be reordered past this release store.
    A.store(a, memory_order_release);
}

void t2()
{
    // Acts as the RMB: pairs with the release store in t1.
    while (A.load(memory_order_acquire) != 42) {
        cout << "in while" << endl;
        // Prints 0 (if B's store isn't visible yet) or 38.
        cout << B.load(memory_order_relaxed) << endl;
    }
    cout << "out while" << endl;
    // Always prints 38: the acquire load saw A == 42,
    // so B's store is guaranteed to be visible.
    cout << B << endl;
}
int main()
{
    thread T1{t1};
    thread T2{t2};
    T1.join();
    T2.join();
}
Fence:
#include <atomic>
#include <iostream>
#include <thread>
using namespace std;
atomic<int> A{0};
atomic<int> B{0};

void t1()
{
    A.store(42, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    // B.store can NEVER be reordered before A.store (release fence).
    B.store(38, memory_order_relaxed);
}

void t2()
{
    // Spin until t1's store to B becomes visible.
    while (B.load(memory_order_relaxed) != 38) {
        cout << "in while loop\n";
    }
    // Acquire fence: pairs with the release fence in t1;
    // without it, seeing B == 38 would NOT guarantee A == 42.
    atomic_thread_fence(memory_order_acquire);
    cout << "out while loop\n";
    cout << A << endl; // Must be 42.
}

int main()
{
    thread T1{t1};
    thread T2{t2};
    T1.join();
    T2.join();
}


Fence:

Release/Store fence
  • Prevents all preceding memory operations (reads and writes) from being reordered past any write that follows the fence.
Acquire/Load fence
  • Prevents all following memory operations (reads and writes) from being reordered before any read that precedes the fence.

Note: a memory fence is NOT itself an acquire or release operation; it constrains the operations around it, and is stronger than the corresponding acquire/release operation.


Operation:

Release operation: store
  • Cannot be reordered by the compiler.
  • Prevents preceding memory operations from being reordered past itself.
    i.e. operations AFTER the store may still be reordered before it.
  • Any read or write that precedes it in program order,
    including memory_order_relaxed ones, can't be reordered past it.

Acquire operation: load
  • https://en.cppreference.com/w/cpp/atomic/atomic_load
  • Cannot be reordered by the compiler.
  • Any read or write that follows it in program order,
    including memory_order_relaxed ones, can't be reordered before it.
  • Operations BEFORE the load may be reordered past the load; not as strong as a fence.
BUT with different variables (objects):
A release operation followed by an acquire operation CAN be reordered.
(An acquire operation followed by a release operation can NOT be reordered: the acquire keeps the later release from moving above it, and the release keeps the earlier acquire from moving below it.)
i.e
    A.store(1, std::memory_order_release);
    int b = B.load(std::memory_order_acquire);
=>
    int b = B.load(std::memory_order_acquire);
    A.store(1, std::memory_order_release);

However, keep in mind that even though such reordering of release/acquire operations (on different variables/objects) is allowed, the standard also says:
http://eel.is/c++draft/intro.multithread#intro.progress-18
"An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time."
Reference:
https://stackoverflow.com/questions/8819095/concurrency-atomic-and-volatile-in-c11-memory-model/8833218#8833218

i.e.
If a store operation is followed by a for/while loop of load operations, the compiler shouldn't reorder the loads before the store (i.e. sink the store below the loop): it cannot prove that the loop is finite, and the store's result must become visible to other threads in a finite amount of time.
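A minimal sketch of that situation, in the spirit of the Preshing article in reference 1 below (names are mine): each release store is followed by a potentially infinite acquire loop on the other variable. If the compiler were allowed to sink either store below its loop, both threads could spin forever; the finite-time visibility recommendation is what forbids that.

#include <atomic>
#include <thread>

std::atomic<int> X{0};
std::atomic<int> Y{0};

void thread1()
{
    X.store(1, std::memory_order_release); // must become visible in finite time
    while (Y.load(std::memory_order_acquire) == 0)
        ; // spin until thread2's store is visible
}

void thread2()
{
    Y.store(1, std::memory_order_release); // must become visible in finite time
    while (X.load(std::memory_order_acquire) == 0)
        ; // spin until thread1's store is visible
}

int main()
{
    std::thread T1{thread1};
    std::thread T2{thread2};
    T1.join();
    T2.join();
}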
   
Reference:
  1. [Preshing] Can Reordering of Release/Acquire Operations Introduce Deadlock?
  2. [Bruce Dawson] In Praise of Idleness
  3. [Golang bug list] cmd/compile: go1.8 regression: sync/atomic loop elided #19182

Hardware Memory Model:

Weak memory model
  • ARM / PowerPC
Strong memory model
  • x86/64
  • Close to, but not exactly, Sequential Consistency (SC itself is a software-level model).

Reordering:

Sequentially Consistent types:
  • Java: volatile variables
  • C++11: atomic (default)
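For example (a trivial sketch): in C++11, omitting the memory order argument gives the sequentially consistent default.

#include <atomic>

std::atomic<int> x{0};

int main()
{
    x.store(1);      // same as x.store(1, std::memory_order_seq_cst)
    return x.load(); // same as x.load(std::memory_order_seq_cst)
}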
Explicit Compiler Barriers:
gcc:
    asm volatile("" ::: "memory");
Macro:
    #define COMPILER_BARRIER() asm volatile("" ::: "memory")

source:
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/barrier.h#L22
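A small sketch of what the barrier buys you (the function and variable names are mine): without it, the compiler is free to merge the two stores and keep only the last one. Note this constrains the compiler only, not the hardware.

#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int flag = 0;

void publish()
{
    flag = 1;
    COMPILER_BARRIER(); // the compiler may not merge, elide, or move
                        // memory accesses across this point
    flag = 2;           // without the barrier, 'flag = 1' could be elided
}

int main()
{
    publish();
}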

Don't assume Sequential Consistency is SLOW; correctness is the highest priority.

Implied Compiler Barriers:
C++11:
  • Every non-relaxed atomic operation acts as a compiler barrier (see the sketch after this list).
    This concerns compiler reordering, not memory (hardware) reordering.
  • Every function containing a compiler barrier must act as a compiler barrier itself, even if the function is inlined.
  • The majority of function calls act as compiler barriers, whether they contain their own compiler barrier or not, because the callee may live in a different TU.
  • This does not include inline functions, functions declared with the pure attribute, or cases where link-time code generation is used.
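A sketch of the first bullet (variable names are mine): the non-relaxed atomic store below also acts as a compiler barrier, so the compiler cannot sink the plain store past it.

#include <atomic>

int data = 0;
std::atomic<int> ready{0};

void producer()
{
    data = 1;                                  // plain store
    ready.store(1, std::memory_order_release); // non-relaxed atomic operation:
                                               // also a compiler barrier, so the
                                               // compiler cannot sink 'data = 1'
                                               // below it
}

int main()
{
    producer();
}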
Reference:
  1. [Implications of pure and constant functions] https://lwn.net/Articles/285332/

What about Golang?

A goroutine acts as a 'user-space thread';
its stack is allocated on the heap by the Go runtime.

Inside a single goroutine, the compiler is allowed to re-arrange
expressions as long as the reordering does not change the behavior within that goroutine as defined by the language specification. 

Within a single goroutine, the happens-before order is the order expressed by the program.
i.e.
a := 42
if a == 42 {
    // Always true.
}

// b will always be 42.
b := a
// Other goroutines are NOT guaranteed to observe a == 42 before c == 38.
c := 38

Reads and writes of values larger than a single machine word behave as multiple machine-word-sized operations in an unspecified order.
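A sketch of the hazard (my example, and technically a data race, shown only to illustrate the point): on a 32-bit platform an unsynchronized int64 is accessed as two machine words, so a reader may observe a torn value.

package main

var x int64 // on a 32-bit platform, accesses are two word-sized operations

func writer() {
	for {
		x = 0
		x = -1 // the two halves are written in an unspecified order
	}
}

func main() {
	go writer()
	for {
		if v := x; v != 0 && v != -1 {
			panic("torn read observed") // possible on 32-bit platforms
		}
	}
}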


Initialization:
  • If a package p imports package q, the completion of q's init functions happens before the start of any of p's.
  • The start of the function main.main happens after all init functions have finished.
Goroutine creation:
  • The go statement that starts a new goroutine happens before the goroutine's execution begins.
  • i.e. the go statement acts like a barrier function call in C/C++: it won't be reordered with the statements around it (see the example below).
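The canonical example from the Go memory model (the Sleep is mine, just to keep the program alive for the sketch):

package main

import "time"

var a string

func f() {
	print(a) // guaranteed to print "hello, world"
}

func main() {
	a = "hello, world" // this write happens before f's execution begins
	go f()
	time.Sleep(time.Second) // crude wait, for the sketch only
}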
Goroutine destruction:
  • Can happen at any time; the exit of a goroutine is not guaranteed to happen before any event in the program.
  • Beware: if a goroutine updates a global value and does nothing else to synchronize (channels etc.), an aggressive compiler could delete the goroutine entirely as an optimization, so the update may never be observed at all. A sketch follows this list.
    https://github.com/golang/go/issues/19182
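A sketch of that hazard, in the spirit of the issue above (my example):

package main

var done bool

func main() {
	go func() {
		done = true // unsynchronized write: nothing guarantees it is observed
	}()
	for !done {
		// May spin forever: no synchronization forces this loop to see the
		// write, and an aggressive compiler may hoist the load of 'done'
		// out of the loop, or delete the goroutine entirely.
	}
}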

Channel communication:
  • A send on a channel happens before the corresponding receive from that channel completes.
  • The closing of a channel happens before a receive that returns a zero value because the channel is closed.
  • A receive from an un-buffered channel happens before the send on that channel completes.
  • The kth receive on a channel with capacity C happens before the k+Cth send on that channel completes.
i.e. the famous:
"Do not communicate by sharing memory; instead, share memory by communicating."
No more explicit memory store/load ordering from the programmer's point of view:
the channel acts as the sequential-consistency/memory-fence contract.
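A minimal sketch of the first rule (my example): the send on done happens before the receive completes, so after <-done the main goroutine is guaranteed to see the write to msg.

package main

import "fmt"

var msg string

func main() {
	done := make(chan struct{})
	go func() {
		msg = "hello"      // this write...
		done <- struct{}{} // ...happens before the receive below completes
	}()
	<-done
	fmt.Println(msg) // guaranteed to print "hello"
}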


Reference:
[Preventing stack guard-page hopping] https://lwn.net/Articles/725832/
