Eben Freeman has given a great talk about Golang's memory management.
These are notes for the talk, plus some knowledge add-ons.
Reference:
GopherCon 2018 - Allocator Wrestling https://about.sourcegraph.com/go/gophercon-2018-allocator-wrestling
Knowledge from the C++ world:
While coding in Go, which has compiler-provided escape analysis,
prefer declaring type instances as local (automatic) variables, which avoids
the compiler putting them on the heap.
Don't make local variables escape unnecessarily by returning their addresses.
Returning by value is preferred if the type is small.
(If the type is large, its instance will be put on the heap anyhow.)
Stack-allocated locals are not managed by the GC :-)
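As an illustrative sketch (hypothetical names; you can confirm what escapes with `go build -gcflags="-m"`):

```go
package main

import "fmt"

type point struct{ x, y int }

// newPointValue returns by value: p can stay on the stack (no escape).
func newPointValue() point {
	p := point{x: 1, y: 2}
	return p
}

// newPointPtr returns the address of a local, so p escapes to the heap
// and becomes work for the GC.
func newPointPtr() *point {
	p := point{x: 1, y: 2}
	return &p
}

func main() {
	v := newPointValue()
	q := newPointPtr()
	fmt.Println(v.x+v.y, q.x+q.y)
}
```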
Allocator internals
Design goals
- Efficiently satisfy allocations of a given size, but avoid fragmentation
  Solution: allocate like-sized objects in blocks
- Avoid locking in the common case
  Solution: maintain per-CPU caches
- Efficiently reclaim freeable memory
  Solution: use bitmaps for metadata, run GC concurrently
Heap is divided into two levels of structure: arenas and spans.
(The idea is the same as a slab allocator.)
A global mheap struct keeps track of both of them:

type mheap struct {
    arenas  [1 << 22]*heapArena // Covering map of arena frames
    free    []mSpanList         // Lists of fully free spans
    central []mcentral          // Lists of in-use spans
    // plus much more
}
Arenas are coarse sections of the available address space (64MB on 64-bit archs).
We allocate OS memory in units of arenas.
For each arena, there's metadata in a heapArena struct:
https://github.com/golang/go/blob/master/src/runtime/mheap.go#L201
type mheap struct {
    arenas [1 << 22]*heapArena // Covering map of arena frames
    // ...
}

type heapArena struct {
    // page-granularity map to spans
    spans [pagesPerArena]*mspan
    // pointer/scalar bitmap (2 bits/word)
    bitmap [heapArenaBitmapBytes]byte
}
Span:
There are ~70 different mspan size classes.
https://github.com/golang/go/blob/master/src/runtime/mheap.go#L308
type mspan struct {
    startAddr uintptr
    npages    uintptr
    spanclass spanClass
    // allocated/free bitmap
    allocBits *gcBits
    // ...
}
Each P has an mcache holding a span of each size class.
https://github.com/golang/go/blob/master/src/runtime/mcache.go#L19
Ideally, allocations can be satisfied directly out of the mcache (thus they're fast).
In Go, a P is a scheduling context.
( Refer to Golang scheduler note: http://vsdmars.blogspot.com/2018/06/golang-go-scheduler.html )
Generally there's one P per core; each P runs on an M (an OS thread), and
at most one goroutine runs per P at a time (each P has its own run queue).
To allocate, we find the first free object in our cached mspan, then return its address.
Let's say we need 96 bytes: first we'd look in the mcache for the mspan with 96-byte objects,
then mark the first free slot in its allocation bitmap and return that slot's address.
This design means that most memory allocations are fast and require no locking.
There are 3 quick steps:
- Find a cached span with the right size ( mcache.mspan[sizeClass] )
- Find the next free object in the span
- If necessary, update the heap bitmap
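The three quick steps above can be sketched as a toy model (all type and field names here are illustrative, not the real runtime's):

```go
package main

import "fmt"

// Toy model of the mcache fast path: spans hold fixed-size objects,
// and allocation just scans the span's free bitmap.

const objectsPerSpan = 8

type span struct {
	elemSize  int
	allocBits [objectsPerSpan]bool // true = slot in use
}

type cache struct {
	spans map[int]*span // size class -> cached span
}

// alloc finds the first free object in the cached span for this size class.
func (c *cache) alloc(sizeClass int) (slot int, ok bool) {
	s := c.spans[sizeClass] // step 1: find the cached span of the right size
	for i := range s.allocBits {
		if !s.allocBits[i] { // step 2: find the next free object
			s.allocBits[i] = true // step 3: update the bitmap
			return i, true
		}
	}
	return 0, false // span full: the slow path would refill from mcentral
}

func main() {
	c := &cache{spans: map[int]*span{96: {elemSize: 96}}}
	a, _ := c.alloc(96)
	b, _ := c.alloc(96)
	fmt.Println(a, b) // successive slots: 0 1
}
```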
Garbage collection
Go uses a tricolor concurrent mark-sweep garbage collector.
Reference:
Modern garbage collection A look at the Go GC strategy https://blog.plan99.net/modern-garbage-collection-911ef4f8bd8e
Tri-Color Incremental Updating GC https://stackoverflow.com/questions/2364274/tri-color-incremental-updating-gc-does-it-need-to-scan-each-stack-twice
Golang’s Real-time GC in Theory and Practice https://making.pusher.com/golangs-real-time-gc-in-theory-and-practice/
- Initially, all objects are white. (Three colors exist: white, grey, and black.)
- GC starts by marking goroutine stacks and globals.
- When GC reaches an object, it marks it grey.
- When all of an object's referents are marked, GC marks the object black.
- At the end, objects are either white or black.
- White objects can then be swept and freed.
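The marking phase above can be sketched as a toy tricolor loop (illustrative types, not the runtime's actual implementation):

```go
package main

import "fmt"

type color int

const (
	white color = iota // unvisited: candidate for collection
	grey               // reached, referents not yet scanned
	black              // fully scanned
)

type node struct {
	c    color
	refs []*node
}

// mark runs tricolor marking from the given roots (stacks, globals).
func mark(roots []*node) {
	var greySet []*node
	for _, r := range roots {
		r.c = grey
		greySet = append(greySet, r)
	}
	for len(greySet) > 0 {
		n := greySet[len(greySet)-1]
		greySet = greySet[:len(greySet)-1]
		for _, ref := range n.refs {
			if ref.c == white { // reach a new object: grey it
				ref.c = grey
				greySet = append(greySet, ref)
			}
		}
		n.c = black // all referents marked: blacken
	}
}

func main() {
	garbage := &node{}
	live := &node{}
	root := &node{refs: []*node{live}}
	mark([]*node{root})
	// Whites (garbage) can be swept; blacks (root, live) survive.
	fmt.Println(root.c == black, live.c == black, garbage.c == white)
}
```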
However, GC runs concurrently with the program, which creates a hazard:

type S struct {
    p *int
}

func f(s *S) *int {
    r := s.p
    s.p = nil
    // What if GC runs at this point? If it has already scanned this
    // stack, it may never see the object again and could free it,
    // even though we are about to return a pointer to it.
    return r
}
Rough solution:
To prevent this, the compiler translates pointer writes into potential calls into the write barrier,
i.e., the referent is marked during a pointer assignment.
The write barrier is only turned on while GC is in the marking phase.
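Conceptually (a toy Dijkstra-style sketch with made-up names, not the runtime's actual code), the compiler rewrites a pointer store roughly like this:

```go
package main

import "fmt"

// Toy write barrier: while GC is marking, a pointer store also shades
// (greys) the new referent so the collector cannot miss it.

type object struct {
	marked bool
	ref    *object
}

var gcMarking = true // barrier enabled only during the mark phase

func shade(o *object) {
	if o != nil && !o.marked {
		o.marked = true // grey the object: it will be scanned
	}
}

// writePointer is what the compiler conceptually emits for `slot = val`.
func writePointer(slot **object, val *object) {
	if gcMarking {
		shade(val) // shade the object being stored
	}
	*slot = val
}

func main() {
	a, b := &object{}, &object{}
	writePointer(&a.ref, b)
	fmt.Println(b.marked) // true: b cannot be lost during this cycle
}
```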
quote:
Mark Setup - STW: When a collection starts, the first activity that must be performed is turning on the Write Barrier.
The purpose of the Write Barrier is to allow the collector to maintain data integrity on the heap during a collection since both the collector and application goroutines will be running concurrently.
In order to turn the Write Barrier on, every application goroutine running must be stopped.
This activity is usually very quick, within 10 to 30 microseconds on average.
That is, as long as the application goroutines are behaving properly.
Reference:
https://www.ardanlabs.com/blog/2018/12/garbage-collection-in-go-part1-semantics.html#:~:text=The%20purpose%20of%20the%20Write,goroutine%20running%20must%20be%20stopped.
That is to say:
Allocating a big buffer of scalars is much cheaper than allocating a big buffer of pointers, because the GC has to follow every pointer during marking.
Experiment with GC performance
1. Crude experimenting
- GOGC=off : disables the garbage collector.
- GODEBUG=sbrk=1 : replaces the entire allocator with a simple persistent allocator that gets big blocks from the OS and hands out successive slices of them as you allocate.
Problem with this approach:
Not useful for production code, since it is not reliable with large heap allocations.
2. Profiling
- Use pprof to check for runtime.mallocgc
- Use flamegraph viewer in the pprof web UI
- Or Linux perf if pprof isn't enabled in the binary.
Problem with this approach:
- Program might not be CPU-bound
- Allocation might not be on critical path
- Time in background marking (gcBgMarkWorker) can mislead (time spent here doesn't necessarily mean you have net slowdown)
3. go tool trace
$ curl localhost:6060/debug/pprof/trace?seconds=5 > trace.out
$ go tool trace trace.out
Coding suggestion
- Limit pointers
  The compiler will show which variables escape to the heap:
  $ go build -gcflags="-m -m"
  Software engineering 101: no matter what language you use, don't construct variables inside a loop.
  If you really, really need to, the type being constructed should not contain pointer fields, which helps the GC's reasoning.
- Allocate in batches (hand-make your own slab allocator)
- Try to recycle objects (e.g., sync.Pool)
  ( Refer to note for Concurrency in Go - Katherine Cox-Buday http://vsdmars.blogspot.com/2018/10/golangbook-concurrency-in-go-katherine.html )
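A minimal sync.Pool recycling sketch (the buffer size and handler are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool recycles 4KB scratch buffers instead of allocating a fresh
// one per request, reducing allocation rate and GC pressure.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 4096) },
}

func handle() int {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // return the buffer for reuse
	// ... use buf as scratch space ...
	return len(buf)
}

func main() {
	fmt.Println(handle()) // 4096
}
```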
Further reference:
Note for Tyler Treat's slides:
The scheduler tends to group goroutines holding the same sync.Mutex instance onto the same CPU core, avoiding the ping-pong effect.
Use defer only if there's more than one return path registering cleanup.
If there's a single return path, don't use defer, since defer is relatively _SLOW_.
Use DOD for designing types http://vsdmars.blogspot.com/search/label/design_dod
Using an interface is always slower than using the concrete type, due to the extra layer of indirection.
Convert []byte to string judiciously (each conversion copies the data).
Use sync.Pool when necessary.
“We generally don’t want sync/atomic to be used at all...Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations...”
—Ian Lance Taylor
Why? This is why: http://vsdmars.blogspot.com/search/label/cpp_concurrent
The Go race detector doesn’t protect you from doing dumb stuff.
Unsafe is, in fact, unsafe.
Struct layout can make a big difference. (padding http://vsdmars.blogspot.com/2018/09/golangc-padding.html & ping-pong effect with mutex lock)
Go makes concurrency easy enough to be dangerous.