Jan 27, 2019

[split stack] reading notes and references

Reference:
gccgo split stack implementation
  1. The stack can start splitting at any point.
  2. The stack size is recorded automatically at program startup
    and at each thread startup.
  3. The gold linker detects calls from split-stack code to non-split-stack
    code, and rewrites the function header to force a large stack segment to be allocated.
    Without the gold linker, a call from split-stack code to non-split-stack code just gets whatever is left of the current stack segment, which may not be large enough.
    (See the "Backward compatibility" section below.)


In the complex GCC ecosystem the linker is separate from the compiler,
so GCC can't assume that gold is available at all.
When building gccgo, point configure at gold explicitly:
--with-ld=/path/to/gold

The -fuse-ld=gold option is newer than gccgo.
Ian supposes it would be nice if:
* the GCC configure process checks whether -fuse-ld=gold works; if so:
  * -fuse-ld=gold is passed to the libgo configure/build
  * -fuse-ld=gold is used by default by the gccgo driver program



Reference:
Split Stacks in GCC




Obvious benefits

  • The memory usage of a typical multi-threaded program can decrease significantly, as each thread does not require a worst-case stack size.
  • It becomes possible to run millions of threads
    (either full NPTL threads or co-routines) in a 32-bit address space.
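
To put a rough number on the second point, the sketch below divides an address space by the stack reserved per thread. The 8 MiB and 4 KiB figures in the usage note are assumptions for illustration (a common default stack limit and a small split-stack segment), not values from the reference, and the calculation ignores address space taken by the kernel, code and heap.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the number of threads when every thread reserves
 * stack_bytes of address space up front. Illustrative only. */
uint64_t max_threads(uint64_t address_space_bytes, uint64_t stack_bytes)
{
    assert(stack_bytes > 0);
    return address_space_bytes / stack_bytes;
}
```

With a 4 GiB address space, max_threads(1ULL << 32, 8ULL << 20) gives 512, while max_threads(1ULL << 32, 4096) gives 1048576, which is where "millions of threads" becomes plausible.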




Basics explained

The stack will have a guaranteed zone which is always available.
Reference:
[LWN] Preventing stack guard-page hopping


The size of the guaranteed zone will be target specific.
It will include enough stack space to actually allocate more stack space.
Each function will have to verify that it has enough space in the current stack to execute.

The basic verification will be a comparison between the stack pointer and the current bottom of the stack plus the guaranteed zone size.
This will have to be the first operation in the function, and will also be target specific.

It must be fast, as it will be executed by each called function.

There are two cases to consider.
  1. For functions which require a stack frame smaller than the guaranteed zone, we can do a simple comparison between the stack pointer and the stack limit.
  2. For functions which require a larger stack frame, the comparison must also include the size of the stack frame.
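
The two cases can be sketched in portable C. This is only a model of the comparison, not the prologue GCC emits: the function name, the 256-byte GUARANTEED_ZONE and the use of plain integers for addresses are all assumptions made for illustration, and the stack is assumed to grow downward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical size of the guaranteed zone; the real value is target specific. */
#define GUARANTEED_ZONE ((uintptr_t)256)

/* Returns true when a function must ask for more stack before it runs.
 * sp is the current stack pointer, stack_bottom is the lowest usable
 * address of the current segment (downward-growing stack assumed). */
bool needs_more_stack(uintptr_t sp, uintptr_t stack_bottom, uintptr_t frame_size)
{
    assert(sp >= stack_bottom);
    uintptr_t limit = stack_bottom + GUARANTEED_ZONE;

    if (sp < limit)
        return true;                 /* already inside the guaranteed zone */
    if (frame_size < GUARANTEED_ZONE)
        return false;                /* case 1: sp >= limit is sufficient */
    return sp - limit < frame_size;  /* case 2: the frame itself must fit */
}
```

In case 1 the frame may eat into the guaranteed zone, which is why comparing against the limit alone is enough; in case 2 the frame size has to take part in the comparison.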




Design options

  1. Reserve a register to hold the bottom of the stack plus the guaranteed size. This will have to be a callee-saved register.
  2. Use a TLS (Thread Local Storage) variable. In the general case, in a shared library, this will require calling the __tls_get_addr function.
    Reference:
    How fast is thread local variable access on Linux
    (GOLD elf linker)
    http://gittup.org/cgi-bin/man/man2html?gold+1 

    That means __tls_get_addr itself will have to work without requiring any additional stack space.
    This is infeasible unless the whole system is compiled with split stacks.
    It would require the LD_BIND_NOW environment variable to be set, so that the __tls_get_addr function is resolved at program startup time.
    Even that is probably insufficient unless we can ensure that the space for the TLS variable is fully allocated.
    In general Ian doesn't think they can ensure this, because dlopen can cause a thread to require more space for TLS variables, and that space will be allocated on the first call to __tls_get_addr.
    Reference:
    http://man7.org/linux/man-pages/man8/ld.so.8.html
    LD_BIND_NOW (since glibc 2.1.1)
    If set to a nonempty string, causes the dynamic linker to
    resolve all symbols at program startup instead of deferring
    function call resolution to the point when they are first
    referenced.  This is useful when using a debugger.
  3. Have the stack always end at an N-bit boundary.
    E.g., if we always allocate stack segments as a multiple of 4K,
    then align each one so that the stack always ends at a 12-bit (4K) boundary.
    Then the amount of space remaining on the stack is SP & 0xfff.
  4. Introduce a new function call which handles the comparison of the stack pointer and the stack expansion.
  5. Reuse the stack protector support field.
    When using glibc each thread descriptor has a field used by the stack protector.
    Of course it is then not possible to use split stacks in conjunction with stack protector.
  6. At least on x86, arrange to allocate a new field in the TCB (thread control block) header accessible via %fs or %gs.
    This is probably the best solution, and it is the one implemented for i386 and x86_64.

Reference:
TCB Thread Control Block in linux kernel:
https://en.wikipedia.org/wiki/Thread_control_block
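
Options 3 and 6 can be modelled in a few lines of C. This is a sketch under assumptions, not the GCC implementation: remaining_by_mask only illustrates the SP & 0xfff arithmetic, and a _Thread_local variable stands in for the per-thread TCB field, which real generated code would read directly through %fs or %gs. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEGMENT_ALIGN ((uintptr_t)4096)

/* The mask trick only works when the alignment is a power of two. */
static_assert((SEGMENT_ALIGN & (SEGMENT_ALIGN - 1)) == 0,
              "SEGMENT_ALIGN must be a power of two");

/* Option 3: with the segment ending on a 4K boundary, the low 12 bits
 * of the stack pointer give the bytes left above that boundary. */
uintptr_t remaining_by_mask(uintptr_t sp)
{
    return sp & (SEGMENT_ALIGN - 1);
}

/* Option 6 (modelled): one stack-limit slot per thread. A thread-local
 * variable is used here as a portable stand-in for the TCB field. */
static _Thread_local uintptr_t stack_limit;

void set_stack_limit(uintptr_t limit)
{
    stack_limit = limit;
}

bool below_limit(uintptr_t sp)
{
    return sp < stack_limit;
}
```

The point of option 6 is that the check stays a single compare against a per-thread slot, without the function call that a general TLS access from a shared library can require.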



Expanding the stack

  • Expanding the stack requires allocating additional memory.
  • This additional memory will have to be allocated using only the stack space slop, i.e. the guaranteed zone.
  • All of the functions used to allocate additional stack space must be compiled to not use a split stack.
  • A new function attribute, no_split_stack will be introduced to mean that the stack should not be split.
  • Alternatively, it would also work to ensure that the stack is always large enough that these functions do not need to split the stack during the allocation call.
  • After expanding the stack, the function will copy any stack based parameters from the old stack to the new stack.
  • Fortunately, all C++ objects which require a copy or move constructor are implicitly passed by reference, so copying the parameters on the stack is OK.
  • For varargs functions, this is impossible in general, so we will compile varargs functions differently:
    they will use an argument pointer which is not necessarily based on the frame pointer.
  • For functions which return objects on the stack, the objects will be returned on the old stack. (RVO)
    This should normally happen automatically, as the initial hidden parameter will naturally point to the old stack.
  • When expanding the stack, the return address of the function will be rewritten to point to a function which will release the allocated stack block and restore the caller's stack pointer.
    Reference:
    http://vsdmars.blogspot.com/2017/11/assembly-note.html
  • The address of the old stack block, and the old stack pointer, will have been saved somewhere in the new stack block.
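
The bookkeeping described above (save the old block and old stack pointer in the new block, then restore them on release) can be sketched as a linked list of segments. This is a simplified model, not the libgcc code: the struct layout and function names are hypothetical, malloc stands in for whatever low-level allocator a real runtime would use, and the NO_SPLIT macro only shows where the no_split_stack attribute would go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__)
#define NO_SPLIT __attribute__((no_split_stack))
#else
#define NO_SPLIT
#endif

/* Header stored at the start of each newly allocated stack block. */
struct stack_segment {
    struct stack_segment *prev;  /* address of the old stack block */
    void *saved_sp;              /* old stack pointer to restore on release */
    size_t size;                 /* usable bytes following this header */
};

/* Allocate a new block and record where we came from. */
NO_SPLIT struct stack_segment *segment_push(struct stack_segment *prev,
                                            void *old_sp, size_t size)
{
    struct stack_segment *seg = malloc(sizeof *seg + size);
    if (seg == NULL)
        return NULL;
    seg->prev = prev;
    seg->saved_sp = old_sp;
    seg->size = size;
    return seg;
}

/* Release a block; returns the old block and hands back the old stack pointer. */
NO_SPLIT struct stack_segment *segment_pop(struct stack_segment *seg,
                                           void **restored_sp)
{
    assert(seg != NULL && restored_sp != NULL);
    struct stack_segment *prev = seg->prev;
    *restored_sp = seg->saved_sp;
    free(seg);
    return prev;
}
```

In the real scheme segment_pop corresponds to the function that the rewritten return address points to; here it is just called explicitly.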




Backward compatibility

We want to be able to use split stack programs on systems with pre-built libraries compiled without split stacks.
This means that we need to ensure that there is sufficient stack space before calling any such function.

Each object file compiled in split stack mode will be annotated to indicate that the functions use split stacks.

This should probably be annotated with a note but there is no general support for creating arbitrary notes in GNU as.

Therefore, each object file compiled in split stack mode will have an empty section with a special name: .note.GNU-split-stack

If an object file compiled in split stack mode includes some functions with the no_split_stack attribute, then the object file will also have a .note.GNU-no-split-stack section.

This will tell the linker that some functions may not have the expected split stack prologue.

When the linker links an executable or shared library, it will look for calls from split-stack code to non-split-stack code.

This will include calls to non-split-stack shared libraries
(thus, a program linked against a split-stack shared library may fail if at runtime the dynamic linker finds a non-split-stack shared library;
it might be desirable to use a new segment type to detect this situation).

For calls from split-stack code to non-split-stack code, the linker will change the initial instructions in the split-stack (caller) function.
This means that the linker will have to have special knowledge of the instructions that the compiler emits.
The effect of the changes will be to increase the required frame size by a number large enough to reasonably work for a non-split-stack function.
This will be a target dependent number; the default will be something like 64K.
Note that this large stack will be released when the split-stack function returns.
Note that Ian is disregarding the case of split-stack code in a shared library calling non-split-stack code in the main executable; that seems like an unlikely problem.
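
The effect of the linker's rewrite amounts to simple arithmetic on the frame size that the prologue checks for. The sketch below assumes the "something like 64K" default mentioned above; the function and macro names are hypothetical, and the real linker patches instructions rather than calling anything like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default reserve for non-split-stack callees ("something like 64K"). */
#define NON_SPLIT_RESERVE ((size_t)64 * 1024)

/* Frame size the prologue should check for, after the linker's adjustment. */
size_t required_frame_size(size_t frame_size, bool calls_non_split_code)
{
    assert(frame_size <= SIZE_MAX - NON_SPLIT_RESERVE);
    return calls_non_split_code ? frame_size + NON_SPLIT_RESERVE : frame_size;
}
```

For example, a 128-byte frame stays at 128 when it only calls split-stack code, and becomes 65664 when it calls into a non-split-stack library.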


Function pointers are a tricky case.
In general we don't know whether a function pointer points to split-stack code.
Therefore, all calls through a function pointer will be modified to call (or jump to) a special function __fnptr_morestack.
This will use a target specific function calling sequence, and will be implemented as though it were itself a function call instruction.
That is, all the parameters will be set up, and then the code will jump to __fnptr_morestack.
The __fnptr_morestack function takes two parameters: the function pointer to call, and the number of bytes of arguments pushed on the stack.
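
A rough C model of that interface is sketched below. Only the two parameters (the function pointer and the argument byte count) come from the notes; everything else is an assumption for illustration: the simulated stack counter, the conservative policy of always reserving the non-split-stack amount, and the 64K figure. The real routine would be target-specific code entered by a jump, not an ordinary C function.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed reserve for callees that may not be split-stack code. */
#define NON_SPLIT_RESERVE ((size_t)64 * 1024)

/* Simulated view of the current stack segment (hypothetical). */
struct sim_stack {
    size_t available;     /* bytes left in the current segment */
    unsigned grow_count;  /* how many times more stack was requested */
};

/* Model of __fnptr_morestack: since the target may be non-split-stack
 * code, make sure there is room for the arguments plus the reserve
 * before transferring control to it. */
void model_fnptr_morestack(struct sim_stack *st, void (*fn)(void), size_t arg_bytes)
{
    assert(st != NULL && fn != NULL);
    size_t needed = arg_bytes + NON_SPLIT_RESERVE;
    if (st->available < needed) {
        st->available = needed;  /* stands in for allocating a new segment */
        st->grow_count++;
    }
    fn();
}

/* Sample callee used to exercise the model. */
int callee_runs;
void sample_callee(void)
{
    callee_runs++;
}
```

The argument byte count matters because stack-based arguments have to be available on whichever segment the callee ends up running on.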
