Showing posts with label kernel_cgroups. Show all posts

Dec 24, 2018

[linux][namespace] wrap-up

Linux namespaces are a relatively new idea in the Linux world and are the
foundation of containers, and therefore of Kubernetes.

These designs and APIs are not covered in the TLPI book, yet they have become essential with the rise of distributed computing (serverless, lambda, or whatever fancy name you prefer).

This page serves as an index for further notes on Linux namespace ideas, design, and programming.
My current coding languages are C++ (modern), Golang, and Python.


Traditional process resource limit:

http://man7.org/linux/man-pages/man1/prlimit.1.html
$ prlimit --nofile=256 --nproc=512 --locks=32 /bin/bash
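
The same kind of limit can also be applied from code. Below is a minimal sketch in Python, assuming Linux and the standard `resource` module. The helper name `run_with_nofile_limit` and the `REPORT` one-liner are made up for illustration, and the requested value is clamped to the current hard limit so the sketch also works unprivileged:

```python
import resource
import subprocess
import sys


def run_with_nofile_limit(argv, limit=256):
    """Run argv with RLIMIT_NOFILE capped at `limit`, similar to `prlimit --nofile=N cmd`."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot raise its hard limit, so clamp to it.
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)

    def apply_limit():
        # Runs in the child between fork() and exec(); the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (limit, limit))

    result = subprocess.run(argv, preexec_fn=apply_limit,
                            capture_output=True, text=True, check=True)
    return limit, result.stdout


# Hypothetical child command: report the soft limit it inherited.
REPORT = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"

if __name__ == "__main__":
    applied, out = run_with_nofile_limit([sys.executable, "-c", REPORT])
    print("applied:", applied, "seen by child:", out.strip())
```

The limit is set in the child only, which mirrors how prlimit(1) launches a command with adjusted limits instead of changing the calling shell.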


ps shows PPID/SID:
$ ps -efj


Linux namespace directories for debugging:
/proc/*/ns/*
/proc/*/task/*/ns/*
/proc/self/ns  # caller's namespace information
/proc/sys/kernel/ns_last_pid # the last PID that was allocated in this PID namespace.
/run/netns/netns-name  # created network namespace
/run/netns/default  # default network namespace
/sys/fs/cgroup # cgroup information

Each namespace is identified by an integer ID, shown in the symlink target as type:[ID].


List all namespace IDs across all processes:
$ readlink /proc/*/task/*/ns/* | sort -u


List all namespaces of PID 1:
This can be used to find the default (initial) Linux namespaces.
$ readlink /proc/1/task/*/ns/* | sort -u
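
The readlink commands above can be reproduced programmatically. This is a minimal sketch, assuming a Linux /proc filesystem; `parse_ns_link` and `namespaces_of` are illustrative names, not part of any library. Entries that cannot be read (for example, another user's process) are skipped silently:

```python
import glob
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[<integer>]".
NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<id>\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[123]' into ('pid', 123)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("id"))


def namespaces_of(pid="self"):
    """Map each entry name in /proc/<pid>/ns to its integer namespace ID."""
    result = {}
    for path in glob.glob(f"/proc/{pid}/ns/*"):
        try:
            _ns_type, ns_id = parse_ns_link(os.readlink(path))
        except (OSError, ValueError):
            continue  # process went away, or permission denied
        result[os.path.basename(path)] = ns_id
    return result


if __name__ == "__main__":
    for name, ns_id in sorted(namespaces_of().items()):
        print(name, ns_id)
```

Two processes share a namespace of a given type exactly when the IDs reported for that entry are equal, so comparing `namespaces_of(1)` with `namespaces_of()` shows which namespaces the caller still shares with PID 1.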


Use bind mount to persist a linux namespace:
Reference:
https://unix.stackexchange.com/a/198591

$ mount --bind /proc/pid/ns/type /anywhere/you/want
Thus later on you could use nsenter(1)/unshare(1)(2)/setns(2) to
enter that namespace.


PID namespaces have some special behaviors worth noting:
  • A process's PID namespace is fixed when the process is created. Period.
    It CANNOT be changed, not even with 'setns'.
    'setns' (like 'unshare') only associates the children subsequently created by the caller with the new
    namespace, not the caller itself.
    (Since Linux 4.12, that new PID namespace is shown via the
    /proc/[pid]/ns/pid_for_children file.)
    Once the caller has called 'setns', all children it creates afterwards are put into the new PID namespace.
    A child's call to 'getppid(2)' returns 0, since it
    CANNOT observe PIDs outside its own PID namespace.
    Beware: a process may only move into descendant PID namespaces, never into ancestor namespaces (parent, grandparent, etc.).
    Changing PID namespaces is a one-way operation.
    Use ioctl_ns(2) to get the parent namespace information.
    code: https://github.com/verbalsaintmars/ns_show

    That is to say,
    the parent/child relationship between PID namespaces follows the same
    two-level design as Session/Process Group (Job), Process Group/Process, and Parent Process/Child Process.
  • A process in an ancestor PID namespace can send signals to the PID 1 of a descendant PID namespace, subject to the usual kill(2) permission checks, but a signal is delivered only if that PID 1 has installed a handler for it. (SIGKILL and SIGSTOP sent from an ancestor namespace are the exception: they are always delivered.)
  • Starting with Linux 3.4, the reboot(2) system call causes a signal to be sent to the namespace "init" process.
  • If the "init" process of a PID namespace terminates, the kernel
    terminates all of the processes in the namespace via a SIGKILL signal.
  • If the 'init' process (usually PID 1) of a PID namespace has terminated and a new process later tries to join that namespace, the fork fails with errno ENOMEM: 'fork: Cannot allocate memory'.
    (Once PID 1 dies, the kernel calls disable_pid_allocation(), which clears the 'PIDNS_HASH_ADDING' flag; any later alloc_pid() in that namespace then fails with ENOMEM.)
  • Thus, 'unshare' with or without -f behaves as follows:
    -f (use fork):
    --fork tells 'unshare' to fork and run the 'cmd' in the child, which becomes the first process in the new namespace (i.e. PID 1).

    without -f (use exec):
    the 'cmd' itself does not run in the new PID namespace, but the processes it forks do.
    Reference:
    https://stackoverflow.com/a/45973522 https://unix.stackexchange.com/a/393279
  • PID namespaces can be nested: every PID namespace has a parent, except for the initial ('default') PID namespace.
    Since Linux 3.7, the kernel limits the maximum nesting depth for PID namespaces to 32.
    A process can see (e.g., send signals with kill(2), set nice values with setpriority(2), etc.) only processes contained in its own PID namespace and in descendants of that namespace.
  • A call to getpid(2) always returns the PID associated with the
    namespace in which the process was created.
  • In current versions of Linux,
    CLONE_NEWPID can't be combined with CLONE_THREAD.
    Threads are required to be in the same PID namespace such that the threads in a process can send signals to each other.
    Similarly, it must be possible to see all of the threads of a
    process in the proc(5) filesystem.
  • A /proc filesystem shows (in the /proc/[pid] directories) only processes visible in the PID namespace of the process that performed the mount, even if the /proc filesystem is viewed from processes in other namespaces.
    That's why 'unshare' provides the '--mount-proc' option, which mounts a fresh /proc inside the newly created PID namespace, in a new mount namespace.
  • When a process ID is passed over a UNIX domain socket to a process in a different PID namespace, it is translated into the corresponding PID value in the receiving process's PID namespace.


How 'unshare' behaves with and without -f, explained:


The error occurs because the PID 1 process of the new namespace has exited.

After bash starts, it forks several sub-processes to do various things.
If you run unshare without -f, bash has the same PID as the current "unshare" process.
The current "unshare" process calls the unshare system call and creates a new PID namespace, but the "unshare" process itself is not in the new PID namespace.
This is the intended behavior of the Linux kernel: when process A creates a new PID namespace, A itself is not put into it; only the sub-processes of A are. So when you run:
$ unshare -p /bin/bash

The unshare process execs /bin/bash, and /bin/bash forks several sub-processes. The first sub-process of bash becomes PID 1 of the new namespace, and it exits after it completes its job.
So the PID 1 of the new namespace exits.

The PID 1 process has a special role:
it becomes the parent of all orphaned processes.
If the PID 1 process in the root namespace exits, the kernel panics.
If the PID 1 process in a sub-namespace exits, the kernel calls the disable_pid_allocation function, which clears the PIDNS_HASH_ADDING flag in that namespace.
When the kernel creates a new process, it calls the alloc_pid function to allocate a PID in the namespace; if the PIDNS_HASH_ADDING flag is not set, alloc_pid returns -ENOMEM. That's why you get the "Cannot allocate memory" error.

You can resolve this issue by using the '-f' option:
$ unshare -fp /bin/bash

If you run unshare with the '-f' option, unshare forks a new process after it creates the new PID namespace and runs /bin/bash in that new process. The new process is PID 1 of the new PID namespace.
bash then also forks several sub-processes to do some jobs.
As bash itself is PID 1 of the new PID namespace, its sub-processes can exit without any problem.
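
The sequence described above (unshare, first child becomes PID 1, PID 1 exits, later forks fail) can be tried out with a short sketch. It assumes Linux, calls unshare(2) through ctypes, and adds CLONE_NEWUSER so that an unprivileged user may be allowed to create the PID namespace; that extra flag and the names `try_fork` and `unshare_without_fork_demo` are assumptions made for this illustration. Everything runs in a throwaway helper process so the caller's own namespace state is untouched, and the function returns None when the environment does not permit unshare:

```python
import ctypes
import os

# Flag values from <linux/sched.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000


def try_fork(on_child):
    """Fork and run on_child() in the child. Return 0 on success, else the errno."""
    try:
        child = os.fork()
    except OSError as exc:
        return exc.errno
    if child == 0:
        try:
            on_child()
        finally:
            os._exit(0)
    os.waitpid(child, 0)
    return 0


def unshare_without_fork_demo():
    """Return (pid seen by first child, errno of second fork), or None if not permitted."""
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        os.close(read_fd)
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0:
                # First child: becomes the init of the new PID namespace, then exits.
                first = try_fork(
                    lambda: os.write(write_fd, f"{os.getpid()} ".encode()))
                if first == 0:
                    # Second child: the namespace has already lost its init.
                    second = try_fork(lambda: None)
                    os.write(write_fd, str(second).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        fields = pipe.read().split()
    os.waitpid(helper, 0)
    if len(fields) != 2:
        return None
    return int(fields[0]), int(fields[1])


if __name__ == "__main__":
    print(unshare_without_fork_demo())
```

On a kernel that behaves as described above, the first element should be 1 and the second should equal errno.ENOMEM. The helper plays the role of 'unshare' without -f: it stays in its original PID namespace and only its children land in the new one.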


Reference:
Namespaces in operation(lwn.net): https://lwn.net/Articles/531114/#series_index
Resource management: Linux kernel Namespaces and cgroups: http://www.haifux.org/lectures/299/netLec7.pdf
Control groups series by Neil Brown https://lwn.net/Articles/604609/

Oct 11, 2016

[systemd] design notes

Reference:
http://www.0pointer.de/blog/projects/systemd.html

For a fast and efficient boot-up, two things are crucial:

  • To start less.
  • And to start more in parallel.


Parallelizing Socket Services

  • Wouldn't it be great if we could get rid of the synchronization and serialization cost?
  • What are we waiting for?
    • Daemons wait until the socket on which the other daemon offers its services is ready for connections.
  • Solution
    • Create the listening sockets before we actually start the daemon,
      and then just pass the socket to it during exec().
    • That way, we can create all sockets for all daemons in one step in the init system, and then in a second step run all daemons at once.
    • If a service needs another and that one is not fully started up, that's completely OK:
      the connection is queued in the providing service and the client will potentially block on that single request.
    • But only that one client will block, and only on that one request.
    • Basically, the kernel socket buffers help us maximize parallelization, and the ordering and synchronization are done by the kernel, without any further management from userspace.
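
The idea of creating the listening socket first and handing it to the daemon at exec() time can be sketched in a few lines. This is an illustration only, not systemd's actual protocol: passing the descriptor number in argv[1] is a made-up convention for this example, and `DAEMON_CODE` and `socket_activation_demo` are hypothetical names. It assumes Linux and Python 3.7 or later:

```python
import os
import socket
import subprocess
import sys
import tempfile

# Stand-in "daemon": it inherits an already-listening socket and serves one request.
# Assumed convention for this sketch: the descriptor number arrives in argv[1].
DAEMON_CODE = r"""
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
with conn:
    conn.sendall(b"echo:" + conn.recv(1024))
"""


def socket_activation_demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Step 1: the "init system" creates the listening socket up front.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(os.path.join(tmp, "demo.sock"))
        listener.listen(8)

        # Step 2: a client connects and sends a request before any daemon runs.
        # The kernel queues the connection and buffers the data.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(os.path.join(tmp, "demo.sock"))
        client.sendall(b"hello")

        # Step 3: start the daemon and hand over the listening socket.
        fd = listener.fileno()
        daemon = subprocess.Popen(
            [sys.executable, "-c", DAEMON_CODE, str(fd)], pass_fds=[fd])
        listener.close()

        # The client only blocks here, on its own single request.
        reply = b"".join(iter(lambda: client.recv(1024), b""))
        client.close()
        daemon.wait()
        return reply


if __name__ == "__main__":
    print(socket_activation_demo())
```

The client never needs to know whether the daemon was already running: the ordering between "client connects" and "daemon starts" is absorbed by the kernel's listen queue and socket buffers.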


AF_UNIX
  • A good init system should start only what is needed, and that on-demand.
  • Either lazily or parallelized and in advance.
  • However it should not start more than necessary, particularly not everything installed that could use that service.


  • How can we keep track of processes, so that they cannot escape the babysitter, and that we can control them as one unit even if they fork a gazillion times?
  • Solution:
    • Control Groups (aka "cgroups").
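
To see which cgroup a process currently belongs to, /proc/[pid]/cgroup can be read. A forked child starts in its parent's cgroup, which is what lets a whole service be tracked as one unit. Below is a minimal sketch, assuming the 'hierarchy-ID:controller-list:cgroup-path' line format described in cgroups(7); the function names are illustrative:

```python
def parse_cgroup_line(line):
    """Parse one 'hierarchy-ID:controller-list:cgroup-path' line."""
    hierarchy, controllers, path = line.rstrip("\n").split(":", 2)
    # cgroup v2 entries have an empty controller list ("0::/some/path").
    names = [name for name in controllers.split(",") if name]
    return int(hierarchy), names, path


def cgroups_of(pid="self"):
    """Return the parsed cgroup memberships of a process, or [] if unavailable."""
    try:
        with open(f"/proc/{pid}/cgroup") as handle:
            return [parse_cgroup_line(line) for line in handle if line.strip()]
    except (OSError, ValueError):
        return []


if __name__ == "__main__":
    for hierarchy, controllers, path in cgroups_of():
        print(hierarchy, ",".join(controllers) or "(v2)", path)
```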

 

The system should have these kinds of units:

1. service: these are the most obvious kind of unit: daemons that can be started, stopped, restarted, reloaded.

2. socket: this unit encapsulates a socket in the file-system or on the Internet.