Ataraxia through Epoché: [linux][namespace] wrap-up

Linux namespaces are a relatively new idea in the Linux world and are the
foundation of containers, and therefore of Kubernetes.

Their design and API are not covered in the TLPI book, yet they have become essential with the rise of distributed computing (serverless, lambda, whatever fancy name you prefer).

This page serves as an index for further notes on Linux namespace ideas, design, and programming.
My current coding languages are C++ (modern), Golang, and Python.

Traditional process resource limit:
http://man7.org/linux/man-pages/man1/prlimit.1.html
$ prlimit --nofile=256 --nproc=512 --locks=32 /bin/bash
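
The same kind of limit can be applied from a program instead of the prlimit(1) command. Below is a minimal Python sketch, assuming only the standard 'resource' and 'subprocess' modules. The helper name run_with_nofile is made up for this note, and the value is clamped to the current hard limit because an unprivileged process cannot raise it.

```python
import resource
import subprocess


def run_with_nofile(limit, argv):
    """Run argv with RLIMIT_NOFILE set to `limit` (hypothetical helper).

    Roughly what `prlimit --nofile=<limit> <cmd>` does: the limit is applied
    in the child between fork and exec, so the caller's own limit is untouched.
    """
    def apply_limit():
        _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        # Never ask for more than the current hard limit allows.
        new = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new, new))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # The shell prints the soft limit it inherited.
    print(run_with_nofile(256, ["sh", "-c", "ulimit -n"]).stdout.strip())
```

Only --nofile is covered here; --nproc and --locks would map to resource.RLIMIT_NPROC and resource.RLIMIT_LOCKS in the same way.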

ps shows PPID/SID:
$ ps -efj

Linux namespace directories for debugging:
/proc/*/ns/*
/proc/*/task/*/ns/*
/proc/self/ns # caller's namespace information
/proc/sys/kernel/ns_last_pid # the last PID that was allocated in this PID namespace.
/run/netns/netns-name # created network namespace
/run/netns/default # default network namespace
/sys/fs/cgroup # cgroup information

Each namespace is identified by an integer ID (its inode number).

List the namespace IDs used by all processes:
$ readlink /proc/*/task/*/ns/* | sort -u

List the namespaces of PID 1 in the root namespace:
This can be used to find the default Linux namespaces.
$ readlink /proc/1/task/*/ns/* | sort -u
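
The readlink output has the form 'type:[inode]', for example 'pid:[4026531836]'. Here is a small Python sketch that reads the same links for one process and splits them into type and ID. The function names parse_ns_link and namespaces_of are my own, not part of any library.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<type>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Map each entry of /proc/<pid>/ns to its namespace ID.

    Returns an empty dict when the directory cannot be read.
    """
    ns_dir = f"/proc/{pid}/ns"
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return {}
    result = {}
    for name in names:
        try:
            _type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry: skip it
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name}: {inode}")
```

Two processes share a namespace exactly when the IDs for the same entry are equal.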

Use bind mount to persist a linux namespace:
Reference:
https://unix.stackexchange.com/a/198591

$ mount --bind /proc/pid/ns/type /anywhere/you/want
The bind mount keeps the namespace alive, so later on you can use nsenter(1) or setns(2) to
enter that namespace.
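
Entering such a persisted namespace from code comes down to open(2) plus setns(2). Below is a minimal sketch that calls setns through ctypes, assuming a libc that exports it. The helper name enter_namespace and the path in the usage line are hypothetical, and the call normally needs CAP_SYS_ADMIN.

```python
import ctypes
import os


def enter_namespace(path, nstype=0):
    """Join the namespace referred to by `path` (hypothetical helper).

    `path` is a /proc/<pid>/ns/<type> file or a bind mount of one.
    nstype=0 means "accept any namespace type".
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(path, os.O_RDONLY)
    try:
        if libc.setns(fd, nstype) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err), path)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Assumed example path: a network namespace persisted under /run/netns.
    enter_namespace("/run/netns/netns-name")
```

For a PID namespace the call only affects children created afterwards, as described in the PID namespace notes.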

The Linux PID namespace has some special behaviors worth noting:

A process's PID namespace is fixed when the process is created. Period.
It CANNOT be changed, not even with 'setns'.
'setns' (like 'unshare') only associates the children subsequently created by the caller with the new
PID namespace, not the caller itself.
(Since Linux 4.12, that new PID namespace is shown via the
/proc/[pid]/ns/pid_for_children file.)
Once the caller has called 'setns', all children it creates afterwards are placed in the new PID namespace.
A child's call to 'getppid(2)' returns 0, since it
CANNOT observe a PID outside its own PID namespace.
Beware: a process may only move its children into descendant PID namespaces, never into an ancestor namespace (parent, grandparent, etc.).
Changing PID namespaces is a one-way operation.
Use ioctl_ns(2) (the NS_GET_PARENT operation) to get the parent namespace information.
code: https://github.com/verbalsaintmars/ns_show
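
A rough Python take on the same lookup, independent of the repository above: it reads the namespace ID with stat(2) and asks for the parent PID namespace with the NS_GET_PARENT ioctl. The function names are my own. The ioctl fails (and the sketch returns None) when the parent lies outside the caller's own PID namespace scope or the kernel does not support the operation.

```python
import fcntl
import os

NS_GET_PARENT = 0xB702  # _IO(0xb7, 0x2) from <linux/nsfs.h>


def ns_id(path):
    """Return the inode number identifying the namespace behind `path`, or None."""
    try:
        return os.stat(path).st_ino
    except OSError:
        return None


def parent_pidns_id(path="/proc/self/ns/pid"):
    """Return the ID of the parent PID namespace, or None if it is not available."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None
    try:
        # On success the ioctl returns a new file descriptor for the parent.
        parent_fd = fcntl.ioctl(fd, NS_GET_PARENT)
    except OSError:
        return None
    finally:
        os.close(fd)
    try:
        return os.fstat(parent_fd).st_ino
    finally:
        os.close(parent_fd)


if __name__ == "__main__":
    print("pid namespace:     ", ns_id("/proc/self/ns/pid"))
    print("pid_for_children:  ", ns_id("/proc/self/ns/pid_for_children"))
    print("parent pid namespace:", parent_pidns_id())
```

When 'pid' and 'pid_for_children' report different IDs, the process has called unshare or setns and its future children will land in another PID namespace.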

That is to say,
the parent/child relationship between PID namespaces follows the same
two-level pattern as Session/Process Group (Job), Process Group/Process, and Parent Process/Child Process.
A process in an ancestor namespace can send signals to the PID 1 of a descendant PID namespace, subject to the usual kill(2) permission checks, but the signal is delivered only if that PID 1 has installed a handler for it (SIGKILL and SIGSTOP are the exception and are always delivered).
Starting with Linux 3.4, calling reboot(2) from inside a PID namespace other than the initial one causes a signal to be sent to that namespace's "init" process.
If the "init" process of a PID namespace terminates, the kernel
terminates all of the processes in the namespace via a SIGKILL signal.
If the 'init' process (usually PID 1) has terminated and a new process later tries to join that PID namespace, the fork(2) that would create it fails with errno ENOMEM ('fork: Cannot allocate memory').
(Once PID 1 dies, the kernel calls disable_pid_allocation(), which clears the PIDNS_HASH_ADDING flag; any later alloc_pid() in that namespace then fails with ENOMEM.)
Thus, 'unshare' behaves as follows with and without -f:
-f (fork, then exec):
--fork tells 'unshare' to fork before running 'cmd', so 'cmd' becomes the first process in the new namespace (i.e. PID 1).

without -f (exec only):
'cmd' itself does not run in the new PID namespace, but the processes it forks do.
Reference:
https://stackoverflow.com/a/45973522
https://unix.stackexchange.com/a/393279

PID namespaces can be nested: every PID namespace has a parent, except the initial ('default') PID namespace.
Since Linux 3.7, the kernel limits the maximum nesting depth for PID namespaces to 32.
A process can see (e.g., send signals with kill(2), set nice values with setpriority(2), etc.) only processes contained in its own PID namespace and in descendants of that namespace.
A call to getpid(2) always returns the PID associated with the
namespace in which the process was created.
In current versions of Linux,
CLONE_NEWPID can't be combined with CLONE_THREAD.
Threads are required to be in the same PID namespace such that the threads in a process can send signals to each other.
Similarly, it must be possible to see all of the threads of a
process in the proc(5) filesystem.
A /proc filesystem shows (in the /proc/[pid] directories) only processes visible in the PID namespace of the process that performed the mount, even if the /proc filesystem is viewed from processes in other namespaces.
That is why 'unshare' provides the '--mount-proc' option, which mounts a fresh /proc inside the newly created PID namespace, within a new mount namespace.
When a process ID is passed over a UNIX domain socket to a process in a different PID namespace, it is translated into the corresponding PID value in the receiving process's PID namespace.
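
The PID travels as an SCM_CREDENTIALS control message. Here is a minimal Python sketch of the receiving side, run with both ends in the same PID namespace, so no translation is visible and the receiver simply sees the sender's own PID. The function name is made up for this note. With the two ends in different PID namespaces the same code would report the translated value.

```python
import os
import socket
import struct


def sender_pid_as_seen_by_receiver():
    """Send one datagram over a UNIX socket pair and return the sender PID
    that the kernel attached to it (hypothetical helper)."""
    receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with receiver, sender:
        # Ask the kernel to attach the sender's credentials to each message.
        receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        sender.send(b"ping")
        ucred_size = struct.calcsize("iII")  # struct ucred: pid, uid, gid
        _data, ancdata, _flags, _addr = receiver.recvmsg(
            16, socket.CMSG_SPACE(ucred_size))
        for level, kind, payload in ancdata:
            if level == socket.SOL_SOCKET and kind == socket.SCM_CREDENTIALS:
                pid, _uid, _gid = struct.unpack("iII", payload[:ucred_size])
                return pid
    return None


if __name__ == "__main__":
    print("own PID:     ", os.getpid())
    print("received PID:", sender_pid_as_seen_by_receiver())
```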

The behavior of 'unshare' with and without -f, explained:

The error is caused by the PID 1 process of the new namespace exiting.

After bash starts running, it forks several sub-processes to do various things.
If you run unshare without -f, bash has the same PID as the current "unshare" process.
The current "unshare" process calls the unshare system call and creates a new PID namespace, but the "unshare" process itself is not in the new PID namespace.
This is the intended behavior of the Linux kernel: when process A creates a new PID namespace, A itself is not put into it; only the sub-processes of A are. So when you run:
$ unshare -p /bin/bash

The unshare process execs /bin/bash, and /bin/bash forks several sub-processes. The first sub-process of bash becomes PID 1 of the new namespace, and it exits as soon as it completes its job.
So the PID 1 of the new namespace exits.

The PID 1 process has a special role:
It becomes the parent of all orphaned processes.
If the PID 1 process in the root namespace exits, the kernel panics.
If the PID 1 process in a sub-namespace exits, the Linux kernel calls the disable_pid_allocation function, which clears the PIDNS_HASH_ADDING flag in that namespace.
When the Linux kernel creates a new process, it calls the alloc_pid function to allocate a PID in the namespace, and if the PIDNS_HASH_ADDING flag is not set, alloc_pid returns -ENOMEM. That is why you get the "Cannot allocate memory" error.
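
To make the sequence concrete, here is a toy model in Python. It is NOT kernel code: the class ToyPidNamespace and its methods are invented for illustration, and the 'adding' attribute merely stands in for the PIDNS_HASH_ADDING flag described above.

```python
import errno
import os


class ToyPidNamespace:
    """Toy model of PID allocation in one PID namespace (not kernel code)."""

    def __init__(self):
        self.adding = True   # stands in for PIDNS_HASH_ADDING
        self.last_pid = 0
        self.alive = set()

    def alloc_pid(self):
        """Model of alloc_pid(): fails with ENOMEM once allocation is disabled."""
        if not self.adding:
            raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))
        self.last_pid += 1
        self.alive.add(self.last_pid)
        return self.last_pid

    def exit(self, pid):
        """Model of a process exiting; PID 1 takes the whole namespace down."""
        self.alive.discard(pid)
        if pid == 1:
            self.adding = False  # models disable_pid_allocation()
            self.alive.clear()   # the remaining processes get SIGKILL


if __name__ == "__main__":
    # 'unshare -p /bin/bash': the first short-lived child of bash is PID 1.
    ns = ToyPidNamespace()
    first_child = ns.alloc_pid()
    ns.exit(first_child)
    try:
        ns.alloc_pid()
    except OSError as exc:
        print("fork:", exc.strerror)
```

In the model, as in the description above, every allocation after PID 1 has exited fails, no matter how many PIDs are free.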

You can resolve this issue by using the '-f' option:
$ unshare -fp /bin/bash

If you run unshare with the '-f' option, unshare forks a new process after it creates the new PID namespace, and runs /bin/bash in that new process. The new process is PID 1 of the new PID namespace.
Then bash forks several sub-processes to do its work.
As bash itself is PID 1 of the new PID namespace, its sub-processes can exit without any problem.
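
The unshare-then-fork sequence can be tried from Python as well. This is a sketch under assumptions, not what unshare(1) literally does: it calls unshare(2) through ctypes, and it adds CLONE_NEWUSER only so that root is not required (plain 'unshare -fp' does not create a user namespace). The function name is my own. Where the kernel or the container runtime refuses to create the namespaces, it returns None.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from <sched.h>
CLONE_NEWPID = 0x20000000   # from <sched.h>


def first_child_pid_in_new_pidns():
    """Return the PID that the first child forked after unshare(CLONE_NEWPID)
    sees for itself, or None if the namespaces could not be created."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        # Helper process: plays the role of the 'unshare' binary.
        status = 1
        try:
            os.close(read_fd)
            # Assumption: a user namespace is added to avoid needing root.
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0:
                child = os.fork()  # the step that '-f' adds
                if child == 0:
                    # First process created after unshare.
                    os.write(write_fd, str(os.getpid()).encode())
                    os._exit(0)
                os.waitpid(child, 0)
                status = 0
        except OSError:
            pass
        finally:
            os._exit(status)  # never fall back into the caller's code
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        data = pipe.read()
    os.waitpid(helper, 0)
    return int(data) if data else None


if __name__ == "__main__":
    print("PID seen by the first child:", first_child_pid_in_new_pidns())
```

The helper keeps its old PID, just like the 'unshare' process itself; only the forked child is inside the new PID namespace.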

Reference:
Namespaces in operation (lwn.net): https://lwn.net/Articles/531114/#series_index
Resource management: Linux kernel Namespaces and cgroups: http://www.haifux.org/lectures/299/netLec7.pdf
Control groups series by Neil Brown: https://lwn.net/Articles/604609/

Dec 24, 2018
