Showing posts with label design_graceful_shutdown.

Jul 9, 2020

[unix][programming] EINTR and What It Is Good For (http://250bpm.com/blog:12)

Reference:
EINTR and What It Is Good For, Martin Sústrik (ZeroMQ)


Before we dive in: this concept is covered well in Richard Stevens's UNIX Network Programming, Ch. 20.5, so Martin Sústrik's blog post can be considered a recap of the EINTR error.

Rule of thumb: 

When handling EINTR error, check any conditions that may have been altered by signal handlers.
Then restart the blocking function.

Additionally, if you are implementing a blocking function yourself, take care to return EINTR when you encounter a signal.
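As a minimal sketch of this rule of thumb (the wrapper name and the stop flag are my own illustration, not from the post), a retry wrapper around read() might look like:

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t stop = 0;   /* set by some signal handler */

/* Hypothetical wrapper applying the rule of thumb: on EINTR, first
   re-check any condition a signal handler may have altered, then
   restart the blocking call. */
static ssize_t read_checked(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t rc = read(fd, buf, len);
        if (rc >= 0 || errno != EINTR)
            return rc;              /* success, or a real error */
        if (stop) {                 /* condition altered by a handler */
            errno = ECANCELED;
            return -1;
        }
        /* interrupted, but not asked to stop: restart the call */
    }
}
```

The key point is that the restart happens only after the signal-visible state has been inspected.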

Beware of the two POSIX functions that don't honor EINTR.


Consider this code:
volatile sig_atomic_t stop = 0;

void handler (int signo)
{
    (void) signo;
    stop = 1;
}

void event_loop (int sock)
{
    signal (SIGINT, handler);

    while (1) {
        if (stop) {  // never reached while recv is blocked
            printf ("do cleanup\n");
            return;
        }
        char buf [1];
        recv (sock, buf, 1, 0);  // blocking call
        printf ("perform an action\n");
    }
}

The code above is the reason POSIX has the EINTR error.

Modify the code as follows.
Note that to make blocking functions like recv() return EINTR, on some operating systems you may have to install the handler with sigaction() and SA_RESTART cleared, instead of using signal().
volatile sig_atomic_t stop = 0;

void handler (int signo)
{
    (void) signo;
    stop = 1;
}

void event_loop (int sock)
{
    signal (SIGINT, handler);

    while (1) {
        if (stop) {
            printf ("do cleanup\n");
            return;
        }
        char buf [1];
        int rc = recv (sock, buf, 1, 0);
        if (rc == -1 && errno == EINTR)  // interrupted by a signal: re-check stop
            continue;
        printf ("perform an action\n");
    }
}


But this isn't a graceful shutdown yet.
When you press Ctrl+C, the program performs the clean-up and exits immediately; a graceful shutdown would exhaust the incoming messages before exiting.

The moral of this story is that the common advice to just restart the blocking function when EINTR is returned doesn't quite work:
volatile sig_atomic_t stop = 0;

void handler (int signo)
{
    (void) signo;
    stop = 1;
}

void event_loop (int sock)
{
    signal (SIGINT, handler);

    while (1) {
        if (stop) {
            printf ("do cleanup\n");
            return;
        }
        char buf [1];
        while (1) {
            // even when signaled with stop == 1 and no more data arrives,
            // we are stuck here...
            int rc = recv (sock, buf, 1, 0);
            // if interrupted by a signal, retry recv; otherwise a message
            // was consumed (or an error occurred), so break the inner loop
            if (rc == -1 && errno == EINTR)
                continue;
            break;
        }
        printf ("perform an action\n");
    }
}


Even EINTR is not completely watertight; check this code:
volatile sig_atomic_t stop = 0;

void handler (int signo)
{
    (void) signo;
    stop = 1;
}

void event_loop (int sock)
{
    signal (SIGINT, handler);

    while (1) {
        if (stop) {
            printf ("do cleanup\n");
            return;
        }

        /* What if the signal handler is executed at this point? */
        /* Pressing Ctrl+C a second time sorts the problem out. */

        char buf [1];
        // even with stop == 1 and no more data coming, we are stuck here...
        int rc = recv (sock, buf, 1, 0);
        if (rc == -1 && errno == EINTR)
            continue;
        printf ("perform an action\n");
    }
}



Ultimate solution

Use pselect(): block the signals before the call, and pass pselect() a mask that lets them be delivered only while it is waiting (if a signal occurs during the wait, pselect() returns with EINTR).

select
int select(int nfds,
           fd_set *readfds,
           fd_set *writefds,
           fd_set *exceptfds,
           struct timeval *timeout);
                    

nfds should be the highest-numbered descriptor plus one (an exclusive bound); this limits the kernel's linear scan of the descriptor sets.

Be sure to check the conditions under which a descriptor is considered ready for network FDs.

Notice that when an error occurs on a socket, select marks it as both readable and writable.

Although the timeval structure lets us specify a resolution in microseconds, the actual resolution supported by the kernel is often coarser.
Many Unix kernels round the timeout value up to a multiple of 10 ms. There is also scheduling latency involved: it takes some time after the timer expires before the kernel schedules the process to run.


void FD_ZERO(fd_set *fdset);
void FD_SET(int fd, fd_set *fdset);
void FD_CLR(int fd, fd_set *fdset);
int  FD_ISSET(int fd, fd_set *fdset);
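To illustrate the macros (FD_ISSET tests membership after select() returns), here is a small helper of my own that waits for readability with a millisecond timeout:

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait up to `ms` milliseconds for fd to become readable.
   Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int ms)
{
    fd_set rset;
    struct timeval tv;

    FD_ZERO(&rset);             /* clear the whole set */
    FD_SET(fd, &rset);          /* watch this descriptor */

    tv.tv_sec  = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    int n = select(fd + 1, &rset, NULL, NULL, &tv);
    if (n <= 0)
        return n;               /* 0: timeout, -1: error */
    return FD_ISSET(fd, &rset) ? 1 : 0;
}
```

Remember that select() may modify both the fd sets and (on Linux) the timeval, so both must be re-initialized before each call.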




pselect
int pselect(int nfds,
            fd_set *restrict readfds,
            fd_set *restrict writefds,
            fd_set *restrict errorfds,
            const struct timespec *restrict timeout,
            const sigset_t *restrict sigmask);

example:
// https://github.com/k84d/unpv13e/blob/master/bcast/dgclibcast4.c
#include "unp.h"

static void recvfrom_alarm(int);

void
dg_cli(FILE *fp, int sockfd, const SA *pservaddr, socklen_t servlen)
{
    int             n;
    const int       on = 1;
    char            sendline[MAXLINE], recvline[MAXLINE + 1];
    fd_set          rset;
    sigset_t        sigset_alrm, sigset_empty;
    socklen_t       len;
    struct sockaddr *preply_addr;

    preply_addr = Malloc(servlen);

    Setsockopt(sockfd, SOL_SOCKET, SO_BROADCAST, &on, sizeof(on));

    FD_ZERO(&rset);

    Sigemptyset(&sigset_empty);
    Sigemptyset(&sigset_alrm);
    Sigaddset(&sigset_alrm, SIGALRM);

    Signal(SIGALRM, recvfrom_alarm);

    while (Fgets(sendline, MAXLINE, fp) != NULL) {
        Sendto(sockfd, sendline, strlen(sendline), 0, pservaddr, servlen);

        Sigprocmask(SIG_BLOCK, &sigset_alrm, NULL);
        alarm(5);
        for ( ; ; ) {
            FD_SET(sockfd, &rset);
            n = pselect(sockfd + 1, &rset, NULL, NULL, NULL, &sigset_empty);
            if (n < 0) {
                if (errno == EINTR)
                    break;
                else
                    err_sys("pselect error");
            } else if (n != 1)
                err_sys("pselect error: returned %d", n);

            len = servlen;
            n = Recvfrom(sockfd, recvline, MAXLINE, 0, preply_addr, &len);
            recvline[n] = 0;    /* null terminate */
            printf("from %s: %s",
                   Sock_ntop_host(preply_addr, len), recvline);
        }
    }
    free(preply_addr);
}

static void
recvfrom_alarm(int signo)
{
    return;     /* just interrupt the recvfrom() */
}


poll / ppoll
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

// The signature with a timespec and a signal mask is Linux's ppoll(),
// the pselect()-style variant of poll():
int ppoll(struct pollfd *fds,
          nfds_t nfds,
          const struct timespec *tmo_p,
          const sigset_t *sigmask);
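Linux's ppoll() (a nonstandard extension, enabled with _GNU_SOURCE on glibc) gives poll() the same atomic-signal-mask behavior as pselect(); a minimal sketch of my own:

```c
#define _GNU_SOURCE             /* ppoll() is a Linux extension on glibc */
#include <poll.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Same idea as pselect(): wait for readability with `sigmask`
   installed atomically for the duration of the call.
   Returns ppoll()'s result: >0 ready, 0 timeout, -1 error/EINTR. */
static int wait_readable_masked(int fd, const struct timespec *tmo,
                                const sigset_t *sigmask)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return ppoll(&pfd, 1, tmo, sigmask);
}
```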
        

Apr 19, 2020

[Go] Best practice trivial update


  • Check errors from deferred calls.
    e.g
    https://github.com/thanos-io/thanos/blob/master/pkg/runutil/runutil.go#L138
  • For graceful shutdown, exhaust the reader before exiting.
    e.g
    https://github.com/thanos-io/thanos/blob/master/pkg/runutil/runutil.go#L148
  • No init(), no exported mutable globals (only const allowed).
  • Don't use panic (though I do rely on the panic from writing to a closed channel for graceful shutdown).
  • Always measure the results.
    https://vsdmars.blogspot.com/2016/01/c.html
  • Dereferencing a pointer and writing through it is slow (basic knowledge if coming from C/C++), because it is really a read plus a write, not a single write.
  • Pre-allocate slices and maps (same idea as in C++: growing a std::vector means a new allocation and a copy).
  • Reuse the memory already allocated for a slice:
    slice = slice[:0]
    A slice header is a small struct, similar in spirit to SSO (small string optimization) in std::string: a pointer to the underlying array, the current length, and the allocated size (i.e. capacity).
  • Blank identifier _ tip: use it to assert that a type implements an interface:
     var _ InterfaceA = (*TypeA)(nil)

Nov 20, 2019

[Go] High performance, design lessons from the fasthttp open-source project

While working on my stealth-mode library project, vact, I found several design decisions worth following from the fasthttp project.

Jot it down here.

Server:
https://github.com/valyala/fasthttp

Client:
https://godoc.org/github.com/valyala/fasthttp#Client
HostClient
https://godoc.org/github.com/valyala/fasthttp#HostClient
PipelineClient
https://godoc.org/github.com/valyala/fasthttp#PipelineClient
(has a head-of-line blocking issue because it uses pipelined requests)

Reference:
HTTP/1.1 pipelining
https://en.wikipedia.org/wiki/HTTP_pipelining
http/2.0
https://en.wikipedia.org/wiki/HTTP/2
QUIC
https://en.wikipedia.org/wiki/QUIC


Tips:
  1. Re-use object
    func fetchURLS(ch <-chan string) {
        var r Response
        for url := range ch {
            Get(url, &r)
            processResponse(&r)
            // reset response object, so it may be re-used
            r.Reset()
        }
    }
    
  2. Do not put object into pool if some header/tag is present.
    Close/re-cycle such connection/object instead.
  3. Server may stop responding, so goroutines calling Get() may
    pile up indefinitely.
    One way to detect it:
    (Check graceful shutdown: http://vsdmars.blogspot.com/2019/02/golangdesign-graceful-shutdown.html)
  4. func GetTimeout(url string, timeout time.Duration, resp *Response) {
        select {
        case concurrencyCh <- struct{}{}:
        default:
            // too many goroutines are blocked in Get(); return to the caller
            return
        }
        ch := make(chan struct{})

        go func() {
            Get(url, resp)
            <-concurrencyCh
            close(ch)
        }()

        select {
        case <-ch:
        case <-time.After(timeout):
        }
    }
  5. For every kind of network connection, set up a timeout!
    This applies to plain TCP connections as well.
  6. conn := dialHost(url)
    conn.SetReadDeadline(time.Now().Add(readTimeout))
    conn.SetWriteDeadline(time.Now().Add(writeTimeout))
    
  7. Should the connection be closed on a timed-out request?
    Only if you want to DoS the remote server; otherwise, let it time out.
  8. Pool size should ALWAYS be limited.
  9. After a connection spike, there can be leftover connections.
    Limit the connection lifetime.
  10. Each TCP dial triggers a DNS request.
    We could cache the host -> IP mapping for a period of time, or
    rely on the host's DNS cache.
  11. An HTTP client can be smart and dial round-robin across the IPs behind a
    single domain name.
  12. Each dial should have a timeout.

Feb 17, 2019

[structure concurrency][Go][design] Graceful Shutdown

Recently I've been working on an Actor design built on Golang's channels.

Link: https://github.com/vsdmars/actor

While bumped into Martin Sústrik's blog post:
Graceful Shutdown: http://250bpm.com/blog:146
Along with the discussion: https://trio.discourse.group/t/graceful-shutdown/93
Structured concurrency resources: https://trio.discourse.group/t/structured-concurrency-resources/21

It's a nice summary, touching on the well-known concept of pthread cancellation points as well as Golang's context.Done()/cancel() pattern.

Notes:
In-band cancel:
through an application-level channel

Out-of-band cancel:
through the language runtime

Sending a graceful shutdown request cannot possibly be a feature of the language. It must be done manually by the application.


POSIX C thread:
Reference:
https://stackoverflow.com/a/27374983
http://man7.org/linux/man-pages/man7/pthreads.7.html (Cancellation points)

The rules of hard cancellation are strict:
Once a coroutine has been hard-canceled the very next blocking call will immediately return ECANCELED.
In other words, every single blocking call is a cancellation point.

As an additional note, looking at the POSIX requirements for cancellation points, virtually all blocking interfaces are required to be cancellation points.
Otherwise, there would be no safe way to terminate a thread that is completely blocked in such a call.
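This point can be demonstrated directly: read() is a POSIX cancellation point, so a thread fully blocked in it can still be terminated with pthread_cancel() (the helper below is my own illustration):

```c
#include <pthread.h>
#include <unistd.h>

/* Thread blocks in read() on a pipe that never receives data.
   Because read() is a cancellation point, pthread_cancel() can
   terminate the thread even though it is fully blocked. */
static void *blocked_reader(void *arg)
{
    int fd = *(int *)arg;
    char c;
    read(fd, &c, 1);        /* blocks forever unless cancelled */
    return NULL;
}
```

After cancellation, pthread_join() reports PTHREAD_CANCELED as the thread's exit status.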

Graceful shutdown can terminate the coroutine only when it is idling and waiting for a new request.

Reasoning:
If we allowed graceful shutdown to terminate the coroutine while it is trying to send the reply, it would mean that a request could go unanswered.
And that doesn't deserve to be called "graceful shutdown".
(e.g. Golang, context.Done() pattern)


Definition summary:
Hard cancellation:
- Is triggered via an invisible communication channel created by the language runtime.
- It manifests itself inside the target coroutine as an error code (ECANCELED in libdill) or an exception (Cancelled in Trio).
- The error (or the exception) can be returned from any blocking call.
- In response to it, the coroutine is not expected to do any application-specific work. It should just exit.

Graceful shutdown:
- Is triggered via an application-specific channel.
- Manifests itself inside the target coroutine as a plain old message.
- The message may only be received at specific, application-defined points in the coroutine.
- In response to it, the coroutine can do an arbitrary amount of application-specific work.

Hard cancellation is fully managed by the language.
Graceful shutdown is fully managed by the application.

Golang library reference:
go-resiliency/deadline: https://github.com/eapache/go-resiliency/tree/master/deadline