Aug 17, 2021

[monitoring] Book: Monitoring Distributed Systems

Monitoring

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.
  • White-box monitoring (k8s readiness probe)
    Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics (see the sketch after this list).
  • Black-box monitoring (k8s liveness probe)
    Testing externally visible behavior as a user would see it.
  • Dashboard
    An application (usually web-based) that provides a summary view of a service’s core metrics. A dashboard may have filters, selectors, and so on, but is prebuilt to expose the metrics most important to its users. The dashboard might also display team information such as ticket queue length, a list of high-priority bugs, the current on-call engineer for a given area of responsibility, or recent pushes.
  • Alert
    A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager. Respectively, these alerts are classified as tickets, email alerts, and pages.
  • Root cause
    A defect in a software or human system that, if repaired, instills confidence that this event won’t happen again in the same way. A given incident might have multiple root causes: for example, perhaps it was caused by a combination of insufficient process automation, software that crashed on bogus input, and insufficient testing of the script used to generate the configuration. Each of these factors might stand alone as a root cause, and each should be repaired.
  • Node (or machine)
    Used interchangeably to indicate a single instance of a running kernel in either a physical server, virtual machine, or container. There might be multiple services worth monitoring on a single machine. The services may either be:
    • Related to each other: for example, a caching server and a web server
    • Unrelated services sharing hardware: for example, a code repository and a master for a configuration system like Puppet or Chef
  • Push
    Any change to a service’s running software or its configuration.
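
A minimal sketch of the white-box/black-box split above, assuming a Go HTTP service. The /statz path and the requests_handled counter are illustrative names of my own, not from the book: /statz exposes internal statistics (white-box), while /healthz answers only the externally visible "is the process serving?" question that a black-box check such as a Kubernetes liveness probe asks.

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// handled is internal state; only white-box monitoring can see it.
var handled atomic.Int64

func main() {
	// User-facing endpoint.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		handled.Add(1)
		fmt.Fprintln(w, "hello")
	})

	// White-box: emit internal statistics for the monitoring system to read.
	http.HandleFunc("/statz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "requests_handled %d\n", handled.Load())
	})

	// Black-box target: reports only what an external prober (or a k8s
	// liveness probe) can observe, i.e. whether the process is serving at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}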


Why Monitor?

  • Analyzing long-term trends
  • Comparing over time or experiment groups
  • Alerting
  • Building dashboards (“The Four Golden Signals”: latency, traffic, errors, and saturation; see the sketch after this list.)
  • Conducting ad hoc retrospective analysis (i.e., debugging)
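
As a deliberately minimal sketch of tracking the four golden signals, the Go program below counts traffic and errors, records the latency of the most recent request, and uses in-flight requests as a rough saturation proxy. The metric names and the expvar-based export at /debug/vars are my own assumptions, not something the book prescribes.

package main

import (
	"expvar"
	"log"
	"net/http"
	"time"
)

var (
	requests  = expvar.NewInt("traffic_requests_total") // traffic
	errs      = expvar.NewInt("errors_total")           // errors
	latencyMs = expvar.NewFloat("latency_last_ms")      // latency (last request)
	inFlight  = expvar.NewInt("saturation_in_flight")   // saturation proxy
)

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and records the four signals per request.
func instrument(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		requests.Add(1)
		inFlight.Add(1)
		defer inFlight.Add(-1)

		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next(rec, r)

		latencyMs.Set(float64(time.Since(start).Milliseconds()))
		if rec.status >= 500 {
			errs.Add(1)
		}
	}
}

func main() {
	http.HandleFunc("/work", instrument(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	// Importing expvar registers /debug/vars, which serves all counters above.
	log.Fatal(http.ListenAndServe(":8080", nil))
}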

Monitoring and alerting enables a system to tell us when it’s broken, or perhaps to tell us what’s about to break. When the system isn’t able to automatically fix itself, we want a human to investigate the alert, determine if there’s a real problem at hand, mitigate the problem, and determine the root cause of the problem.

In general, Google has trended toward simpler and faster monitoring systems, with better tools for post hoc analysis. We avoid “magic” systems that try to learn thresholds or automatically detect causality.

To keep noise low and signal high, the elements of your monitoring system that direct to a pager need to be very simple and robust.

Rules that generate alerts for humans should be simple to understand and represent a clear failure.

Your monitoring system should address two questions:
  • What’s broken?
  • Why?

“What” versus “why” is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
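
For example, a symptom-oriented (“what”) paging rule can be as simple as a ratio check over user-visible errors, with the cause (“why”, say a database refusing connections) left to dashboards and debugging. A minimal Go sketch follows; the 1% threshold and the evaluation window are illustrative assumptions.

package main

import "fmt"

// shouldPage reports whether the user-visible error ratio over the
// evaluation window exceeds the paging threshold.
func shouldPage(total, serverErrors int, threshold float64) bool {
	if total == 0 {
		return false // no traffic in the window; nothing user-visible to alert on
	}
	return float64(serverErrors)/float64(total) > threshold
}

func main() {
	// Example window: 10,000 requests, 150 of them 5xx -> 1.5% error ratio.
	fmt.Println(shouldPage(10000, 150, 0.01)) // true: page on the symptom
}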

In Google’s experience, basic collection and aggregation of metrics, paired with alerting and dashboards, has worked well as a relatively standalone system.

When creating rules for monitoring and alerting, asking the following questions can help you avoid false positives and pager burnout:

  • Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
  • Will I ever be able to ignore this alert, knowing it’s benign? When and why will I be able to ignore this alert, and how can I avoid this scenario?
  • Does this alert definitely indicate that users are being negatively affected? Are there detectable cases in which users aren’t being negatively impacted, such as drained traffic or test deployments, that should be filtered out?
  • Can I take action in response to this alert? Is that action urgent, or could it wait until morning? Could the action be safely automated? Will that action be a long-term fix, or just a short-term workaround?
  • Are other people getting paged for this issue, therefore rendering at least one of the pages unnecessary?

These questions reflect a fundamental philosophy on pages and pagers:

  • Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
  • Every page should be actionable.
  • Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
  • Pages should be about a novel problem or an event that hasn’t been seen before.
