Oct 15, 2016

[paper][distributed computing] Large-scale cluster management at Google with Borg - note

Large-scale cluster management at Google with Borg - paper

Borg:

Goals:
High utilization, via admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation.
High availability, via runtime features that minimize fault-recovery time and scheduling policies that reduce the probability of correlated failures.


Solution:
Declarative job specification language (a rough sketch follows this list)
Name service integration
Real-time job monitoring
Tools to analyze and simulate system behavior
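A rough Python rendering of the kind of job such a declarative specification might describe, loosely modeled on the paper's hello-world example; the field names and values here are illustrative, not real BCL syntax:

    # A Python-dict sketch of a declarative Borg job spec (illustrative only).
    hello_world_job = {
        "name": "hello_world",
        "cell": "ic",                        # which cell should run it
        "binary": ".../hello_world_webserver",
        "args": {"port": "%port%"},          # port assigned by Borg at runtime
        "requirements": {                    # per-task resource requests
            "cpu": 0.1,                      # cores
            "ram": 100 * 2**20,              # bytes
            "disk": 100 * 2**20,
        },
        "replicas": 10000,                   # number of tasks
        "priority": 100,                     # illustrative priority value
        "constraints": {"os_version": "10.2"},  # hard or soft constraints
    }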



Cell:
    Set of machines in a single cluster.
   
Job:
    Set of tasks that all run the same binary.
    Jobs can have constraints to force its tasks to run on machines with particular attributes such as
    processor architecture, OS version, or an external IP address.
    Constraints can be hard or soft; the latter act like preferences
    rather than requirements.
    The start of a job can be deferred until a prior one finishes. A job runs in just one cell.
    Job has a priority.
   
Task:
    Each task maps to a set of Linux processes running in a container on a machine.
    A task has properties too, such as its resource requirements and the task’s index within the job.
    Most task properties are the same across all tasks in a job, but can be overridden –
    e.g., to provide task-specific command-line flags.
    Each resource dimension (CPU cores, RAM, disk space,
    disk access rate, TCP ports, etc.) is specified independently
    at fine granularity; Borg doesn't impose fixed-sized buckets or slots.
    *  A job's tasks can be updated to a new specification while running.
    *  Updates are generally done in a rolling fashion,
        and a limit can be imposed on the number of task disruptions.
    Tasks can ask to be notified via a Unix SIGTERM signal before they are preempted by a SIGKILL,
    so they have time to clean up, save state, finish any currently-executing
    requests, and decline new ones.
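A minimal sketch of that SIGTERM-then-SIGKILL contract from a task's point of view, assuming a simple serving loop; the drain logic is illustrative:

    import signal
    import sys
    import time

    draining = False

    def on_sigterm(signum, frame):
        # Borg delivers SIGTERM first; flip a flag so the serving loop can
        # decline new work, finish in-flight requests, and save state.
        global draining
        draining = True

    signal.signal(signal.SIGTERM, on_sigterm)

    while True:
        if draining:
            # clean up / checkpoint here, then exit before the SIGKILL lands
            sys.exit(0)
        time.sleep(1)  # stand-in for serving one request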

   

Workloads in a cell fall into two types:
    1. Long-running services that should never go down and that handle short-lived, latency-sensitive requests.
    2. Batch jobs.

The mix of workloads varies across cells depending on their major tenants.


Allocs:
A reserved set of resources on a machine in which one or more tasks can be run.
The resources remain assigned whether or not they are used.
Allocs can be used to:
1. set resources aside for future tasks,
2. retain resources between stopping a task and starting it again,
3. gather tasks from different jobs onto the same machine.
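A toy model of an alloc as a reservation that outlives the tasks placed in it; the class and field names are invented for illustration, not Borg's API:

    # Sketch: an alloc reserves resources whether or not they are used,
    # and tasks from different jobs can be placed inside it.
    class Alloc:
        def __init__(self, machine, cpu, ram):
            self.machine = machine
            self.cpu, self.ram = cpu, ram   # reserved even while idle
            self.tasks = []

        def place(self, task):
            used_cpu = sum(t["cpu"] for t in self.tasks)
            used_ram = sum(t["ram"] for t in self.tasks)
            if used_cpu + task["cpu"] <= self.cpu and used_ram + task["ram"] <= self.ram:
                self.tasks.append(task)
                return True
            return False   # does not fit inside the reservation

    a = Alloc("machine-17", cpu=2.0, ram=4 * 2**30)
    a.place({"job": "logsaver", "cpu": 0.5, "ram": 1 * 2**30})
    a.place({"job": "webserver", "cpu": 1.0, "ram": 2 * 2**30})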




Dominant Resource Fairness (DRF):
A fairness policy that tracks each user's "dominant share", the largest fraction of any single resource type allocated to that user, and offers the next allocation to the user with the smallest dominant share.
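A minimal sketch of the DRF idea, with made-up cluster totals and per-user usage; the helper names are illustrative:

    # DRF sketch: dominant share = max over resource types of the fraction
    # of the cluster a user holds; the smallest dominant share goes next.
    total = {"cpu": 9.0, "ram": 18.0}

    def dominant_share(used):
        return max(used[r] / total[r] for r in total)

    users = {
        "A": {"cpu": 3.0, "ram": 1.0},   # dominant share = 3/9 (CPU)
        "B": {"cpu": 1.0, "ram": 4.0},   # dominant share = 4/18 (RAM)
    }
    next_user = min(users, key=lambda u: dominant_share(users[u]))
    print(next_user)  # -> 'B', the user with the smallest dominant share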



Naming and monitoring:
Borg assigns every task a stable "Borg name service" (BNS) name, e.g. 50.jfoo.ubar.cc.borg.google.com for task 50 of job jfoo run by user ubar in cell cc.

Borg also writes job size and task health information into
Chubby whenever it changes, so load balancers can see
where to route requests.
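A toy parser for the BNS name format above, assuming the task-index.job.user.cell layout shown; purely illustrative:

    # Split a BNS name into its components (first four labels; the rest
    # is the fixed borg.google.com suffix).
    def parse_bns(name):
        index, job, user, cell = name.split(".")[:4]
        return {"task_index": int(index), "job": job, "user": user, "cell": cell}

    print(parse_bns("50.jfoo.ubar.cc.borg.google.com"))
    # {'task_index': 50, 'job': 'jfoo', 'user': 'ubar', 'cell': 'cc'}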


Borgmaster:
1.
Handles client RPCs that mutate state (e.g., create job).
2.
Provides read-only access to data (e.g., lookup job).
3.
Manages state machines for all of the objects in the system.
4.
Communicates with the Borglets (the per-machine agents): Borgmaster polls each Borglet every few seconds to retrieve the machine's current state and send it any outstanding requests.
This gives Borgmaster:
1) control over the rate of communication,
2) no need for an explicit flow-control mechanism,
3) protection against recovery storms.
5.
The elected master is responsible for preparing messages to send to the Borglets and for updating the cell's state with their responses. For performance scalability, each Borgmaster replica runs a stateless link shard to handle the communication with some of the Borglets; the partitioning is recalculated whenever a Borgmaster election occurs. For resiliency, the Borglet always reports its full state, but the link shards aggregate and compress this information by reporting only differences to the state machines, to reduce the update load at the elected master.
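A minimal sketch of that link-shard diffing idea: the Borglet reports full state, and only changed fields are forwarded to the master. The dictionary-shaped state is an assumption for illustration:

    # Forward only the fields that changed since the last full report.
    last_seen = {}   # machine -> last full state dict

    def diff_report(machine, full_state):
        prev = last_seen.get(machine, {})
        delta = {k: v for k, v in full_state.items() if prev.get(k) != v}
        last_seen[machine] = full_state
        return delta    # only this reaches the master's state machines

    print(diff_report("m1", {"tasks": 3, "free_cpu": 2.0}))  # first report: everything
    print(diff_report("m1", {"tasks": 3, "free_cpu": 1.5}))  # later: only free_cpu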

If a Borglet does not respond to several poll messages, its
machine is marked as down and any tasks it was running
are rescheduled on other machines. If communication is
restored the Borgmaster tells the Borglet to kill those tasks
that have been rescheduled, to avoid duplicates. A Borglet
continues normal operation even if it loses contact with the
Borgmaster, so currently-running tasks and services stay up
even if all Borgmaster replicas fail.


Agent (Borglet):
1.  starts and stops tasks
2.  restarts them if they fail;
3. manages local resources by manipulating OS kernel settings;
4. rolls over debug logs;
5. and reports the state of the machine to the Borgmaster and other monitoring systems.


Scheduling:
The scheduling algorithm has two parts:
1.
Feasibility checking, to find machines on which the task could run:
the scheduler finds a set of machines that meet the task's
constraints and also have enough "available" resources,
which includes resources assigned to lower-priority tasks that can be evicted.


2.
Scoring, which picks one of the feasible machines:
the scheduler determines the "goodness" of each feasible machine.
In practice, one scoring model, E-PVM, ends up spreading load across
all the machines, leaving headroom for load spikes, but at
the expense of increased fragmentation, especially for large
tasks that need most of the machine; the paper sometimes calls this
"worst fit".
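A minimal sketch of the two-phase scheduler, with a worst-fit scorer standing in for E-PVM's spreading behavior; the machine and task shapes are invented for illustration (eviction of lower-priority tasks is omitted):

    machines = [
        {"name": "m1", "free_cpu": 4.0, "free_ram": 8.0},
        {"name": "m2", "free_cpu": 1.0, "free_ram": 2.0},
        {"name": "m3", "free_cpu": 6.0, "free_ram": 4.0},
    ]
    task = {"cpu": 0.5, "ram": 1.0}

    # Phase 1: feasibility checking.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_ram"] >= task["ram"]]

    # Phase 2: scoring. Worst fit = most headroom after placement,
    # which spreads load but increases fragmentation.
    best = max(feasible, key=lambda m: m["free_cpu"] - task["cpu"])
    print(best["name"])   # -> 'm3'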


Scalability:
1.
added separate threads to talk to the Borglets and respond to read-only RPCs.
2.
sharded (partitioned) these functions across the five Borgmaster replicas
3.
Score caching: caches the scores until the properties of the machine or task change.
4.
Equivalence classes: tasks in a Borg job usually have identical requirements and constraints, so rather than determining feasibility for every pending task on every machine, and scoring all the feasible machines, Borg only does feasibility and scoring for one task per equivalence class, a group of tasks with identical requirements.
5.
Relaxed randomization: it is wasteful to calculate feasibility and scores for all the machines in a large cell, so the scheduler examines machines in a random order until it has found "enough" feasible machines to score, and then selects the best within that set. Relaxed randomization is somewhat akin to the batch sampling of Sparrow while also handling priorities, preemptions, heterogeneity, and the costs of package installation. (A sketch of items 4 and 5 follows this list.)
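A sketch combining equivalence classes and relaxed randomization, as promised above; the function signatures and the "enough" threshold are assumptions for illustration:

    import random

    # Relaxed randomization: examine machines in random order and stop
    # once "enough" feasible candidates are found, then score only those.
    def schedule(task, machines, feasible_fn, score_fn, enough=3):
        candidates = []
        for m in random.sample(machines, len(machines)):   # random order
            if feasible_fn(task, m):
                candidates.append(m)
                if len(candidates) >= enough:
                    break          # stop early instead of scanning the cell
        return max(candidates, key=lambda m: score_fn(task, m), default=None)

    # Equivalence classes: tasks with identical requirements share one
    # feasibility-and-scoring pass, keyed on those requirements.
    def equivalence_key(task):
        return (task["cpu"], task["ram"], tuple(sorted(task.get("constraints", []))))

    machines = [{"free_cpu": c} for c in (1.0, 2.0, 4.0, 0.2, 3.0)]
    best = schedule({"cpu": 0.5, "ram": 1.0}, machines,
                    feasible_fn=lambda t, m: m["free_cpu"] >= t["cpu"],
                    score_fn=lambda t, m: m["free_cpu"] - t["cpu"],
                    enough=2)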


Availability:
1.
replication,
2.
storing persistent state in a distributed file system
3.
taking occasional checkpoints.
4.
A key design feature in Borg is that already-running tasks
continue to run even if the Borgmaster or a task’s Borglet
goes down. But keeping the master up is still important
because when it is down new jobs cannot be submitted
or existing ones updated, and tasks from failed machines
cannot be rescheduled.


Lessons and future work:
The bad, and Kubernetes's remedies:
1.
Jobs are too restrictive as the only grouping mechanism for tasks; Kubernetes instead organizes scheduling units (pods) using labels:
arbitrary key/value pairs that users can attach to any object in the system. (A label-selection sketch follows this list.)
2.
One IP address per machine complicates port management; in Kubernetes, every pod and service gets its own IP address.
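A sketch of the label mechanism from remedy 1 above: arbitrary key/value pairs plus a selector that picks out matching objects. The pod records are invented for illustration:

    pods = [
        {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}},
        {"name": "web-2", "labels": {"app": "web", "tier": "frontend"}},
        {"name": "db-1",  "labels": {"app": "db",  "tier": "backend"}},
    ]

    # A set is defined by a selector, not by rigid job membership.
    def select(pods, **selector):
        return [p for p in pods
                if all(p["labels"].get(k) == v for k, v in selector.items())]

    print([p["name"] for p in select(pods, app="web")])   # ['web-1', 'web-2']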

Good:
1.
Allocs are useful.
2.
Cluster management is more than task management.
3.
Introspection is vital: surface debugging information to all users rather than hiding it.
The master can be queried for a snapshot of its objects' state.
4.
The master is the kernel of a distributed system.
5.
Kubernetes has an API server at its core that is responsible only for processing requests and manipulating the underlying state objects.

The cluster management logic is built as small, composable micro-services that are clients of this API server, such as the replication controller, which maintains the desired number of replicas of a pod in the face of failures, and the node controller, which manages the machine lifecycle.
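A minimal sketch of a replication-controller-style reconcile loop built as a client of such an API server; the returned actions are stand-ins, not a real client API:

    # Drive the actual replica count toward the desired count.
    def reconcile(desired, running):
        if len(running) < desired:
            return [("start", desired - len(running))]
        if len(running) > desired:
            return [("stop", len(running) - desired)]
        return []   # converged; nothing to do

    print(reconcile(3, ["pod-a", "pod-b"]))   # [('start', 1)]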
