Oct 22, 2015

[RAFT][Note] In Search of an Understandable Consensus Algorithm paper note.

Paper:
[RAFT] In Search of an Understandable Consensus Algorithm

Raft separates the key elements of consensus, such as
1. leader election,
2. log replication,
3. safety,
and it enforces a stronger degree of coherency to reduce the number of
states that must be considered.





Raft Arch:
1. Strong leader:
log entries only flow from the leader to other servers.

2. Leader election:
Raft uses randomized timers to elect leaders.

3. Membership changes:
Raft’s mechanism for changing the set of servers in
the cluster uses a new joint consensus approach
where the majorities of two different configurations
overlap during transitions. (I don't fully understand this part yet.)
This allows the cluster to continue operating
normally during configuration changes.



------------
The remainder of the paper introduces the replicated
state machine problem (Section 2),
describes our general approach to understandability (Section 4),
presents the Raft consensus algorithm (Sections 5–8),
evaluates Raft (Section 9), and discusses related work (Section 10).
-------------



Replicated state machines:
Replicated state machines are typically implemented using a replicated log.
(a single log holds a sequence of commands)

Each server stores a log containing a series of commands, which its
state machine executes in order.

Each log contains the same commands in the same order,
so each state machine processes the same sequence of commands.

Since the state machines are deterministic, each computes the
same state and the same sequence of outputs.

Keeping the replicated log consistent is the job of the
consensus algorithm. The consensus module on a server
receives commands from clients and adds them to its log.
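As a concrete picture, here is a minimal Go sketch of a deterministic state
machine replaying a log of commands in order; the key-value store and all
names are illustrative, not from the paper:

package main

import "fmt"

// Command is one entry's payload in the replicated log.
type Command struct {
	Key, Value string
}

// StateMachine is a toy deterministic state machine (a key-value store).
type StateMachine struct {
	data map[string]string
}

// Apply executes one command; because Apply is deterministic, every replica
// that applies the same log in the same order ends up with the same state.
func (sm *StateMachine) Apply(c Command) {
	sm.data[c.Key] = c.Value
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	sm := &StateMachine{data: map[string]string{}}
	for _, c := range log { // each server replays the log in order
		sm.Apply(c)
	}
	fmt.Println(sm.data) // map[x:3 y:2] on every replica
}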


Consensus algorithms for practical systems typically
have the following properties:
1. They ensure safety (never returning an incorrect result)
under all non-Byzantine conditions,
including network delays, partitions, and packet loss, duplication,
and reordering.

2. They are fully functional (available) as long as any
majority of the servers are operational and can communicate
with each other and with clients.

3. They do not depend on timing to ensure the consistency of the logs:
faulty clocks and extreme message delays can, at worst, cause availability problems.

4. In the common case, a command can complete as
soon as a majority of the cluster has responded to a
single round of remote procedure calls; a minority of
slow servers need not impact overall system performance.



The Raft consensus algorithm:
1. Elect a distinguished leader.
2. Give the leader complete responsibility for managing the replicated log.
3. The leader accepts log entries from clients and replicates them on other
servers.
4. The leader tells servers when it is safe to apply log entries to
their state machines. (Having a leader simplifies the management
of the replicated log.)
5. A leader can fail or become disconnected from the other servers, in which
case a new leader is elected.


Given the leader approach, Raft decomposes the consensus
problem into three relatively independent subproblems:

1. Leader election:
a new leader must be chosen when an existing leader fails

2.  Log replication:
the leader must accept log entries from clients and replicate them across the
cluster, forcing the other logs to agree with its own.

3. Safety:
the key safety property for Raft is the State Machine Safety Property.
    Election Safety: at most one leader can be elected in a given term.
    Leader Append-Only: a leader never overwrites or deletes entries in its log;
        it only appends new entries.
    Log Matching: if two logs contain an entry with the same index and term,
        then the logs are identical in all entries up through the given index.
    Leader Completeness: if a log entry is committed in a given term, then that
        entry will be present in the logs of the leaders for all higher-numbered
        terms.
    State Machine Safety: if a server has applied a log entry at a given index
        to its state machine, no other server will ever apply a different log
        entry for the same index.

     

Raft basics:
1. A Raft cluster contains several servers; five is a typical
number, which allows the system to tolerate two failures.

2. At any given time each server is in one of three states:
    leader (handles all client requests) (if a client contacts a follower, the
                    follower redirects it to the leader)
    follower (servers start up in follower state;
                    they issue no requests on
                    their own but simply respond to requests from leaders
                    and candidates.)
    candidate (used to elect a new leader)

3.  In normal operation there is exactly one leader and all of the other
servers are followers.

4. Raft divides time into terms of arbitrary length.
Terms are numbered with consecutive integers.

5. Each term begins with an election, in which one
or more candidates attempt to become leader.

6. If a candidate wins the election, then it
serves as leader for the rest of the term.

7. In some situations
an election will result in a split vote. In this case the term
will end with no leader; a new term (with a new election)
will begin shortly. Raft ensures that there is at most one
leader in a given term.

8. Different servers may observe the transitions between
terms at different times, and in some situations a server
may not observe an election or even entire terms.

9. Terms act as a logical clock in Raft, and they allow servers
to detect obsolete information such as stale leaders.

10. Each server stores a current term number, which increases
monotonically over time.

11. Current terms are exchanged
whenever servers communicate; if one server’s current
term is smaller than the other’s, then it updates its current
term to the larger value. If a candidate or leader discovers
that its term is out of date, it immediately reverts to follower state.

If a server receives a request with a stale term
number, it rejects the request.


12. Raft servers communicate using remote procedure calls
(RPCs), and the basic consensus algorithm requires only
two types of RPCs:
    *RequestVote RPCs are initiated by candidates during elections.
    *AppendEntries RPCs are initiated by leaders to replicate log entries and
        to provide a form of heartbeat.
    ~a third RPC for transferring snapshots between servers.
 
13. Servers retry RPCs if they do not receive a response in a timely manner,
and they issue RPCs in parallel for best performance.
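The two basic RPC message types from point 12 can be sketched as Go structs,
roughly following the fields listed in the paper (Figure 2); the Go types and
the LogEntry layout are my own choices:

package raft

// RequestVote is sent by candidates to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // currentTerm, for the candidate to update itself
	VoteGranted bool // true means the candidate received this server's vote
}

// AppendEntries is sent by the leader to replicate entries and as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // entries to store (empty for a heartbeat)
	LeaderCommit int        // leader's commit index
}

type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Term    int    // term in which the entry was received by the leader
	Command []byte // opaque state machine command
}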


Leader election:
1. Use a heartbeat mechanism to trigger leader election.
2. When servers start up, they begin as followers.
A server remains in follower state as long as it receives valid
RPCs from a leader or candidate.

3. Leaders send periodic heartbeats [broadcast? Unlikely, since these are RPC
calls. How does the leader find the followers? Or do followers register
themselves with the leader? How?]
(uses AppendEntries RPCs that carry no log entries)
to all followers in order to maintain their authority.

4. If *a follower* receives no communication over a period of time
called the [election timeout], then it assumes there is no viable leader and
begins an election to choose a new leader.

5. To begin an election, a follower increments its current
term and transitions to candidate state.

6. It then [votes for itself] and [issues RequestVote RPCs] in parallel to each
of the other servers in the cluster.

7. A candidate continues in
this state until one of three things happens:

(a) It wins the election. A candidate wins an election if it receives votes
from a majority of the servers in the full cluster for the same term. Each
server will vote for at most one candidate in a given term, on a
first-come-first-served basis. Once a candidate wins an election, it becomes
leader. It then sends heartbeat messages to all of the other servers to
establish its authority and prevent new elections.

(b) Another server establishes itself as leader. While waiting for votes, a
candidate may receive an AppendEntries RPC from another server claiming to
be leader. If the leader's term (included in its RPC) is at least as large
as the candidate's current term, then the candidate recognizes the leader as
legitimate and returns to follower state. If the term in the RPC is smaller
than the candidate's current term, then the candidate rejects the RPC and
continues in candidate state.

(c) A period of time goes by with no winner: the candidate neither wins nor
loses the election. If many followers become candidates at the same time,
votes could be split so that no candidate obtains a majority. When this
happens, each candidate will time out and start a new election by
incrementing its term and initiating another round of RequestVote RPCs.
However, without extra measures split votes could repeat indefinitely. Raft
uses randomized election timeouts to ensure that split votes are rare and
that they are resolved quickly: election timeouts are chosen randomly from a
fixed interval (e.g., 150–300ms).
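A small Go sketch of this randomized-timeout loop, assuming the paper's
example interval of 150–300ms; the channel, function, and variable names are
all illustrative:

package main

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout from a fixed interval
// (150-300ms, the example range from the paper).
func randomElectionTimeout() time.Duration {
	const low, high = 150 * time.Millisecond, 300 * time.Millisecond
	return low + time.Duration(rand.Int63n(int64(high-low)))
}

func main() {
	heartbeat := make(chan struct{}) // signalled when a valid AppendEntries arrives
	timer := time.NewTimer(randomElectionTimeout())
	for {
		select {
		case <-heartbeat:
			// the leader is alive: stay a follower and restart the timeout
			timer.Reset(randomElectionTimeout())
		case <-timer.C:
			// election timeout elapsed with no leader contact:
			// become a candidate, increment currentTerm, vote for self,
			// and send RequestVote RPCs in parallel (all omitted here)
			timer.Reset(randomElectionTimeout())
		}
	}
}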


Log replication:
1. Once a leader has been elected, it begins servicing
client requests. 
[Note: all client requests are handled by the leader,
including reads/removes/updates!]
[The purpose of log replication is fault tolerance!]

2. Each client request contains a command to
be executed by the replicated state machines.
The leader appends the command to its log as a new entry,
then issues AppendEntries RPCs in parallel to each of the other
servers to replicate the entry.

3. ***
When the entry has been safely replicated, the leader
applies the entry to its [state machine] and returns the result of that
execution to the client. [What exactly counts as safely replicated?]
Requiring a majority to replicate each log entry is also about fault
tolerance: it guarantees that, at the next leader election, servers holding
the latest committed entries are the ones that can win.

**
If followers crash or run slowly,
or if network packets are lost, the leader retries AppendEntries RPCs
indefinitely (even after it has responded to the client) until all followers
eventually store all log entries.


4. Each log entry stores a state machine command along with the term
number when the entry was received by the leader.

The term numbers in log entries are used to detect inconsistencies
between logs.

Each log entry also has an integer index identifying its position in the log.


5. The leader decides when it is safe to apply a log entry to the state
machines; such an entry is called committed. [How does it decide? See 6 below.]

6. *** Once a majority of the followers have stored a replica of a log entry,
it counts as safely replicated; such an entry is called committed!

A log entry is committed once the leader
that created the entry has replicated it on a majority of the servers.

The majority requirement pays off at leader election time: it makes it
possible to correctly elect only servers that are eligible (i.e. that hold
the committed entries).


This also commits all preceding entries in the leader’s log, including entries
created by previous leaders.

7. ***
The leader keeps track of the highest index it knows
to be committed, and it includes that index in future
AppendEntries RPCs (including heartbeats) so that the
other servers eventually find out. Once a follower learns
that a log entry is committed, it applies the entry to its
local state machine (in log order).
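A sketch of that follower-side behaviour, continuing the same illustrative
raft package and reusing the LogEntry type from the RPC sketch above; the
commitIndex/lastApplied names follow the per-server state in the paper's
Figure 2, and the log keeps a placeholder at index 0 so indexes match the
paper's 1-based scheme:

// Follower is an illustrative container for the per-server state used here.
type Follower struct {
	log          []LogEntry // placeholder entry at index 0; real entries start at 1
	commitIndex  int        // highest log index known to be committed
	lastApplied  int        // highest log index applied to the state machine
	stateMachine interface{ Apply(cmd []byte) }
}

// advanceCommit is what a follower does when an AppendEntries RPC reports
// the leader's commit index (leaderCommit).
func (f *Follower) advanceCommit(leaderCommit int) {
	// the follower can only treat entries it actually holds as committed
	if leaderCommit > f.commitIndex {
		f.commitIndex = leaderCommit
		if last := len(f.log) - 1; f.commitIndex > last {
			f.commitIndex = last
		}
	}
	// apply newly committed entries to the local state machine, in log order
	for f.lastApplied < f.commitIndex {
		f.lastApplied++
		f.stateMachine.Apply(f.log[f.lastApplied].Command)
	}
}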


High level of coherency between the logs on different servers:
1. If two entries in different logs have the same index
and term, then they store the same command.
2. If two entries in different logs have the same index
and term, then the logs are identical in all preceding entries.
**
The first property follows from the fact that a leader
creates at most one entry with a given log index in a given
term, and log entries never change their position in the
log.
[The leader stores at most one entry for a given index and term, and a log
entry keeps the same position in every log; the ordering never changes. This
is what the later log-consistency guarantees are built on!]

The second property is guaranteed by a simple consistency
check performed by AppendEntries. When sending
an AppendEntries RPC, the leader includes the index
and term of the entry in its log that immediately precedes
the new entries. If the follower does not find an entry in
its log with the same index and term, then it refuses the
new entries.
[That is, when the leader sends an AppendEntries RPC to replicate entries, it
attaches the index and term of the log entry just before them. The receiving
server checks whether that preceding entry exists in its own log; if it does
not, some entries were missed, and it rejects this replication request.]

**
As a result, whenever AppendEntries returns successfully,
the leader knows that the follower’s log is identical to its
own log up through the new entries. [So whenever a log-replication RPC from
the leader succeeds, that server's log is consistent with the leader's log.]
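That consistency check might look roughly like this on the follower side,
again reusing the Follower and AppendEntriesArgs types from the sketches
above; the term check from the Raft basics section and the paper's more
careful conflict handling are simplified away here:

// handleAppendEntries sketches only the log-consistency part of the RPC
// handler; it returns the Success flag the follower would send back.
func (f *Follower) handleAppendEntries(args AppendEntriesArgs) bool {
	// reject unless our log has an entry at PrevLogIndex whose term matches
	if args.PrevLogIndex >= len(f.log) ||
		f.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return false // the leader will back up nextIndex and retry
	}
	// simplified: drop everything after the matching point and append the
	// new entries (a real implementation only truncates on an actual conflict)
	f.log = append(f.log[:args.PrevLogIndex+1], args.Entries...)
	return true
}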

--------------------
What happens when the logs are inconsistent??
In Raft, the leader handles inconsistencies by forcing
the followers’ logs to duplicate its own.

Steps:
The leader and the follower find the latest log entry on which their two
logs agree. The follower deletes all of its entries after that point, and
the leader sends the follower all of its own entries after that point.

1. The leader maintains a nextIndex variable for each follower: the index of
the next log entry it will send to that follower.
2. The leader sends an AppendEntries RPC to the follower; if it comes back
rejected, the leader decrements nextIndex and sends it again.
3. Speed-up: when a follower cannot find a match for the preceding entry the
leader sent, it can return the term of the conflicting entry and the first
index it stores for that term, so the leader can decrease nextIndex past
that whole term at once instead of checking one entry per RPC round trip.
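A sketch of the leader's side of this retry loop, using the nextIndex
bookkeeping from the steps above; the Leader struct and the send callback are
illustrative stand-ins for the real RPC plumbing, and they build on the types
sketched earlier:

// Leader is an illustrative container for the leader-side state used below.
type Leader struct {
	id          int         // this leader's server id
	currentTerm int
	commitIndex int
	log         []LogEntry  // placeholder at index 0, as in the follower sketch
	nextIndex   map[int]int // per-follower index of the next entry to send
}

// replicateTo keeps sending AppendEntries to one follower until the
// consistency check passes; send stands in for the actual RPC call and
// returns the follower's Success flag.
func (l *Leader) replicateTo(followerID int, send func(AppendEntriesArgs) bool) {
	for {
		next := l.nextIndex[followerID]
		args := AppendEntriesArgs{
			Term:         l.currentTerm,
			LeaderID:     l.id,
			PrevLogIndex: next - 1,
			PrevLogTerm:  l.log[next-1].Term,
			Entries:      l.log[next:],
			LeaderCommit: l.commitIndex,
		}
		if send(args) {
			// success: the follower's log now matches the leader's
			l.nextIndex[followerID] = len(l.log)
			return
		}
		// consistency check failed: back up one entry and retry
		// (the speed-up in step 3 lets this skip a whole term at once)
		l.nextIndex[followerID]--
	}
}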

*** A leader never overwrites or deletes entries in its own log ***
---------------------

Safety:
which servers may be elected leader???

The restriction ensures that the leader for any given term (term_N)
contains all of the entries committed in previous terms (term_N-1, term_N-2, ...).

Election restriction:

1.  guarantees that all the committed entries from previous
terms are present on each new leader from the moment of
its election, without the need to transfer those entries to
the leader.

2.  log entries only flow in one direction, from leaders to followers, and
leaders never overwrite existing entries in their logs.

3. Implementation:
    a. The voting process guarantees that a winning candidate holds all of
    the entries committed in previous terms.
    b. A candidate must contact a majority of the cluster in order to win.
    c. The RequestVote RPC implements this: the request carries the index
    and term of the candidate's last log entry, and a voter denies its vote
    if its own log is more up-to-date than the candidate's; such a candidate
    is missing committed entries from previous terms and cannot become
    leader.
    d. Raft determines which of two logs is more up-to-date
by comparing the index and term of the last entries in the
logs. If the logs have last entries with different terms, then
the log with the later term is more up-to-date. If the logs
end with the same term, then whichever log is longer is
more up-to-date.
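
Point d is easy to express directly; a minimal sketch of the comparison a
voter could apply when handling a RequestVote RPC (the function and argument
names are mine, not from the paper):

// candidateIsUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's, using the last-entry comparison from point d.
func candidateIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // later last term wins
	}
	return candLastIndex >= myLastIndex // same last term: longer log wins (a tie is fine)
}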




