Mar 24, 2023

[Ray][Notes] Papers and Design Doc

References:
  • Distributed ref counting protocol
  • Ownership: A Distributed Futures System for Fine-Grained Tasks


Ownership: A Distributed Futures System for Fine-Grained Tasks

The paper focuses on executing remote, fine-grained tasks and provides:

  1. Remote/distributed futures, built on the concept behind Rust's ownership model combined with C++-style reference counting (a minimal Ray example follows this list).
  2. Scheduling of fine-grained tasks that provides fault tolerance without sacrificing performance.
  3. Ownership via a leader, without consensus (better performance, lower latency).
  4. Horizontal scaling based on the leader concept (the leader holds the metadata), with localized storage (the object data lives apart from the leader, i.e., it is remote storage from the leader's perspective).
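
For orientation, here is a minimal example in Ray's Python API of the distributed-futures idea in point 1. The calls used (ray.init, @ray.remote, .remote(), ray.get) are the standard public API; the ownership and reference-counting bookkeeping described in these notes happens behind them.

```python
import ray

ray.init()  # start a local Ray instance

@ray.remote
def add(a, b):
    # A fine-grained task: it runs on some worker in the cluster.
    return a + b

# The call returns immediately with a distributed future (an ObjectRef).
# The calling worker is the owner of this future and its metadata.
ref = add.remote(1, 2)

# Dereference the future; this blocks until the remote task has produced the value.
print(ray.get(ref))  # 3
```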


First, focus on remote futures.

An object and its metadata are shared among:
  1. Its reference holders,
  2. The RPC executor that creates the object,
  3. The physical locations that dereference the data and process it.

Ray uses RPC (gRPC) as the carrier for network communication. gRPC normally works copy-by-value; in Ray, however, the 'value' on the wire is redefined as metadata (reference counts, IP/location information, etc.), not the true 'value' of the object.

The idea is straightforward: the caller that issues the remote execution owns the future's metadata, i.e., the caller has 'ownership' of the future returned by the remotely executed task.

The returned future's 'value' should contain metadata such as the remote worker's IP address/port, the object ID, and the reference count.
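
As a hedged sketch only (the field names below are hypothetical, not Ray's actual internal tables), the metadata carried with a returned future might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class DFutMetadata:
    """Hypothetical sketch of the metadata a returned distributed future carries."""
    object_id: bytes     # ID of the (possibly not yet created) object
    owner_ip: str        # IP address of the worker that owns the future
    owner_port: int      # port of the owner's RPC (gRPC) endpoint
    ref_count: int = 1   # references known to the owner (holders + borrowers)
    locations: set = field(default_factory=set)  # nodes holding copies of the value
```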




Second, focus on recovery.

  1. The design guarantees that, as long as the owner of a future is alive, any task holding a reference to that future can eventually dereference the value. But what if the owner fails?
  2. The concept of 'lineage reconstruction': a task that holds a reference to the 'future' shares the same lineage, rooted at the parent owner.
  3. Thus, if a task fails, it can be recreated by its remote owner and fired up on a node again (a sketch of this follows the list).
  4. Thus, it is safe to 'fate-share' the remote future with the remote task, since both have the same owner.
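
A hedged sketch of lineage reconstruction from the owner's side; every name here (Owner, resubmit, the dictionaries) is hypothetical and only illustrates the idea, not Ray's actual implementation:

```python
# Hypothetical owner-side recovery, for illustration only.
class Owner:
    def __init__(self):
        self.lineage = {}     # object_id -> task spec that produces the object
        self.locations = {}   # object_id -> node currently producing/holding it

    def on_node_failed(self, failed_node):
        # Objects whose producing task or only copy was on the failed node are
        # rebuilt by replaying their lineage. The task and the future it
        # produces fate-share, because both have the same owner.
        for object_id, node in list(self.locations.items()):
            if node == failed_node:
                self.resubmit(self.lineage[object_id])

    def resubmit(self, task_spec):
        # Re-run the task on a healthy node; lost arguments are reconstructed
        # recursively through their own lineage entries.
        ...
```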




API:

  • All objects in the Ray system are immutable.
  • The concept of Rust's 'borrower': borrow the future instead of copying it (copying would increase the reference count); see the example after this list.
  • A borrower temporarily owns the future; in Ray, the borrower has the signature called "SharedDFut".
  • DFut, a.k.a. Distributed Future.
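
In the Python API the borrower shows up when an ObjectRef is handed to another task. This example uses only the standard Ray API; the "SharedDFut" bookkeeping stays internal:

```python
import ray

ray.init()

@ray.remote
def produce():
    return [0] * 1000

@ray.remote
def consume_value(values):
    # Passed directly as an argument: Ray dereferences the future first,
    # so this task sees the plain value.
    return len(values)

@ray.remote
def consume_ref(wrapped):
    # Passed inside a container: this task receives the ObjectRef itself and
    # becomes a borrower; it dereferences explicitly when it needs the value.
    return len(ray.get(wrapped[0]))

ref = produce.remote()           # the caller owns this future
a = consume_value.remote(ref)    # callee gets the value
b = consume_ref.remote([ref])    # callee borrows the reference
print(ray.get([a, b]))           # [1000, 1000]
```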



Failure detection:

  1. Automatic memory management through reference counting.
  2. The system detects when a DFut cannot be dereferenced due to a worker/node failure.
  3. The system has to record the locations of all tasks that create DFuts (the value might not even exist yet), i.e., of all pending objects (see the sketch after this list).
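
A hedged sketch of the owner-side bookkeeping this implies (hypothetical names; the real mechanism is the distributed ref counting protocol referenced at the top of these notes):

```python
# Hypothetical detection bookkeeping, not Ray's actual data structures.
class ReferenceTable:
    def __init__(self):
        self.ref_counts = {}      # object_id -> number of live references/borrowers
        self.task_locations = {}  # object_id -> node where the producing task runs

    def add_reference(self, object_id):
        self.ref_counts[object_id] = self.ref_counts.get(object_id, 0) + 1

    def remove_reference(self, object_id):
        self.ref_counts[object_id] -= 1
        if self.ref_counts[object_id] == 0:
            # No holder left anywhere: the (possibly pending) object can be freed.
            del self.ref_counts[object_id]
            self.task_locations.pop(object_id, None)

    def lost_objects(self, failed_node):
        # Still-referenced objects whose producing task ran on the failed node
        # can no longer be dereferenced; they must be flagged for recovery.
        return [oid for oid, node in self.task_locations.items()
                if node == failed_node and oid in self.ref_counts]
```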

Failure recovery:

  1. Recover from a failed DFut (at minimum, throw when dereferencing the DFut fails; note this is not 'getting' the value, it is 'dereferencing' the DFut).
  2. If a DFut is passed by reference, recovery from a failure should be based on the runtime building each object's lineage, i.e., the subgraph of tasks that produced the object. Using the subgraph as an event source, the runtime can 'replay' it by recreating the objects that are needed. The subgraph is more lightweight than logging.
  3. Multiple techniques are used for recovering state: 1. read-only state, 2. checkpointable state, 3. transient state (a hedged sketch follows).
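
A hedged sketch of how the three state-recovery techniques in point 3 might be dispatched; the names and methods are entirely hypothetical, since the notes only list the categories:

```python
# Hypothetical dispatch over the three kinds of state, for illustration only.
def recover_state(state):
    if state.kind == "read_only":
        # Read-only state never changed, so it can simply be reloaded.
        return state.reload()
    elif state.kind == "checkpointable":
        # Restore the latest checkpoint, then replay the lineage recorded
        # since that checkpoint to roll forward.
        return state.load_latest_checkpoint().replay_lineage()
    else:  # "transient"
        # Transient state is acceptable to lose; recreate it from scratch.
        return state.recreate()
```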

Metadata:

  1. Location: if using TCP/IP, the IP/port/host domain name (DNS cache).
  2. Whether the object is still referenced (reference count).
  3. Location of a pending object (i.e., the location of the producing task).
  4. Object lineage.
  5. Stored on the local node's SSD (e.g., cachelib, mmap).


Use with Actor Model 


Schedule with ownership.


Memory management:

  1. Small objects are copied by value (same as in systems languages such as C++ and Rust).
  2. A large object's first copy is known as the primary copy; it is pinned on the host that created it until the owner releases it
    (stored in cachelib, for example; see the illustration below).
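
A small illustration using the public Ray API; the exact size cutoff between "small" and "large" is an internal Ray detail and is not stated here:

```python
import ray
import numpy as np

ray.init()

@ray.remote
def small():
    # Tiny return value: cheap enough to copy by value back to the owner,
    # like passing a small struct by value in C++ or Rust.
    return 42

@ray.remote
def large():
    # Large return value: kept in the object store of the node that ran this
    # task. That first copy is the primary copy and stays pinned on that host
    # until the owner drops its reference.
    return np.zeros((1000, 1000))

s, l = small.remote(), large.remote()
print(ray.get(s))        # 42
print(ray.get(l).shape)  # (1000, 1000)

del l  # owner releases the reference; the primary copy may now be freed
```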

