Replication in Distributed Systems

System Model

  • Assume an asynchronous system in which processes fail only by crashing
  • Replica managers hold the replicas on a computer and perform operations on them directly
  • The front end issues requests to one or more replica managers
  • Coordination: replica managers coordinate so that requests are processed in a consistent order (FIFO, causal or total)
  • Response: the front end may pass back the first result for responsiveness, or the reply given by a majority of replica managers to tolerate Byzantine failures

Group Communication:

Group Membership Service:

  1. Interface for group membership changes (join/leave operations)
  2. Failure detector: monitors members for failure and suspicion
  3. Notifying members of membership changes
  4. Group address expansion: expands a group address into the current list of members

IP multicast provides all of these except 2 and 3 (no failure detection or membership-change notification).

Group views: Ordered lists of current group members.

A member can be excluded if it is suspected of having failed, even if it has not actually crashed.

View Delivery:

  • Instead of each process querying every member, views of the group are delivered to the application as membership changes occur
  • View delivery satisfies Order, Integrity and Non-triviality (if two processes join and can communicate they eventually appear in each other's views, and if the group partitions in two the views should reflect the split)

View-Synchronous Group Communication:

  • Agreement: correct processes deliver the same set of messages in any given view
  • Integrity: a message is delivered at most once
  • Validity: correct processes always deliver the messages they send; if a message is not delivered to some process, the surviving members are notified by a new view that excludes it

Fault-Tolerant Services

Linearizability:

  • The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
  • The order of operations is consistent with the real times at which they occurred

Sequential Consistency:

  • The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
  • The order of operations is consistent with the program order in which each individual client executed them (see the worked example below)
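
A small worked example may help (illustrative Python, not tied to any particular system): a two-client history on a single register that is sequentially consistent but not linearizable, checked against a single-copy register specification.

    def legal_for_register(seq, initial=0):
        # Check a sequence of (op, arg, result) against a single-copy register spec.
        value = initial
        for op, arg, result in seq:
            if op == "write":
                value = arg
            elif op == "read" and result != value:
                return False
        return True

    # Two clients A and B operate on register x (initially 0). In real time,
    # A's write(x, 1) completes before B's read(x) starts, yet the read returns 0.
    real_time_order = [("write", 1, None), ("read", None, 0)]
    print(legal_for_register(real_time_order))   # False -> not linearizable

    # Moving the read before the write still respects each client's program order
    # (one operation each), and the result is legal for a single copy of x:
    reordering = [("read", None, 0), ("write", 1, None)]
    print(legal_for_register(reordering))        # True -> sequentially consistent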

Passive (primary-backup) Replication

  • Linearizability is retained only if the primary is replaced by a unique backup; otherwise the result could be incorrect
  • The surviving replica managers must agree on which operations had been performed when the backup takes over
  • Works even for non-deterministic behaviour (e.g. multi-threaded applications), because only the primary executes the requests and propagates the results
  • Requires f + 1 replica managers to tolerate f crash failures (see the sketch below)
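
A minimal primary-backup sketch, assuming crash failures only and a simple key-value state (class and method names are illustrative):

    class ReplicaManager:
        def __init__(self):
            self.state = {}                 # simple key-value application state

        def apply(self, op, key, value=None):
            if op == "put":
                self.state[key] = value
            return self.state.get(key)

    class Primary(ReplicaManager):
        def __init__(self, backups):
            super().__init__()
            self.backups = backups          # f backups: f + 1 replica managers in total

        def handle(self, op, key, value=None):
            result = self.apply(op, key, value)          # execute once, even if non-deterministic
            if op == "put":                              # propagate the resulting state, not the request
                for b in self.backups:
                    b.apply("put", key, self.state[key])
            return result                                # reply only after the backups are updated

    backups = [ReplicaManager() for _ in range(2)]
    primary = Primary(backups)
    primary.handle("put", "x", 42)
    print(backups[0].state)                              # {'x': 42}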

Active Replication

  • Each request is multicast to all the replica managers and executed by every one of them
  • This scheme achieves sequential consistency
  • The total order in which the replica managers process requests is not necessarily the same as the real-time order in which the clients issued them, so it is not linearizable
  • Requires 2f + 1 replica managers to tolerate f Byzantine failures; the front end accepts a value returned by f + 1 of them (see the sketch below)
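
A sketch of how a front end might combine replies under Byzantine failures: with 2f + 1 replica managers and at most f faulty ones, any value backed by f + 1 identical replies comes from at least one correct replica manager (illustrative Python):

    from collections import Counter

    def combine_replies(replies, f):
        # Return a value reported by at least f + 1 replica managers, if any.
        for value, count in Counter(replies).items():
            if count >= f + 1:
                return value
        return None                     # not enough agreement yet; wait for more replies

    print(combine_replies([7, 7, 3], f=1))   # 7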

Highly Available Services

The active-replication behaviour of waiting for all the replica managers to reply is undesirable for highly available operation; clients should be able to proceed after contacting as few replica managers as possible.

Gossip Architecture:

Two basic operations: queries (read-only) and updates (modify the state without reading it)

The system guarantees the following, even if replica managers are temporarily unable to communicate with each other:

  • Each client obtains consistent service over time: replica managers only provide a client with data that reflects at least the updates that client has observed so far
  • Relaxed consistency between replicas: all replica managers eventually receive all updates and apply them in an order that is consistent enough for the needs of the application

Causal update ordering is supported by default; forced (total and causal) and immediate ordering are also available:

  • Causal ordering: e.g. posting items on a bulletin board, so that "Oranges" and the reply "Re: Oranges" appear in that order
  • Forced ordering: e.g. adding members, so that every replica applies the joins in the same order
  • Immediate ordering: applied in a consistent order relative to every other update, whether or not they are related by happened-before; e.g. deleting a member, so that a removed user cannot still appear active at a lagging replica

Outline of a gossip service:

  1. Request: the front end sends each request to a single replica manager. Queries block the client; updates return as soon as possible, although the client may be held until the update has reached f + 1 replica managers so that it survives up to f failures
  2. Update response: the replica manager replies as soon as it has received the update
  3. Coordination: the request is held until it can be applied in the required order; this may mean waiting for updates from other replica managers to arrive in gossip messages
  4. Execution: the replica manager executes the request
  5. Query response: if the request was a query, the replica manager replies at this point
  6. Agreement: replica managers bring one another up to date by exchanging gossip messages, occasionally and in a lazy fashion, or when a replica manager finds out that it is missing an update it needs

Front end’s timestamp:

  • Each front end keeps its own vector timestamp, passes it with every request to a replica manager, and merges in the timestamp returned with the reply (see the sketch below)
  • Front ends also pass timestamps on to clients, so that causal relationships are preserved when clients communicate with each other directly
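
A sketch of the front-end side, assuming vector timestamps are plain lists with one entry per replica manager; merging is a component-wise maximum (class and method names are illustrative):

    def merge(ts_a, ts_b):
        # Component-wise maximum of two vector timestamps.
        return [max(a, b) for a, b in zip(ts_a, ts_b)]

    class FrontEnd:
        def __init__(self, n_replica_managers):
            self.prev = [0] * n_replica_managers    # latest data version observed by this client

        def on_reply(self, returned_ts):
            # Merge the timestamp returned by the replica manager, so later
            # requests reflect everything this client has observed so far.
            self.prev = merge(self.prev, returned_ts)

    fe = FrontEnd(3)
    fe.on_reply([1, 0, 2])
    fe.on_reply([0, 3, 1])
    print(fe.prev)                                  # [1, 3, 2]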

Replica Manager State:

  • Value: each replica manager is a state machine; the value changes only by applying update operations
  • Value timestamp: vector timestamp representing the updates reflected in the value; one entry per replica manager
  • Update log: an update is kept in the log either (1) because it cannot be applied yet (it is not yet stable) or (2) because, although applied, it may not yet have reached all other replica managers
  • Replica timestamp: vector timestamp representing the updates accepted into the log; may differ from the value timestamp
  • Executed operation table: front-end-supplied unique identifiers of applied updates, used to prevent an update being applied twice when it arrives both from a front end and from other replica managers
  • Timestamp table: timestamps received from the other replica managers in gossip messages, used to establish when an update has been applied at all replica managers
  • In these vector timestamps, the ith entry counts the updates received via the ith replica manager
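
A sketch of this per-replica-manager state as Python data classes (field names are illustrative, mirroring the bullets above):

    from dataclasses import dataclass, field

    @dataclass
    class LogRecord:
        origin: int       # index i of the replica manager that accepted the update
        ts: list          # timestamp assigned to the update
        op: tuple         # the update operation itself
        prev: list        # the front end's timestamp sent with the update
        uid: str          # front-end-supplied unique identifier

    @dataclass
    class ReplicaManagerState:
        index: int                                     # this replica manager's own index i
        value: dict = field(default_factory=dict)      # the application state
        value_ts: list = field(default_factory=list)   # updates reflected in the value
        replica_ts: list = field(default_factory=list) # updates accepted into the log
        log: list = field(default_factory=list)        # LogRecords not yet known everywhere
        executed: set = field(default_factory=set)     # uids of applied updates (no duplicates)
        ts_table: dict = field(default_factory=dict)   # last replica timestamp seen per peer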

Query Operations:

  • A query carries a timestamp reflecting the latest version of the value that the front end has read or submitted as an update
  • The replica manager must return a value at least as recent as that timestamp indicates; otherwise it holds the query until the missing updates arrive (see the sketch below)
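
A sketch of the query check (function and parameter names are illustrative), assuming vector timestamps are plain lists: the query can be answered once the front end's timestamp is covered by the value timestamp.

    def handle_query(value, value_ts, q_prev):
        # Answer only once the value reflects everything the front end has seen,
        # i.e. q_prev <= value_ts in every component.
        if all(q <= v for q, v in zip(q_prev, value_ts)):
            return value, value_ts          # value_ts is returned and merged by the front end
        return None                         # hold the query until missing updates arrive by gossip

    print(handle_query({"x": 1}, [2, 1, 0], [1, 1, 0]))   # ({'x': 1}, [2, 1, 0])
    print(handle_query({"x": 1}, [2, 1, 0], [1, 2, 0]))   # None -> wait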

Update operations in causal order:

The front end assigns the update a unique identifier and attaches its own timestamp.

The replica manager checks whether the identifier has already been processed; if not, it increments its own (ith) entry of its replica timestamp, builds the update's timestamp from it, records the update in its log, and sends the new timestamp back to the front end (see the sketch below).

The front end merges the returned timestamp into its own.
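
A sketch of the acceptance step at replica manager i (names illustrative): the update's timestamp is the front end's timestamp with the ith entry replaced by the incremented replica-timestamp entry.

    def handle_update(i, replica_ts, log, executed, uid, op, u_prev):
        # Accept a causal-order update at replica manager i (mutates replica_ts and log).
        if uid in executed or any(rec["uid"] == uid for rec in log):
            return None                     # duplicate: already received from a FE or by gossip
        replica_ts[i] += 1                  # one more update accepted by replica manager i
        ts = list(u_prev)
        ts[i] = replica_ts[i]               # unique timestamp: u_prev with the ith entry replaced
        log.append({"uid": uid, "ts": ts, "op": op, "prev": list(u_prev), "origin": i})
        return ts                           # returned to the front end, which merges it

    replica_ts, log, executed = [0, 0, 0], [], set()
    print(handle_update(0, replica_ts, log, executed, "fe1:1", ("put", "x", 1), [0, 0, 0]))
    # [1, 0, 0]

The update becomes stable once its prev timestamp is covered by the value timestamp; it is then applied, the value timestamp is merged with the update's timestamp, and the identifier is recorded in the executed operation table.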

Forced and immediate update operations:

  • Forced: a primary replica manager appends a unique sequence number to the timestamp, and the updates are processed in that order
  • Immediate: the primary replica manager synchronizes with the other replica managers to agree on the update's position in the ordering

Gossip Messages:

  • A gossip message contains the sender's log and its replica timestamp
  • The receiver merges the arriving log with its own log
  • It applies any updates that have become stable and have not been executed before
  • It eliminates records from the log once they are known to have been applied everywhere (see the sketch below)
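
A sketch of processing an incoming gossip message, assuming a dict-based state with the fields sketched above and a caller-supplied apply_update function (all names illustrative):

    def leq(a, b):
        # True if vector timestamp a <= b in every component.
        return all(x <= y for x, y in zip(a, b))

    def merge_ts(a, b):
        return [max(x, y) for x, y in zip(a, b)]

    def receive_gossip(state, gossip_log, gossip_replica_ts, apply_update):
        # 1. Merge the arriving log into ours, skipping records already held,
        #    already executed, or already covered by our replica timestamp.
        held = {rec["uid"] for rec in state["log"]} | state["executed"]
        for rec in gossip_log:
            if rec["uid"] not in held and not leq(rec["ts"], state["replica_ts"]):
                state["log"].append(rec)
        state["replica_ts"] = merge_ts(state["replica_ts"], gossip_replica_ts)

        # 2. Apply updates that have become stable (prev <= value_ts) and were not
        #    executed before; processing in order of sum(prev) respects causality.
        for rec in sorted(state["log"], key=lambda r: sum(r["prev"])):
            if rec["uid"] not in state["executed"] and leq(rec["prev"], state["value_ts"]):
                apply_update(state["value"], rec["op"])
                state["value_ts"] = merge_ts(state["value_ts"], rec["ts"])
                state["executed"].add(rec["uid"])

        # 3. (Not shown) discard log records once the timestamp table shows that
        #    every replica manager has received them.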

Update Propagation:

  • Random: replica managers choose a partner to exchange gossip with at random
  • Topological: replica managers are arranged in a graph and gossip with their neighbours
  • The choice can be tuned to the application

Discussion:

  • The lazy approach makes gossip inappropriate for near-real-time requirements
  • Scalability is poor: as the number of replica managers grows, so does the number of gossip messages
  • Making some replica managers read-only can improve scalability

Bayou and the Operational Transformation Approach:

  • Replica managers exchange updates in pairs
  • Users may perform any update anywhere; replica managers later merge the updates and resolve conflicts using application-specific rules (e.g. the boss's appointments take priority)
  • Updates are marked tentative while they may still be reordered; eventually they become committed and their order is fixed
  • Before applying an update, a replica manager performs a dependency check; if it indicates a conflict, an application-supplied merge procedure is applied (see the sketch below)
  • Replication is not transparent to the application
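
A sketch of the Bayou-style pattern, using a hypothetical calendar-booking example (all names illustrative): every write carries an application-supplied dependency check and merge procedure.

    def dependency_check(state, slot):
        return slot in state["bookings"]            # conflict if the slot is already taken

    def merge_procedure(state, slot, who, alternatives):
        for alt in alternatives:                    # application-specific resolution:
            if alt not in state["bookings"]:        # fall back to the requester's alternatives
                state["bookings"][alt] = who
                return alt
        return None                                 # unresolvable; the update has no effect

    def apply_write(state, slot, who, alternatives=()):
        if dependency_check(state, slot):
            return merge_procedure(state, slot, who, alternatives)
        state["bookings"][slot] = who
        return slot

    state = {"bookings": {}}
    apply_write(state, "10:00", "boss")                                     # the boss goes in first
    print(apply_write(state, "10:00", "alice", alternatives=("11:00",)))    # resolved to '11:00'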

Coda Filesystem:

  • Focus on high availability, including support for disconnected operation
  • The lack of replication in AFS was a problem for widely accessed shared files
  • Provides users with a shared file repository and lets them fall back on local resources when the repository is inaccessible
  • Relies on replication of file volumes to achieve higher throughput
  • Vice processes play the same role as replica managers
  • Venus processes are a mix of front end and replica manager
  • VSG, Volume Storage Group: the servers holding replicas of a file volume; the AVSG is the subset of the VSG that a client can currently reach
  • Changes are notified to clients via callbacks, and on close, modified files are broadcast in parallel to all the servers in the AVSG
  • The Coda client caches entire files
  • CVV, Coda Version Vector: a vector timestamp with one entry per server counting the updates that server has seen (compared as sketched below)
  • An open in Coda fetches the most recent copy of the file from the current AVSG; if no server is accessible, a cached copy is used
  • Venus (client) processes must periodically check whether their AVSG has shrunk or grown
  • Performance decreases somewhat as the number of users grows and when n-fold replication is enabled
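
A sketch of comparing two Coda version vectors to decide whether one replica strictly dominates the other or a conflict requires manual or application-level repair (function name illustrative):

    def compare_cvv(v1, v2):
        # Each entry counts the updates a given server has seen for this file.
        ge = all(a >= b for a, b in zip(v1, v2))
        le = all(a <= b for a, b in zip(v1, v2))
        if ge and le:
            return "same"
        if ge:
            return "v1 newer"       # v1 dominates: the other replica can be updated automatically
        if le:
            return "v2 newer"
        return "conflict"           # neither dominates: needs manual or application-level repair

    print(compare_cvv([2, 2, 1], [1, 1, 1]))   # v1 newer
    print(compare_cvv([2, 1, 1], [1, 2, 1]))   # conflict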