
Chapter 5: Replication

Replication means keeping a copy of the same data on multiple machines that are connected via a network.

Why replicate data?

  • Keep data geographically close to users, for example: CDN, edge cache (reduce latency)

  • Increase availability if some part of system failed.

  • Increase read throughput.

Common Topics

  • Synchronous vs Asynchronous Replication

  • How to handle failed replicas

Single Leader

  1. One of the replicas is designated the leader (also called master or primary). Writes to the database must be sent to the leader.

  2. The other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys). The leader sends the data change to all of its followers as part of a replication log or change stream.

  3. Reads can query the leader or any of the followers, while writes are only accepted by the leader.

Examples:

  • MySQL, MongoDB, Kafka, RabbitMQ
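The leader/follower flow above can be sketched in a few lines — a toy in-memory model with hypothetical `Leader`/`Follower` classes, while real systems ship the log asynchronously over a network:

```python
# Toy sketch of single-leader replication. Class and field names are
# illustrative, not any real database's API.

class Follower:
    def __init__(self):
        self.data = {}
        self.log_position = 0

    def apply(self, entry):
        # Apply one change from the leader's replication log.
        key, value = entry
        self.data[key] = value
        self.log_position += 1

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []            # replication log / change stream
        self.followers = followers

    def write(self, key, value):
        # All writes go through the leader, which ships each change
        # to every follower.
        self.data[key] = value
        self.log.append((key, value))
        for f in self.followers:
            f.apply((key, value))

f1, f2 = Follower(), Follower()
leader = Leader([f1, f2])
leader.write("user:1", "alice")
print(f1.data["user:1"])  # → alice (read served by a follower)
```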

Sync vs Async replication

| | Sync | Async |
| --- | --- | --- |
| Freshness | Up-to-date copy of the data | Stale data |
| Blocking | Blocks the write if a follower doesn't respond or crashes | Non-blocking |
| Durability | Strong durability guarantee | Weakened durability if the leader fails without recovery |

If you enable synchronous replication on a database, it usually means that one of the followers is synchronous and the others are asynchronous (semi-synchronous). This guarantees an up-to-date copy of the data on at least two nodes.

Weakened durability might sound bad, but asynchronous replication is widely used.
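A toy sketch of the semi-synchronous write path described above (function and variable names are hypothetical; a real system blocks on a network acknowledgement rather than a list append):

```python
# Semi-synchronous replication sketch: the write is acknowledged once the
# leader and ONE designated synchronous follower have the entry; the
# remaining followers receive it asynchronously.
def semi_sync_write(leader_log, sync_follower_log, async_queues, entry):
    leader_log.append(entry)
    sync_follower_log.append(entry)   # in reality: block until confirmed
    for queue in async_queues:
        queue.append(entry)           # in reality: delivered later, not awaited
    return "acknowledged"             # entry now exists on at least two nodes

leader, sync_f, async_f = [], [], []
status = semi_sync_write(leader, sync_f, [async_f], ("k", "v"))
```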

Setting up new follower

Copying data files from one node to another is typically not sufficient.

Setting up a follower can usually be done without downtime. The process looks like:

  1. Take a snapshot of the leader’s database

  2. Copy the snapshot to the follower node

  3. Follower requests data changes that have happened since the snapshot was taken

  4. Once the follower has processed the backlog of data changes since the snapshot, it has caught up.
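The four steps can be sketched with a log of (key, value) changes — a simplification, since real snapshots are file copies associated with an exact log position:

```python
# Sketch of the snapshot-and-catch-up procedure. `leader_log` is a list of
# (key, value) changes; the snapshot is taken together with the log position
# it corresponds to (both are hypothetical simplifications).
def bootstrap_follower(leader_log, snapshot, snapshot_position):
    follower = dict(snapshot)                      # steps 1-2: copy the snapshot
    for key, value in leader_log[snapshot_position:]:
        follower[key] = value                      # steps 3-4: replay the backlog
    return follower                                # follower has caught up

log = [("a", 1), ("b", 2), ("b", 3)]
follower = bootstrap_follower(log, {"a": 1}, 1)    # snapshot taken after entry 0
print(follower)  # → {'a': 1, 'b': 3}
```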

Handling node outages

How does high availability work with leader-based replication?

follower failure

On its local disk, the follower keeps a log of data changes received from the leader. It can reconnect to the leader and request all data changes that occurred during its downtime.

leader failure

One of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader and followers need to start consuming data changes from the new leader.

Automatic failover consists of:

  1. Determining that the leader has failed. If a node does not respond for a period of time, it's considered dead.

  2. Choosing a new leader. The best candidate for leadership is usually the replica with the most up-to-date changes from the old leader.

  3. Reconfiguring the system to use the new leader. The system needs to ensure that the old leader becomes a follower and recognises the new leader.
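Steps 1 and 2 above can be sketched as follows (the 30-second timeout and the position map are illustrative assumptions):

```python
# Sketch of failover: declare the leader dead after a heartbeat timeout,
# then promote the follower with the most up-to-date log position.
def leader_is_dead(last_heartbeat, now, timeout=30.0):
    return now - last_heartbeat > timeout          # step 1: failure detection

def choose_new_leader(follower_positions):
    # step 2: the most caught-up replica minimizes lost writes
    return max(follower_positions, key=follower_positions.get)

now = 1000.0
assert leader_is_dead(last_heartbeat=960.0, now=now)          # 40s of silence
print(choose_new_leader({"f1": 118, "f2": 121, "f3": 120}))   # → f2
```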

Things that could go wrong:

  • When a new leader is chosen after the failure of the old leader, there are two potential issues:

    1. Incomplete Writes: The new leader may not have received all the writes that were originally made to the old leader before it failed. This can lead to missing or inconsistent data if the new leader starts processing requests before catching up with all the previous writes.

    2. Conflicting Writes: During the leader transition, there's a possibility that both the old leader and the new leader receive writes concurrently. If these writes conflict (e.g., updating the same piece of data), it can result in data inconsistencies.

To handle conflicting writes, the system needs to have mechanisms in place to resolve these conflicts. One common approach is to timestamp each write and choose the write with the highest timestamp as the valid one. Another approach is to prioritize writes based on some predefined rules. Conflict resolution can also involve human intervention, especially in cases where automated resolution might not be sufficient.

In some cases, synchronous replication might be used, where writes are not acknowledged until they have been replicated to a certain number of nodes. This can help prevent data loss during leader transitions.

  • Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents.

Implementing two-phase commits or transactional mechanisms can ensure that changes across multiple storage systems are coordinated and atomic. Two-phase commit protocols involve a distributed transaction coordinator that ensures all involved systems commit or rollback changes together. This helps maintain data integrity across different systems.

  • It could happen that two nodes both believe that they are the leader (split brain). Data is likely to be lost or corrupted.

To prevent split brain scenarios, distributed systems often use quorum-based approaches. Quorums ensure that a majority of nodes must agree on the status of the leader before any decisions are made. This prevents the formation of multiple leaders simultaneously. For instance, a majority of nodes (N/2 + 1) must agree on a leader to avoid split brain scenarios in a cluster of N nodes.
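The majority rule is simple enough to state in code — a sketch of the vote-counting check only, not a full election protocol:

```python
# Quorum check sketch: a candidate may become leader only with a strict
# majority of votes, so two network partitions can never both elect one.
def has_quorum(votes, cluster_size):
    return votes >= cluster_size // 2 + 1

# In a 5-node cluster split 3/2, only the 3-node side can elect a leader:
print(has_quorum(3, 5), has_quorum(2, 5))  # → True False
```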

  • What is the right timeout before the leader is declared dead?

Implementing various timeout mechanisms can help in determining leader failure. For example, a leader heartbeat mechanism where nodes regularly send signals to indicate their liveliness can be used. If other nodes detect a prolonged absence of these signals, they might initiate leader election. Additionally, quorum-based approaches can be employed to ensure that a leader is only declared dead when a majority of nodes agree on it.

Replication Logs

Statement replication

The leader logs every statement and sends it to its followers (every INSERT, UPDATE or DELETE)

Cons:

  • Non-deterministic functions such as NOW() or RAND() will generate different values on replicas.

  • Statements that depend on existing data, like auto-increments, must be executed in the same order in each replica.

  • Statements with side effects (triggers, stored procedures) may result in different outcome on each replica.

Solution: replace any non-deterministic function with a fixed return value in the leader.
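A sketch of that fix: the leader evaluates the non-deterministic calls once and ships literals. The naive string substitution here is purely illustrative — real databases rewrite the statement at the execution layer:

```python
import time, random

# Before logging a statement, replace non-deterministic calls with the
# concrete values the leader evaluated, so every replica executes
# identical literals.
def make_deterministic(statement):
    statement = statement.replace("NOW()", repr(time.time()))
    statement = statement.replace("RAND()", repr(random.random()))
    return statement

stmt = make_deterministic("INSERT INTO events VALUES (RAND(), NOW())")
# every replica now applies the same fixed values
```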

Write-ahead log (WAL)

The log is an append-only sequence of bytes containing all writes to the database. We can use the exact same log to build a replica on another node. The leader writes the log to disk and sends it across the network to its followers.

Examples: PostgreSQL, Oracle.

Cons:

The log describes the data at a very low level (which bytes were changed in which disk blocks), making it tightly coupled to the storage engine. It's not possible to run different software versions on the leader and followers, hence zero-downtime upgrades of the database are impossible.

Logical (row-based) log replication - Change Data Capture

A sequence of records describing writes to database tables at the granularity of a row.

  • For an inserted row, the new values of all columns.

  • For a deleted row, the log contains information that uniquely identifies the row: the primary key, or the old values of all columns.

  • For an updated row, the new values of the changed columns.
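The three record shapes can be mirrored with hypothetical dict-based log records, applied to a table keyed by primary key:

```python
# Hypothetical row-based log records applied to a table modeled as a dict
# keyed by primary key. The record format is illustrative.
def apply_record(table, rec):
    if rec["op"] == "insert":
        table[rec["row"]["id"]] = rec["row"]     # new values of all columns
    elif rec["op"] == "update":
        table[rec["key"]].update(rec["new"])     # new values of changed columns
    elif rec["op"] == "delete":
        del table[rec["key"]]                    # identified by primary key

users = {}
apply_record(users, {"op": "insert", "row": {"id": 7, "name": "alice"}})
apply_record(users, {"op": "update", "key": 7, "new": {"name": "bob"}})
print(users)  # → {7: {'id': 7, 'name': 'bob'}}
```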

Pros:

  • The logical log is decoupled from storage engine internals, so it's easier to keep backwards compatible.

  • Easier for external systems to parse, e.g. data warehouses for offline analysis, custom indexes, and caches.

Trigger-based replication

A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs. It's possible to log the change into a separate table, which can then be read by an external process.

Example: Oracle's Databus, Bucardo for Postgres.

Cons:

It runs at the application layer and has greater overhead than other replication methods. It's more prone to bugs, but can be useful due to its flexibility.

Replication Lag

If an application reads from an asynchronous follower, it may see outdated information if the follower has fallen behind.

Reading your own writes (read-after-write consistency)

Read from the leader when:

  • The user may have modified the piece of information (e.g. their own profile).

  • The time since the latest update is short.
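A sketch of that routing rule (the 60-second window and the last-write tracking dict are illustrative assumptions):

```python
# Read-your-writes routing sketch: for a short window after a user's own
# write, serve their reads from the leader; otherwise any follower will do.
def route_read(user_id, last_write_at, now, window=60.0):
    last = last_write_at.get(user_id)
    if last is not None and now - last < window:
        return "leader"
    return "follower"

writes = {"u1": 100.0}
print(route_read("u1", writes, now=130.0))  # → leader (wrote 30s ago)
print(route_read("u2", writes, now=130.0))  # → follower (no recent write)
```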

Monotonic reads (read order)

A user first reads from a fresh replica, then from a stale replica. Time appears to go backward.

Monotonic Reads is a guarantee that "moving back in time" anomaly does not happen.

Make sure the user always makes their reads from the same replica. The replica can be chosen based on a hash of the user ID.
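The hash-based routing can be sketched as follows (replica names are hypothetical):

```python
import hashlib

# Monotonic-reads sketch: hashing the user ID pins each user to one replica,
# so successive reads can't jump from a fresh replica to a stale one.
def pick_replica(user_id, replicas):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-0", "replica-1", "replica-2"]
first = pick_replica("user-42", replicas)
assert pick_replica("user-42", replicas) == first   # always the same replica
```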

Consistent prefix reads (write order)

If some partitions are replicated slower than others, an observer may see the answer before they see the question.

This only happens in sharded (partitioned) databases.

Solution: writes that are causally related to each other are written to the same partition.

Solution for Replication Lag: Transactions

Multi-Leader Replication

Allow more than one node to accept writes. Each node that processes a write forwards the data change to all the other nodes.

Use Cases

Multi-datacenter operation

Have a leader in each datacenter. Each datacenter's leader replicates its changes to the leaders in the other datacenters.

| | Single-leader configuration | Multi-leader configuration |
| --- | --- | --- |
| Performance | Writes must go over the internet to the datacenter with the leader, adding significant latency if that datacenter is not local. | Writes can be processed in the local datacenter and replicated asynchronously to the other datacenters; performance is better. |
| Tolerance of datacenter outage | Failover must promote a follower in another datacenter to be the new leader. | No impact: replication catches up when the failed datacenter comes back online. |

Clients with offline operation

If you have an application that needs to continue to work while it's offline, every device with a local database acts as a leader, and there is asynchronous multi-leader replication between devices. (Imagine a calendar app.)

CouchDB is designed for this mode of operation.

Collaborative editing

When one user edits a document, the changes are instantly applied to their local replica and asynchronously replicated to other users.

Google Docs allows multiple people to concurrently edit a text document or spreadsheet, using automatic conflict resolution.

For faster collaboration, you may want to avoid locking, but that brings the challenges of multi-leader replication, including the need for conflict resolution.

Handling Write Conflicts

Sync vs Async conflict detection

If you want synchronous conflict detection, you might as well just use single-leader replication.

On multi-leader, if both writes succeed, the conflict is only detected asynchronously at some later time, when it may be too late for the user to resolve it.

Conflict Avoidance

The simplest strategy is to avoid conflict. If all writes for a particular record go through the same leader, then conflicts cannot occur.

In an application where a user can edit their own data, you can ensure that requests from a particular user are always routed to the same datacenter and use that datacenter's leader for reads and writes.

Converging towards a consistent state

On single-leader, the last write determines the final value.

On multi-leader, it's not clear what the final value should be.

Different ways of achieving convergent conflict resolution:

  • Last write wins: Give each write a unique ID (timestamp, UUID, or a hash of the key and value), pick the write with the highest ID as the winner, and throw away the other writes. This is prone to data loss.

  • Higher replica wins: Give each replica a unique ID and let writes from the higher-numbered replica take precedence. Same as above, prone to data loss.

  • Merge values: Order them alphabetically and concatenate them, for example: B/C

  • Record conflict in an explicit data structure: Preserve all information and write application code to resolve conflict, perhaps prompting user.
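Two of the strategies above can be sketched directly (illustrative only; real systems attach the unique ID at write time):

```python
# Last write wins: pick the write with the highest unique ID; all the
# other concurrent writes are silently lost.
def last_write_wins(writes):
    # writes: list of (unique_id, value) pairs
    return max(writes)[1]

# Merge values: order alphabetically and concatenate.
def merge_values(values):
    return "/".join(sorted(values))

print(last_write_wins([(3, "B"), (5, "C"), (1, "A")]))  # → C
print(merge_values(["C", "B"]))                         # → B/C
```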

Custom conflict resolution logic

Multi-leader replication tools let you write conflict resolution logic using application code.

  • On write: As soon as database system detects a conflict in the log of replicated changes, it calls the conflict handler. Ex: Bucardo

  • On read: All the conflicting writes are stored. On read, multiple versions of the data are returned to the application. The application may prompt the user or automatically resolve the conflict. Ex: CouchDB

Automatic conflict resolution

  • Conflict-free replicated datatypes (CRDTs): a family of data structures for sets, maps, ordered lists, counters, etc. that can be concurrently edited by multiple users and automatically resolve conflicts in sensible ways. CRDTs have been implemented in Riak 2.0.

  • Mergeable persistent data structures: track history explicitly and use a three-way merge function, similar to Git.

  • Operational transformation: the conflict resolution algorithm behind collaborative editing applications like Google Docs.
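As one concrete (and deliberately tiny) example of the CRDT idea, here is a grow-only counter sketch — not Riak's implementation, just the core merge rule:

```python
# Grow-only counter (G-counter) CRDT sketch: each replica increments only
# its own slot, and merge takes the per-replica maximum, so merging in any
# order converges to the same state without conflicts.
def increment(counter, replica_id):
    counter[replica_id] = counter.get(replica_id, 0) + 1

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

a, b = {}, {}
increment(a, "r1"); increment(a, "r1")   # replica r1 counts two events
increment(b, "r2")                       # replica r2 counts one, concurrently
merged = merge(a, b)
print(value(merged))  # → 3
assert merge(merged, b) == merged        # re-merging changes nothing (idempotent)
```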

