Redundancy and Replication - FAANG System Design Interview Guide

🔷 1. Redundancy (Core Idea)

🧠 Definition (Interview Opening)

"Redundancy is duplicating critical components to eliminate single points of failure and improve reliability."


⚡ Why it matters

  • Prevent data loss
  • Ensure system keeps running
  • Improve availability
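
To put a number on "improve availability", here is a back-of-the-envelope sketch. It assumes every copy fails independently, which real systems only approximate (shared racks, power, and deploys correlate failures). The function name and the 99% figure are illustrative, not from any real system.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of a service that stays up while at least one copy is up.

    Assumption: copies fail independently of each other (an idealization).
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability in [0, 1]")
    if copies < 1:
        raise ValueError("copies must be >= 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under this idealized model, each extra copy of a 99%-available component adds roughly two more nines.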

🎯 Analogy

💾 Google Docs:

  • Your doc saved on multiple servers → one server fails → no data loss

🎤 FAANG Question

Q: Why is redundancy important? A:

"It removes single points of failure and ensures high availability during failures."


🚨 2. Single Point of Failure (SPOF)

👉 Anything whose failure = system down

Example

  • One DB → crash = system down ❌
  • Replicated DB → failover ✅

🎤 FAANG Question

Q: How do you remove SPOF? A:

"By introducing redundancy via replication, multiple instances, and failover mechanisms."


🔷 3. Replication (Core Concept)

🧠 Definition

"Replication is copying data from a primary node to one or more replicas to improve availability, fault tolerance, and scalability."


🔧 Basic Flow

Primary (write) → Replicas (read)
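
The flow above can be sketched with toy in-memory nodes. This is only an illustration: `Node` and `Primary` are hypothetical classes, and copying to replicas inside `write` skips everything a real database does with logs and networks.

```python
from typing import Optional


class Node:
    """A toy in-memory key-value store standing in for a database node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)


class Primary(Node):
    """Accepts writes and copies each one to every replica."""

    def __init__(self, name: str, replicas: list) -> None:
        super().__init__(name)
        self.replicas = replicas

    def write(self, key: str, value: str) -> None:
        self.apply(key, value)          # 1. the write lands on the primary
        for replica in self.replicas:   # 2. the change is copied outward
            replica.apply(key, value)


replicas = [Node("replica-1"), Node("replica-2")]
primary = Primary("primary", replicas)
primary.write("user:1", "alice")
print(replicas[0].read("user:1"))  # reads are served by a replica
```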

⚡ Why Replication?

  • High availability
  • Read scaling
  • Disaster recovery

🎤 FAANG Question

Q: Why use replication? A:

"To improve availability, scale reads, and ensure durability of data."


⚙️ 4. Types of Replication (VERY IMPORTANT)


🟢 1. Synchronous Replication (Strong Consistency)

🧠 Idea

"Write succeeds only after replicas confirm"


✅ Pros

  • Strong consistency
  • No loss of acknowledged writes if the primary fails

❌ Cons

  • High latency
  • Slower writes
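
A minimal sketch of the "write succeeds only after replicas confirm" rule. The `Replica` class and `sync_write` function are hypothetical, and the boolean return value stands in for a network acknowledgement. It also shows the availability cost: one unreachable replica is enough to fail the write.

```python
class Replica:
    def __init__(self, reachable: bool = True) -> None:
        self.reachable = reachable
        self.data = {}

    def replicate(self, key, value) -> bool:
        """Apply the write and return True as the ack; False if unreachable."""
        if self.reachable:
            self.data[key] = value
        return self.reachable


def sync_write(primary_data: dict, replicas: list, key, value) -> bool:
    """Report success only if every replica acknowledged the write."""
    acks = [replica.replicate(key, value) for replica in replicas]
    if not all(acks):
        # A real system would also retry or roll back; omitted in this sketch.
        return False
    primary_data[key] = value
    return True


primary_data = {}
print(sync_write(primary_data, [Replica(), Replica()], "k", "v"))
print(sync_write({}, [Replica(), Replica(reachable=False)], "k", "v"))
```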

🎯 Analogy

✍️ Writing + photocopy instantly before submitting


🎤 FAANG Question

Q: When would you use synchronous replication? A:

"When consistency is critical, like banking systems."


🔵 2. Asynchronous Replication (Eventual Consistency)

🧠 Idea

"Primary writes first, replicas update later"


✅ Pros

  • Fast writes
  • High availability

❌ Cons

  • Data lag
  • Possible data loss
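
Both cons can be seen in a small sketch. `AsyncPrimary` is a hypothetical class, and `ship` stands in for a background replication process that is called by hand here to keep the example deterministic.

```python
from collections import deque


class AsyncPrimary:
    def __init__(self) -> None:
        self.data = {}
        self.backlog = deque()  # changes not yet shipped to the replica

    def write(self, key, value) -> bool:
        self.data[key] = value
        self.backlog.append((key, value))
        return True  # acknowledged before any replica has the change

    def ship(self, replica_data: dict, max_changes: int = 1) -> None:
        """Background step: apply up to max_changes queued writes, in order."""
        for _ in range(min(max_changes, len(self.backlog))):
            key, value = self.backlog.popleft()
            replica_data[key] = value


primary, replica = AsyncPrimary(), {}
primary.write("balance", 100)
primary.write("balance", 80)
primary.ship(replica)          # only the first change has arrived
print(replica["balance"])      # stale: the replica still sees the old value
print(len(primary.backlog))    # changes that are lost if the primary dies now
```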

🎯 Analogy

📩 WhatsApp message → delivered later


🎤 FAANG Question

Q: What is the risk of async replication? A:

"Replica lag can cause stale reads or data loss on failure."


🟡 3. Semi-Synchronous (Balanced 🔥)

🧠 Idea

"Wait for at least one replica, others async"


✅ Pros

  • Better consistency than async
  • Better performance than sync
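
A rough sketch of "wait for at least one replica, others async". All names are hypothetical. Real implementations typically send to all replicas in parallel and return on the first ack; this version tries replicas in order so the result is deterministic.

```python
class Replica:
    def __init__(self, reachable: bool = True) -> None:
        self.reachable = reachable
        self.data = {}

    def replicate(self, key, value) -> bool:
        if self.reachable:
            self.data[key] = value
        return self.reachable


def semi_sync_write(primary_data, replicas, pending, key, value, min_acks=1):
    """Acknowledge once min_acks replicas confirm; queue the rest for later."""
    primary_data[key] = value
    acks = 0
    for replica in replicas:
        if acks < min_acks and replica.replicate(key, value):
            acks += 1  # confirmed synchronously
        else:
            pending.append((replica, key, value))  # catches up asynchronously
    return acks >= min_acks


r1, r2, r3 = Replica(reachable=False), Replica(), Replica()
pending = []
ok = semi_sync_write({}, [r1, r2, r3], pending, "k", "v")
print(ok, len(pending))
```

Here the write is acknowledged as soon as `r2` confirms, while `r1` and `r3` are left in `pending`.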

🎤 FAANG Question

Q: Why is semi-sync used? A:

"To balance consistency and latency in production systems."


⚖️ 5. Comparison (Must Remember)

Type        Consistency   Latency   Data Loss
Sync        Strong        High      No
Async       Weak          Low       Possible
Semi-sync   Medium        Medium    Low
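
The latency column can be made concrete with simple arithmetic. This sketch assumes the primary contacts all replicas in parallel, so sync waits for the slowest replica and semi-sync for the fastest. The function name and the millisecond figures are made up for illustration.

```python
def write_latency_ms(local_ms: float, replica_rtts_ms: list, mode: str) -> float:
    """Client-visible write latency under each replication mode."""
    if mode == "sync":
        return local_ms + max(replica_rtts_ms)   # wait for the slowest replica
    if mode == "semi-sync":
        return local_ms + min(replica_rtts_ms)   # wait for the fastest replica
    if mode == "async":
        return local_ms                          # wait for no replica
    raise ValueError(f"unknown mode: {mode}")


rtts = [5, 40, 120]  # hypothetical round trips: same zone, same region, remote
for mode in ("sync", "semi-sync", "async"):
    print(mode, write_latency_ms(2, rtts, mode))
```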

🔥 6. Additional CRITICAL Concepts


🧠 A. Read vs Write Scaling

  • Writes → Primary
  • Reads → Replicas

👉 "Scale reads horizontally"
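
A minimal sketch of the read/write split, assuming a hypothetical `Router` that picks a node name per statement and spreads reads round-robin. Detecting reads by a `SELECT` prefix is deliberately naive; it would misroute statements such as `SELECT ... FOR UPDATE`.

```python
import itertools


class Router:
    def __init__(self, primary: str, replicas: list) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # round-robin over replicas

    def route(self, sql: str) -> str:
        """Return the name of the node that should run this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


router = Router("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM posts"))
print(router.route("INSERT INTO posts VALUES (1)"))
```

Adding a replica to the list adds read capacity without touching the write path.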


🧠 B. Replica Lag (VERY IMPORTANT)

👉 Delay between primary & replica

Problem:

  • User sees stale data

🎤 FAANG Question

Q: How do you handle stale reads? A:

"Use read-after-write consistency or route critical reads to primary."

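One simple way to approximate read-after-write consistency is to send a user's reads to the primary for a short window after that user writes. The sketch below assumes the window is longer than typical replica lag. The class name, the one-second default, and the injectable clock are illustrative choices.

```python
import time


class ReadYourWritesRouter:
    def __init__(self, lag_window_s: float = 1.0, clock=time.monotonic) -> None:
        self.lag_window_s = lag_window_s  # assumed upper bound on replica lag
        self.clock = clock
        self._last_write = {}             # user_id -> time of last write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.lag_window_s:
            return "primary"  # the replica may not have this user's write yet
        return "replica"


now = [0.0]  # fake clock so the example is deterministic
router = ReadYourWritesRouter(lag_window_s=1.0, clock=lambda: now[0])
router.record_write("alice")
print(router.target_for_read("alice"), router.target_for_read("bob"))
now[0] = 2.0
print(router.target_for_read("alice"))
```

Only the writer pays the cost of hitting the primary; other users keep reading from replicas.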

🧠 C. Failover (VERY IMPORTANT)

👉 When primary fails:

  • Promote replica → new primary

Types:

  • Manual
  • Automatic (preferred)
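
The promotion step of automatic failover can be sketched as "pick the healthy replica that is most caught up". `ReplicaState` and `choose_new_primary` are hypothetical names. A real failover also has to detect the failure, fence off the old primary, and redirect clients, none of which is shown here.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    healthy: bool
    applied_position: int  # how far into the primary's change log it has applied


def choose_new_primary(replicas: list) -> ReplicaState:
    """Promote the healthy replica with the most replicated data."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica to promote")
    return max(candidates, key=lambda r: r.applied_position)


fleet = [
    ReplicaState("replica-a", healthy=True, applied_position=90),
    ReplicaState("replica-b", healthy=False, applied_position=100),
    ReplicaState("replica-c", healthy=True, applied_position=97),
]
print(choose_new_primary(fleet).name)
```

Choosing the most caught-up replica limits, but does not eliminate, the writes lost under async replication.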

🎤 FAANG Question

Q: What happens when primary fails? A:

"A replica is promoted to primary via failover mechanisms."


🧠 D. Multi-Leader / Leaderless (ADVANCED 🔥)

Single Leader (most common)

  • One write node

Multi-leader

  • Multiple writable nodes

Leaderless (e.g., Cassandra, Riak, and the original Dynamo design)

  • No primary
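
With more than one writable node, two nodes can accept different values for the same key, so something has to pick a winner. Below is a sketch of last-write-wins, one common and lossy strategy. The `Version` tuple and its fields are assumptions for illustration, and it relies on clocks being roughly in sync.

```python
from typing import NamedTuple


class Version(NamedTuple):
    value: str
    timestamp: float  # wall-clock time of the write (assumes synced clocks)
    leader_id: str    # which writable node accepted the write


def resolve_lww(a: Version, b: Version) -> Version:
    """Keep the later write; break timestamp ties by leader_id."""
    return max(a, b, key=lambda v: (v.timestamp, v.leader_id))


us = Version("Alice", timestamp=10.0, leader_id="us-east")
eu = Version("Alicia", timestamp=10.5, leader_id="eu-west")
print(resolve_lww(us, eu).value)  # the other write is silently discarded
```

The discarded write is the complexity cost: the user who made it was told it succeeded.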

🎤 FAANG Question

Q: Why not always multi-leader? A:

"Because it introduces conflict resolution complexity."


🧠 E. CAP Theorem Connection 🔥

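  • Sync → leans CP (rejects writes when replicas are unreachable, to stay consistent)
  • Async → leans AP (keeps accepting writes, but replicas may serve stale data)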

⚠️ 7. Trade-offs

Benefit             Cost
High availability   Complexity
Fault tolerance     Replication lag
Read scaling        Consistency issues

🚀 8. Real System Example (FAANG Thinking)

Example: Instagram

  • Primary DB → writes
  • Replicas → serve feed

👉 High read traffic → replicas handle load


🎤 9. FAANG Interview Script

Start

"To improve availability and scalability, I'll introduce database replication."


Explain

"Writes go to primary, and replicas serve read traffic."


Trade-off

"Async replication improves performance but introduces eventual consistency."


Add Depth

"We can use semi-sync to balance latency and consistency."


Failure Handling

"In case of failure, a replica will be promoted via failover."


Close Strong

"This removes the single point of failure and enables horizontal read scaling."


🧠 Final One-Line (Must Memorize)

"Replication improves availability and read scalability by duplicating data across nodes."


💡 FAANG-Level Insight (DIFFERENTIATOR)

"Replication is not just about backup: it's about scaling reads and surviving failures without downtime."