🔷 1. Why Heartbeats?

❗ Problem

Servers need to know:
- Who is alive
- Who has failed
Required for:
- Request routing
- Failover handling

✅ Solution

Servers send periodic heartbeat signals → "I am alive"

🔥 FAANG Question

Q: Why is failure detection critical in distributed systems? A: Without it, requests keep going to dead nodes. Detecting failures lets the system reroute traffic and start recovery quickly.

🧠 Script

"Heartbeats are used to detect failed nodes so the system can maintain availability."

🔷 2. What is a Heartbeat?

✅ Definition

A periodic signal/message sent by a server to indicate it is alive

🔁 Two Models

🏢 Centralized

All servers → send heartbeats to a central monitor

🌐 Decentralized (Gossip-style)

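Each server → periodically exchanges heartbeat state with a few randomly chosen peers, so liveness information spreads through the cluster without a central monitor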
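
As a rough illustration of the gossip-style model, the sketch below assumes each node keeps a table of heartbeat counters (node id → highest counter seen) and pushes it to random peers. The names `merge_views`, `gossip_round`, and `fanout` are hypothetical and not taken from any particular system.

```python
import random

def merge_views(local, remote):
    """Merge a peer's heartbeat table into ours, keeping the highest counter per node."""
    merged = dict(local)
    for node, counter in remote.items():
        if counter > merged.get(node, -1):
            merged[node] = counter
    return merged

def gossip_round(views, fanout=1, rng=random):
    """One round: every node bumps its own counter, then pushes its table to `fanout` random peers.

    `views` maps node id -> that node's heartbeat table (assumed layout for this sketch).
    """
    nodes = list(views)
    for node in nodes:
        views[node][node] = views[node].get(node, 0) + 1
        others = [n for n in nodes if n != node]
        for peer in rng.sample(others, k=min(fanout, len(others))):
            views[peer] = merge_views(views[peer], views[node])

# Usage: three nodes gossiping for a few rounds
views = {n: {} for n in ("a", "b", "c")}
for _ in range(5):
    gossip_round(views, fanout=1)
```

A node whose counter stops increasing in everyone's table for longer than the timeout would then be treated as failed.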

🔥 FAANG Question

Q: Centralized vs decentralized heartbeat? A: Centralized = simple, but the monitor is a single point of failure and can become a bottleneck. Decentralized = scalable with no single point of failure, but more complex.

🧠 Script

"Heartbeats can be centralized or distributed depending on system design."

🔷 3. How It Works

✅ Flow

- Each server sends a heartbeat at a fixed interval
- The receiver records the time of the last heartbeat from each server
- If no heartbeat arrives within the timeout → the node is marked as failed
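
The flow above can be sketched as a small monitor class. This is a minimal sketch, not a production detector: `HeartbeatMonitor`, `record_heartbeat`, and `failed_nodes` are hypothetical names, and the clock is injectable only so the example can run without sleeping.

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports nodes that timed out."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # monotonic clock avoids jumps from wall-clock changes
        self.last_seen = {}     # node id -> time of last heartbeat

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self):
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)

# Usage with a fake clock: "b" stops sending after t=0
t = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: t[0])
monitor.record_heartbeat("a")
monitor.record_heartbeat("b")
t[0] = 2.0
monitor.record_heartbeat("a")
t[0] = 4.0
print(monitor.failed_nodes())  # ['b']
```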

🔥 FAANG Question

Q: How does a system detect failure using heartbeats? A: If no heartbeat is received within a timeout → node is considered dead

🧠 Script

"If heartbeat stops beyond a timeout, the node is marked as failed."

🔷 4. After Failure Detection

✅ Actions Taken

- Stop sending traffic to the failed node
- Reassign its data/work to healthy nodes
- Trigger recovery (failover)
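
One way to picture these actions is the sketch below. It assumes work is tracked as a plain mapping of node id → list of work items and redistributes round-robin. The function name `handle_failure` and this data layout are illustrative assumptions, and real systems typically use replication or consistent hashing instead.

```python
def handle_failure(failed, healthy, assignments):
    """Drop `failed` from the routing set and spread its work over the survivors.

    Note: mutates and returns `assignments` (assumed shape: node id -> list of items).
    """
    healthy = [n for n in healthy if n != failed]       # stop routing to the failed node
    if not healthy:
        raise RuntimeError("no healthy nodes left to take over")
    orphaned = assignments.pop(failed, [])              # work that needs a new owner
    for i, item in enumerate(orphaned):
        target = healthy[i % len(healthy)]              # simple round-robin reassignment
        assignments.setdefault(target, []).append(item)
    return healthy, assignments

healthy, assignments = handle_failure(
    "b", ["a", "b", "c"], {"a": ["s1"], "b": ["s2", "s3"], "c": []}
)
print(healthy)      # ['a', 'c']
print(assignments)  # {'a': ['s1', 's2'], 'c': ['s3']}
```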

🔥 FAANG Question

Q: What happens after a node failure is detected? A: Traffic is rerouted and data/work is redistributed

🧠 Script

"After detecting failure, the system reroutes requests and recovers data."

🔷 5. Key Parameters

⚙️ Important Settings

- Heartbeat interval → how often signals are sent
- Timeout → how long to wait without a heartbeat before declaring failure (usually several intervals, so one lost heartbeat is tolerated)

❗ Trade-off

- Short interval and timeout → fast detection, but more overhead and more false positives
- Long interval and timeout → less overhead and fewer false positives, but slower detection
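
The trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch assumes a centralized monitor, one heartbeat per node per interval, and a timeout expressed as a number of missed heartbeats. `heartbeat_cost` and `missed_allowed` are hypothetical names used only for this illustration.

```python
def heartbeat_cost(num_nodes, interval_s, missed_allowed):
    """Return (timeout in seconds, heartbeat messages per second at the monitor)."""
    timeout_s = interval_s * missed_allowed   # roughly the worst-case detection delay
    msgs_per_s = num_nodes / interval_s       # load on a central monitor
    return timeout_s, msgs_per_s

print(heartbeat_cost(1000, 1.0, 3))   # (3.0, 1000.0)
print(heartbeat_cost(1000, 10.0, 3))  # (30.0, 100.0)
```

Under these assumptions, a 10x longer interval cuts monitor traffic by 10x but makes detection 10x slower.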

🔥 FAANG Question

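Q: How do you tune heartbeat systems? A: Balance detection speed against network overhead and false positives. Set the timeout to several heartbeat intervals so a single lost heartbeat does not trigger a failure.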

🧠 Script

"Heartbeat tuning balances fast failure detection with system overhead."

🔷 6. Challenges

❌ Issues

- Network delays or long process pauses → false failure detection
- Network partitions → healthy nodes look dead to each other
- Scaling overhead → heartbeat traffic grows with cluster size
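
A common way to reduce false positives from a single bad network path is to require agreement from several observers before declaring a node dead. The sketch below assumes each observer reports whether it has timed the node out and uses a simple majority. `is_confirmed_dead` and the report format are hypothetical.

```python
def is_confirmed_dead(reports, quorum=None):
    """reports: observer id -> True if that observer has timed the node out.

    Defaults to a simple majority of observers (an assumption for this sketch).
    """
    if quorum is None:
        quorum = len(reports) // 2 + 1
    return sum(reports.values()) >= quorum

# One observer behind a slow link is not enough to declare the node dead
print(is_confirmed_dead({"m1": True, "m2": False, "m3": False}))  # False
print(is_confirmed_dead({"m1": True, "m2": True, "m3": False}))   # True
```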

🔥 FAANG Question

Q: What is a false positive in heartbeat systems? A: When a node is alive but appears dead due to network delay

🧠 Script

"Network issues can cause false failure detection in heartbeat systems."

🔷 7. Real-World Use Cases

✅ Used In

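- Cluster management
- Distributed DBs → Apache Cassandra (gossip-based failure detection)
- Container orchestration → Kubernetes (kubelet node heartbeats to the control plane)
- Load balancers (health checks on backend servers)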

🔥 FAANG Question

Q: Where are heartbeats used in real systems? A: In clusters, databases, and orchestration tools to monitor node health

🧠 Script

"Heartbeats are widely used to monitor node health in distributed systems."

🔷 8. Interview Gold Points (Often Missed)

⭐ Important

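- Used with leader election (followers detect a dead leader through missed heartbeats)
- Enables failover and automatic replacement of unhealthy nodes (e.g. in auto-scaling groups)
- Often combined with gossip protocols
- Not 100% accurate → detection is probabilistic (a node is suspected, not proven, dead)
- Helps limit cascading failures by taking failed nodes out of rotation quickly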
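
To illustrate the "probabilistic" point, the sketch below computes a suspicion score instead of a yes/no answer, in the spirit of accrual failure detectors. It is a simplified version that assumes heartbeat inter-arrival times are exponentially distributed, so P(no heartbeat yet after t) = exp(-t / mean) and phi = -log10 of that probability. The class name, method names, and `window` parameter are hypothetical.

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Simplified accrual-style detector: higher phi means stronger suspicion."""

    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent gaps between heartbeats
        self.last = None

    def heartbeat(self, now):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        if not self.intervals:
            return 0.0                         # not enough history to judge
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last
        # -log10(exp(-elapsed / mean)) == elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

detector = PhiAccrualDetector()
for ts in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(ts)
suspected = detector.phi(25.0) > 8.0   # threshold of 8 is an illustrative choice
```

The caller picks a phi threshold: a higher threshold means fewer false positives but slower detection.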

🔥 FAANG Question

Q: Why are heartbeats not perfectly reliable? A: Because a slow network, a partition, or a paused process looks the same as a crashed node until the heartbeat finally arrives

🧠 Script

"Heartbeat-based detection is best-effort, not perfectly accurate."

🚀 Final 15-sec Interview Answer

"Heartbeats are periodic signals sent by servers to indicate they are alive. If a heartbeat is not received within a timeout, the node is considered failed, and the system reroutes traffic and recovers data. Heartbeats can be centralized or decentralized and are widely used for failure detection in distributed systems, though they must be carefully tuned to balance detection speed and overhead."

Heartbeat - Failure Detection in Distributed Systems - FAANG Guide
