The Steps to Solve System Design Interviews - Complete FAANG Guide

🔷 PHASE 1: UNDERSTAND, SCOPE & CONSTRAINTS

1. Clarify Ambiguity + Define Scope

  • What EXACT system?
  • What to exclude?
  • Who are users?

💬 Say:

"Before jumping into design, I'd like to clarify scope to avoid assumptions. Are we focusing on core features like feed, posting, and interactions, or should we include messaging, ads, and notifications as well? For this discussion, I'll focus on News Feed, posts, likes, and follows, and exclude messaging and ads to keep scope manageable."


2. Functional Requirements

  • Core user actions only (5–7)

💬 Say:

"Let me list the core functional requirements: the key actions users perform in the system."

💬 Example:

  • Create post
  • Follow/unfollow users
  • View news feed
  • Like/comment
  • View profile

💬 Add:

"I'm focusing only on the most critical flows to keep the design focused."


3. Non-Functional Requirements

Core:

  • Scale (users, QPS)
  • Latency (p95/p99)
  • Availability (SLA)
  • Consistency (strong vs eventual)
  • Read vs write ratio

Advanced (often missed):

  • Durability
  • Fault tolerance
  • Reliability
  • Cost constraints

💬 Say:

"Now I'll define non-functional requirements; these drive most design decisions."

💬 Example:

"The system should support ~100M DAU. I expect a read-heavy workload with a ~10:1 read/write ratio. p99 latency should be under 200 ms for feed reads. The availability target is 99.99%. The feed can be eventually consistent, but post creation must be strongly consistent. The system must be fault-tolerant, durable, and cost-efficient."


4. Constraints & Assumptions

  • Unknowns → make assumptions explicit

💬 Say:

"Since some details are unspecified, I'll make a few reasonable assumptions and call them out."

💬 Example:

"I'll assume the average user follows 200 people and opens the feed 10 times/day. I'll refine these if needed."


🔷 PHASE 2: ESTIMATION & INTERFACE

5. Capacity Estimation

Traffic:

  • DAU
  • Requests per user
  • Actions per user
  • Peak vs avg QPS

Data:

  • Payload size
  • Storage/day/year

Bandwidth:

  • Ingress/egress

Growth:

  • 2–3 year projection

💬 Say:

"Now I'll do back-of-the-envelope calculations to estimate scale."

💬 Example:

"100M DAU × 10 feed loads = 1B requests/day. Spread over ~10⁵ seconds/day, that's ≈ 10K average QPS; budgeting ~10× headroom for peak gives ≈ 100K peak QPS.

Assuming ~2 posts per user per day: 200M posts/day × 1KB = 200 × 10⁹ bytes ≈ 200 GB/day.

Over 1 year: 200 GB × 365 ≈ 73 TB/year.

Media is much larger, so we store it in object storage and serve via CDN."
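
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the `estimate` function and its parameter names are hypothetical, and the 10x peak factor is an assumption chosen to land near the ~100K peak figure, not a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def estimate(dau, feed_loads_per_user, posts_per_day, post_bytes, peak_factor):
    """Return rough traffic and storage figures for a feed system."""
    reads_per_day = dau * feed_loads_per_user
    avg_qps = reads_per_day / SECONDS_PER_DAY
    storage_per_day = posts_per_day * post_bytes  # bytes of post text per day
    return {
        "reads_per_day": reads_per_day,
        "avg_read_qps": avg_qps,
        "peak_read_qps": avg_qps * peak_factor,  # peak_factor is an assumption
        "storage_gb_per_day": storage_per_day / 1e9,
        "storage_tb_per_year": storage_per_day * 365 / 1e12,
    }

result = estimate(
    dau=100_000_000,
    feed_loads_per_user=10,
    posts_per_day=200_000_000,
    post_bytes=1_000,
    peak_factor=10,
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

The exact average is about 11.6K QPS; rounding 86,400 seconds to 10⁵ is what gives the quicker ~10K figure used in the spoken version.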


🔢 Numbers Everyone Should Know (Latency)

💬 Say (VERY IMPORTANT):

"I'll use standard latency numbers to justify caching and storage decisions."

  • L1 cache reference: 0.5 ns
  • Main memory (RAM) reference: 100 ns
  • Read 1 MB sequentially from SSD: 1 ms
  • Send 1 MB over a 1 Gbps network: 10 ms
  • Cross-region round trip (RTT): 150 ms
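
To make the gaps concrete, the figures in the list can be compared directly. A small sketch: the dictionary keys and the `slowdown` helper are illustrative names, and the values are the rounded numbers from the list, not measurements.

```python
# Rounded latency figures from the list above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 1 MB from SSD": 1_000_000,
    "Send 1 MB over network": 10_000_000,
    "Cross-region round trip": 150_000_000,
}

def slowdown(operation, baseline="Main memory reference"):
    """How many times slower `operation` is than `baseline`."""
    return LATENCY_NS[operation] / LATENCY_NS[baseline]

for name in LATENCY_NS:
    print(f"{name}: {slowdown(name):,.3f}x a memory reference")
```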

💬 Add Insight:

"Since network calls are orders of magnitude slower than memory, caching is critical to reduce latency."


Conversions (MEMORIZE THESE)

Name       Power   Value
Thousand   10³     1K
Million    10⁶     1M
Billion    10⁹     1B
Trillion   10¹²    1T

💬 Say:

"I'll use powers of 10 for quick estimation."


6. API Design

  • REST/gRPC/GraphQL
  • Pagination
  • Idempotency
  • Versioning

💬 Say:

"Next, I'll define API contracts between client and backend."

💬 Example:

POST /v1/posts
GET /v1/feed?cursor=abc&limit=20
POST /v1/like
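
The `cursor` parameter on the feed endpoint can be an opaque token that encodes the last item the client saw. A rough sketch, assuming posts are sorted newest first and that a base64-encoded JSON cursor is acceptable; `encode_cursor`, `decode_cursor`, and `get_feed_page` are hypothetical helpers, not part of any real API.

```python
import base64
import json

def encode_cursor(created_at, post_id):
    """Pack the position of the last returned post into an opaque string."""
    raw = json.dumps({"t": created_at, "id": post_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor):
    data = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return data["t"], data["id"]

def get_feed_page(posts, cursor=None, limit=20):
    """`posts` must be sorted newest first by (created_at, id)."""
    if cursor is not None:
        position = decode_cursor(cursor)
        # Keep only posts strictly older than the cursor position.
        posts = [p for p in posts if (p["created_at"], p["id"]) < position]
    page = posts[:limit]
    has_more = len(posts) > limit
    next_cursor = (
        encode_cursor(page[-1]["created_at"], page[-1]["id"]) if has_more else None
    )
    return page, next_cursor

# Demo data: 50 posts, ids 50..1, newest first.
POSTS = [{"id": i, "created_at": 1000 + i} for i in range(50, 0, -1)]
first_page, cursor_1 = get_feed_page(POSTS, limit=20)
second_page, cursor_2 = get_feed_page(POSTS, cursor=cursor_1, limit=20)
third_page, cursor_3 = get_feed_page(POSTS, cursor=cursor_2, limit=20)
```

Unlike offset-based paging, a cursor keeps pages stable when new posts arrive at the top of the feed between requests.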

💬 Add (IMPORTANT):

"APIs are versioned for backward compatibility and designed to be idempotent where needed."

💬 Explain (interviewer loves this):

"Idempotency ensures retries don't create duplicate actions. For example, POST requests can use an idempotency key to prevent duplicate writes."

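One way to picture the idempotency key: the server remembers the response it produced for each key and replays it on a retry instead of writing again. This is an in-memory sketch only; `PostService` and its methods are hypothetical names, and a real service would keep the keys in a shared store with an expiry.

```python
import threading

class PostService:
    """Toy post store that deduplicates writes by idempotency key."""

    def __init__(self):
        self._posts = []
        self._responses_by_key = {}
        self._lock = threading.Lock()

    def create_post(self, idempotency_key, author_id, text):
        with self._lock:
            # A retry with a known key returns the original response.
            if idempotency_key in self._responses_by_key:
                return self._responses_by_key[idempotency_key]
            post = {"id": len(self._posts) + 1, "author_id": author_id, "text": text}
            self._posts.append(post)
            self._responses_by_key[idempotency_key] = post
            return post

    def post_count(self):
        return len(self._posts)

service = PostService()
first = service.create_post("key-1", author_id=7, text="hello")
retry = service.create_post("key-1", author_id=7, text="hello")  # client retry
other = service.create_post("key-2", author_id=7, text="second post")
```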

🔷 PHASE 3: HIGH-LEVEL DESIGN

7. High-Level Architecture

  • End-to-end flow
  • Service separation

💬 Say:

"Now I'll walk through the high-level architecture."

"Client → CDN → Load Balancer → API Gateway → Microservices → Cache → Database → Queue"


💬 Add:

"Services are stateless and horizontally scalable."


8. Data Model & Access Patterns

  • Entities
  • Query patterns (CRITICAL)

💬 Say:

"Before choosing databases, I'll define the data model and access patterns."

💬 Example:

"The most critical query is: 'Fetch latest posts from followed users sorted by time.'"


💬 Add Insight:

"Access patterns drive database choice, not the other way around."

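The critical query can be sketched against a toy data model to show what it demands of storage: per-author timelines already sorted by time, merged at read time. The `Post` fields, the `FOLLOWS` and `POSTS_BY_AUTHOR` tables, and the function name are all assumptions for illustration.

```python
import heapq
from dataclasses import dataclass
from itertools import islice

@dataclass(frozen=True)
class Post:
    post_id: int
    author_id: int
    created_at: int  # epoch seconds
    text: str

# user_id -> authors they follow (hypothetical sample data)
FOLLOWS = {1: {2, 3}}

# author_id -> that author's posts, newest first
POSTS_BY_AUTHOR = {
    2: [Post(12, 2, 300, "c"), Post(10, 2, 100, "a")],
    3: [Post(11, 3, 200, "b")],
}

def latest_posts_from_followed(user_id, limit):
    """Merge the followed authors' timelines, newest first."""
    timelines = [POSTS_BY_AUTHOR.get(a, []) for a in FOLLOWS.get(user_id, ())]
    # Each timeline is already sorted descending, so a k-way merge is enough.
    merged = heapq.merge(*timelines, key=lambda p: p.created_at, reverse=True)
    return list(islice(merged, limit))

feed = latest_posts_from_followed(user_id=1, limit=2)
print([p.post_id for p in feed])  # [12, 11]
```

The query needs two lookups, "who does this user follow" and "latest posts by author", which is why the follow graph and the post timelines can reasonably live in different stores.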

9. Database Selection


💬 Say:

"Based on access patterns, I'll choose appropriate databases."

💬 Example:

"Cassandra for feed (high write throughput, time-series data). Relational DB for user relationships. This is a polyglot persistence approach."


🔷 PHASE 4: DETAILED DESIGN (CORE OF INTERVIEW)

10. Core Component Deep Dive

💬 Say:

"Now I'll deep dive into the most complex component: the feed system."


💬 Example:

"I'll use a hybrid fan-out approach: Fan-out on write for normal users and fan-out on read for celebrities."

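A hybrid fan-out can be sketched in memory as follows. Everything here is an assumption for illustration: the tiny `CELEBRITY_THRESHOLD`, the dictionaries standing in for storage, and the function names. A production system would run the write-side fan-out asynchronously.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # assumed cutoff for the demo; real cutoffs are far larger

followers = defaultdict(set)         # author_id -> users following them
following = defaultdict(set)         # user_id -> authors they follow
inbox = defaultdict(list)            # user_id -> (created_at, post_id), pushed at write time
celebrity_posts = defaultdict(list)  # author_id -> (created_at, post_id), pulled at read time

def follow(user_id, author_id):
    followers[author_id].add(user_id)
    following[user_id].add(author_id)

def publish(author_id, post_id, created_at):
    entry = (created_at, post_id)
    if len(followers[author_id]) >= CELEBRITY_THRESHOLD:
        # Fan-out on read: store once, followers pull it later.
        celebrity_posts[author_id].append(entry)
    else:
        # Fan-out on write: push into every follower's precomputed feed.
        for follower_id in followers[author_id]:
            inbox[follower_id].append(entry)

def read_feed(user_id, limit=10):
    entries = list(inbox.get(user_id, []))
    for author_id in following.get(user_id, ()):
        entries.extend(celebrity_posts.get(author_id, []))
    entries.sort(reverse=True)  # newest first
    return [post_id for _, post_id in entries[:limit]]

for fan in (10, 11, 12):
    follow(fan, 1)            # author 1 reaches the celebrity threshold
follow(10, 2)                 # author 2 stays a normal user
publish(author_id=2, post_id=100, created_at=1)
publish(author_id=1, post_id=101, created_at=2)
print(read_feed(10))  # [101, 100]
```

Publishing as the celebrity touches one list regardless of follower count, while the normal author's post costs one write per follower.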

11. Caching Strategy (DEEPER)


💬 Say:

"Caching is critical to reduce latency and database load."

💬 Example:

"Redis stores precomputed feeds. Cache-aside pattern is used. TTL ensures freshness. Hot users may have shorter TTL."

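The cache-aside flow with a TTL looks roughly like this. The `TTLCache` class is an in-memory stand-in for Redis, and `get_feed` and `load_from_db` are hypothetical names; the injectable clock exists only to make expiry easy to demonstrate.

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-key expiry (stand-in for Redis)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)

def get_feed(user_id, cache, load_from_db, ttl_seconds=60):
    """Cache-aside: check the cache, fall through to the DB on a miss, then populate."""
    key = f"feed:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    feed = load_from_db(user_id)
    cache.set(key, feed, ttl_seconds)
    return feed

# Demo with a fake clock so expiry can be shown without sleeping.
now = [0.0]
db_calls = []

def load_from_db(user_id):
    db_calls.append(user_id)
    return [101, 100]

cache = TTLCache(clock=lambda: now[0])
get_feed(1, cache, load_from_db)   # miss: hits the DB
get_feed(1, cache, load_from_db)   # hit: served from cache
calls_before_expiry = len(db_calls)
now[0] = 61.0                      # move past the 60 s TTL
get_feed(1, cache, load_from_db)   # expired: hits the DB again
```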

💬 Add Insight:

"Cache helps avoid expensive DB and network calls."


12. Partitioning & Sharding


💬 Say:

"To handle scale, data must be partitioned."

💬 Example:

"Shard by user_id using consistent hashing. Handle hotspots by splitting heavy users."

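Consistent hashing can be illustrated with a small hash ring. This is a sketch under assumptions: the `HashRing` class, the shard names, and the choice of 100 virtual nodes per shard are illustrative, and SHA-256 is used only as a convenient stable hash.

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node, vnodes=100):
        # Each node owns many points on the ring to even out the load.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("shard-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a shard")
```

The property worth pointing out in an interview is that every key that moved now lives on the new shard; keys never shuffle between the existing shards, unlike `hash(key) % N`.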

13. Async Processing & Messaging


💬 Say:

"Heavy operations are handled asynchronously."

💬 Example:

"Kafka handles fan-out, notifications, and background jobs."

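The shape of the async path, with the request handler enqueueing an event and a background consumer doing the heavy work, can be sketched with the standard library. Here `queue.Queue` is only a stand-in for a Kafka topic, and the event fields and function names are assumptions.

```python
import queue
import threading

events = queue.Queue()  # stand-in for a message topic
delivered = []          # side effects produced by the consumer

def handle_post_created(event):
    # Heavy work (fan-out, notifications) happens off the request path.
    for follower_id in event["follower_ids"]:
        delivered.append((follower_id, event["post_id"]))

def worker():
    while True:
        event = events.get()
        try:
            if event is None:  # sentinel: shut the consumer down
                break
            handle_post_created(event)
        finally:
            events.task_done()

def create_post(post_id, follower_ids):
    # Request path: enqueue the event and return immediately.
    events.put({"post_id": post_id, "follower_ids": follower_ids})
    return {"post_id": post_id, "status": "accepted"}

consumer = threading.Thread(target=worker, daemon=True)
consumer.start()

response = create_post(post_id=1, follower_ids=[10, 11])
events.join()     # demo only: wait until the background work is done
events.put(None)  # stop the consumer
consumer.join()
```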

14. Load Balancing & Service Scaling


💬 Say:

"To avoid single points of failure, I'll use load balancing."

💬 Example:

"Multiple load balancers with health checks. Stateless services auto-scale horizontally."

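A round-robin balancer that skips unhealthy backends is enough to show the idea. `RoundRobinBalancer` and its methods are hypothetical names; in practice a health checker would call `mark_unhealthy` and `mark_healthy` based on periodic probes.

```python
import itertools

class RoundRobinBalancer:
    """Rotate through backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark_unhealthy(self, backend):
        self._healthy.discard(backend)

    def mark_healthy(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends")

balancer = RoundRobinBalancer(["a", "b", "c"])
all_healthy = [balancer.next_backend() for _ in range(4)]
balancer.mark_unhealthy("b")  # e.g. a failed health check
with_b_down = [balancer.next_backend() for _ in range(3)]
```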

15. Fault Tolerance & Failure Handling


💬 Say:

"Now I'll discuss failure scenarios."

💬 Example:

"If cache fails → fallback to DB. If DB fails → replicas take over. If queue fails → retries ensure eventual completion."

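Two of these behaviors, falling back from cache to DB and retrying failed work, can be sketched as small helpers. The function names, the exponential backoff schedule, and the choice of which exceptions count as retryable are assumptions for illustration.

```python
import time

def retry(operation, attempts=3, base_delay=0.1,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation`, retrying with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)

def read_with_fallback(key, cache_get, db_get):
    """Serve from cache when possible; degrade to the DB if the cache is down."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache outage: fall through to the database
    return db_get(key)

# Demo: an operation that fails twice, then succeeds.
state = {"calls": 0}
delays = []

def flaky_publish():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("queue unavailable")
    return "ok"

def broken_cache(key):
    raise ConnectionError("cache down")

result = retry(flaky_publish, sleep=delays.append)  # record delays instead of sleeping
fallback_value = read_with_fallback("feed:1", broken_cache, lambda key: "from-db")
```

Retries only make sense when the retried operation is idempotent, which ties back to the idempotency keys in the API design step.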

16. Multi-Region & Geo Distribution


💬 Say:

"Since users are global, we need geo-distribution."

💬 Example:

"Deploy across regions. Use geo-routing. Replicate data asynchronously."


17. Security & Privacy


💬 Say:

"Security is critical in production systems."

💬 Example:

"Use JWT/OAuth for authentication. Encrypt data in transit (HTTPS) and at rest. Apply RBAC for access control."


18. Monitoring & Observability


💬 Say:

"We need visibility into system health."

💬 Example:

"Track latency, QPS, error rates. Centralized logging and alerting."

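Tail latency and error rate are simple to compute from raw samples. The sketch below uses the nearest-rank percentile definition, which is one of several in common use; the function names and the sample data are illustrative.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)

latencies_ms = list(range(1, 101))           # synthetic samples: 1..100 ms
statuses = [200] * 98 + [500, 503]           # synthetic responses

summary = {
    "p50_ms": percentile(latencies_ms, 50),
    "p95_ms": percentile(latencies_ms, 95),
    "p99_ms": percentile(latencies_ms, 99),
    "error_rate": error_rate(statuses),
}
print(summary)
```

An alert would then compare `p99_ms` against the 200 ms target from the non-functional requirements.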

🔷 PHASE 5: EVALUATION

19. Trade-offs Analysis


💬 Say:

"Every design decision has trade-offs."

💬 Example:

"Fan-out on write improves read latency but increases write cost. Hybrid approach balances both."

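The trade-off can be put in numbers using the earlier estimates. This is a rough model under stated assumptions: no caching, 200M posts/day, 1B feed loads/day, and an average of 200 followers per author (equal to the assumed 200 follows per user, since every follow edge has exactly one follower).

```python
POSTS_PER_DAY = 200_000_000
FEED_LOADS_PER_DAY = 1_000_000_000
AVG_FOLLOWERS = 200   # assumption carried over from the follow count
AVG_FOLLOWING = 200

def fanout_on_write_ops(posts_per_day, avg_followers):
    """Feed-inbox writes per day when every post is pushed to each follower."""
    return posts_per_day * avg_followers

def fanout_on_read_ops(feed_loads_per_day, avg_following):
    """Per-author timeline lookups per day when feeds are assembled at read time."""
    return feed_loads_per_day * avg_following

write_cost = fanout_on_write_ops(POSTS_PER_DAY, AVG_FOLLOWERS)
read_cost = fanout_on_read_ops(FEED_LOADS_PER_DAY, AVG_FOLLOWING)
print(f"fan-out on write: {write_cost:.1e} inbox writes/day")
print(f"fan-out on read:  {read_cost:.1e} timeline lookups/day")
```

Averages hide the skew: an author with millions of followers makes a single post cost millions of writes, which is the case the hybrid approach moves to the read path.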

20. Bottlenecks & Future Improvements


💬 Say:

"Finally, I'll discuss bottlenecks and improvements."

💬 Example:

"Hot users can cause load imbalance. Future improvements: ML-based ranking, smarter caching."