The Steps to Solve System Design Interviews - Complete FAANG Guide
Table of Contents
- 🔷 PHASE 1: UNDERSTAND, SCOPE & CONSTRAINTS
- 1. Clarify Ambiguity + Define Scope
- 2. Functional Requirements
- 3. Non-Functional Requirements
- 4. Constraints & Assumptions
- 🔷 PHASE 2: ESTIMATION & INTERFACE
- 5. Capacity Estimation
- 🔢 Numbers Everyone Should Know (Latency)
- Conversions (MEMORIZE THESE)
- 6. API Design
- 🔷 PHASE 3: HIGH-LEVEL DESIGN
- 7. High-Level Architecture
- 8. Data Model & Access Patterns
- 9. Database Selection
- 🔷 PHASE 4: DETAILED DESIGN (CORE OF INTERVIEW)
- 10. Core Component Deep Dive
- 11. Caching Strategy (DEEPER)
- 12. Partitioning & Sharding
- 13. Async Processing & Messaging
- 14. Load Balancing & Service Scaling
- 15. Fault Tolerance & Failure Handling
- 16. Multi-Region & Geo Distribution
- 17. Security & Privacy
- 18. Monitoring & Observability
- 🔷 PHASE 5: EVALUATION
- 19. Trade-offs Analysis
- 20. Bottlenecks & Future Improvements
🔷 PHASE 1: UNDERSTAND, SCOPE & CONSTRAINTS
1. Clarify Ambiguity + Define Scope
- What EXACT system?
- What to exclude?
- Who are users?
💬 Say:
"Before jumping into design, I'd like to clarify scope to avoid assumptions. Are we focusing on core features like feed, posting, and interactions, or should we include messaging, ads, and notifications as well? For this discussion, I'll focus on News Feed, posts, likes, and follows, and exclude messaging and ads to keep scope manageable."
2. Functional Requirements
- Core user actions only (5–7)
💬 Say:
"Let me list the core functional requirements: the key actions users perform in the system."
💬 Example:
- Create post
- Follow/unfollow users
- View news feed
- Like/comment
- View profile
💬 Add:
"I'm focusing only on the most critical flows to keep the design focused."
3. Non-Functional Requirements
Core:
- Scale (users, QPS)
- Latency (p95/p99)
- Availability (SLA)
- Consistency (strong vs eventual)
- Read vs write ratio
Advanced (often missed):
- Durability
- Fault tolerance
- Reliability
- Cost constraints
💬 Say:
"Now I'll define non-functional requirements, since these drive most design decisions."
💬 Example:
"System should support ~100M DAU. Expected read-heavy workload with ~10:1 read/write ratio. p99 latency should be under 200ms for feed reads. Availability target is 99.99%. Feed can be eventually consistent, but post creation must be strongly consistent. System must be fault-tolerant, durable, and cost-efficient."
4. Constraints & Assumptions
- Unknowns → make assumptions explicit
💬 Say:
"Since some details are unspecified, I'll make a few reasonable assumptions and call them out."
💬 Example:
"I'll assume average user follows 200 people and opens feed 10 times/day. I'll refine these if needed."
🔷 PHASE 2: ESTIMATION & INTERFACE
5. Capacity Estimation
Traffic:
- DAU
- Requests per user
- Actions per user
- Peak vs avg QPS
Data:
- Payload size
- Storage/day/year
Bandwidth:
- Ingress/egress
Growth:
- 2–3 year projection
💬 Say:
"Now I'll do back-of-the-envelope calculations to estimate scale."
💬 Example:
"100M DAU × 10 feed loads = 1B requests/day. Spread over ~10⁵ seconds in a day, that's ~10K average QPS, or roughly 100K at peak assuming a ~10× peak factor.
200M posts/day × 1KB = 2 × 10¹¹ bytes = 200 GB/day.
Over 1 year: 200 GB × 365 ≈ 73 TB/year.
Media is much larger, so we store it in object storage and serve it via CDN."
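These back-of-the-envelope numbers are easy to script. A quick sketch in Python; the ~10× peak factor and the 1 KB post size are assumptions, as above:

```python
# Back-of-the-envelope capacity estimation with the figures from the text.
DAU = 100_000_000            # daily active users
FEED_LOADS = 10              # feed opens per user per day
POSTS_PER_DAY = 200_000_000
POST_SIZE_BYTES = 1_000      # ~1 KB of post metadata; media is stored separately
SECONDS_PER_DAY = 86_400     # interviews often round this to ~10^5
PEAK_FACTOR = 10             # assumed ratio of peak to average QPS

requests_per_day = DAU * FEED_LOADS                    # 1e9 requests/day
avg_qps = requests_per_day / SECONDS_PER_DAY           # ~11.6K average QPS
peak_qps = avg_qps * PEAK_FACTOR                       # ~116K, i.e. ~100K at peak

storage_per_day_gb = POSTS_PER_DAY * POST_SIZE_BYTES / 1e9    # 200 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1e3          # ~73 TB/year

print(f"avg ≈ {avg_qps:,.0f} QPS, peak ≈ {peak_qps:,.0f} QPS")
print(f"storage ≈ {storage_per_day_gb:.0f} GB/day ≈ {storage_per_year_tb:.0f} TB/year")
```

Keeping the constants at the top makes it easy to redo the math when the interviewer changes an assumption.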
🔢 Numbers Everyone Should Know (Latency)
💬 Say (VERY IMPORTANT):
"I'll use standard latency numbers to justify caching and storage decisions."
- L1 cache reference: ~0.5 ns
- Main-memory reference: ~100 ns
- Read 1 MB sequentially from SSD: ~1 ms
- Send 1 MB over the network: ~10 ms
- Cross-region round trip: ~150 ms
💬 Add Insight:
"Since network calls are orders of magnitude slower than memory, caching is critical to reduce latency."
Conversions (MEMORIZE THESE)
| Name | Power | Value |
|---|---|---|
| Thousand | 10³ | 1K |
| Million | 10⁶ | 1M |
| Billion | 10⁹ | 1B |
| Trillion | 10¹² | 1T |
💬 Say:
"I'll use powers of 10 for quick estimation."
6. API Design
- REST/gRPC/GraphQL
- Pagination
- Idempotency
- Versioning
💬 Say:
"Next, I'll define API contracts between client and backend."
💬 Example:
POST /v1/posts
GET /v1/feed?cursor=abc&limit=20
POST /v1/like
💬 Add (IMPORTANT):
"APIs are versioned for backward compatibility and designed to be idempotent where needed."
💬 Explain (interviewer loves this):
"Idempotency ensures retries don't create duplicate actions. For example, POST requests can use an idempotency key to prevent duplicate writes."
🔷 PHASE 3: HIGH-LEVEL DESIGN
7. High-Level Architecture
- End-to-end flow
- Service separation
💬 Say:
"Now I'll walk through the high-level architecture."
"Client โ CDN โ Load Balancer โ API Gateway โ Microservices โ Cache โ Database โ Queue"
💬 Add:
"Services are stateless and horizontally scalable."
8. Data Model & Access Patterns
- Entities
- Query patterns (CRITICAL)
💬 Say:
"Before choosing databases, I'll define the data model and access patterns."
💬 Example:
"The most critical query is: 'Fetch latest posts from followed users sorted by time.'"
💬 Add Insight:
"Access patterns drive database choice, not the other way around."
9. Database Selection
💬 Say:
"Based on access patterns, I'll choose appropriate databases."
💬 Example:
"Cassandra for feed (high write throughput, time-series data). Relational DB for user relationships. This is a polyglot persistence approach."
🔷 PHASE 4: DETAILED DESIGN (CORE OF INTERVIEW)
10. Core Component Deep Dive
💬 Say:
"Now I'll deep dive into the most complex component: the feed system."
💬 Example:
"I'll use a hybrid fan-out approach: Fan-out on write for normal users and fan-out on read for celebrities."
11. Caching Strategy (DEEPER)
💬 Say:
"Caching is critical to reduce latency and database load."
💬 Example:
"Redis stores precomputed feeds. Cache-aside pattern is used. TTL ensures freshness. Hot users may have shorter TTL."
💬 Add Insight:
"Cache helps avoid expensive DB and network calls."
12. Partitioning & Sharding
💬 Say:
"To handle scale, data must be partitioned."
💬 Example:
"Shard by user_id using consistent hashing. Handle hotspots by splitting heavy users."
13. Async Processing & Messaging
💬 Say:
"Heavy operations are handled asynchronously."
💬 Example:
"Kafka handles fan-out, notifications, and background jobs."
14. Load Balancing & Service Scaling
💬 Say:
"To avoid single points of failure, I'll use load balancing."
💬 Example:
"Multiple load balancers with health checks. Stateless services auto-scale horizontally."
15. Fault Tolerance & Failure Handling
💬 Say:
"Now I'll discuss failure scenarios."
💬 Example:
"If the cache fails → fall back to the DB. If the primary DB fails → replicas take over. If the queue fails → retries ensure eventual completion."
16. Multi-Region & Geo Distribution
💬 Say:
"Since users are global, we need geo-distribution."
💬 Example:
"Deploy across regions. Use geo-routing. Replicate data asynchronously."
17. Security & Privacy
💬 Say:
"Security is critical in production systems."
💬 Example:
"Use JWT/OAuth for authentication. Encrypt data in transit (HTTPS) and at rest. Apply RBAC for access control."
18. Monitoring & Observability
💬 Say:
"We need visibility into system health."
💬 Example:
"Track latency, QPS, error rates. Centralized logging and alerting."
🔷 PHASE 5: EVALUATION
19. Trade-offs Analysis
💬 Say:
"Every design decision has trade-offs."
💬 Example:
"Fan-out on write improves read latency but increases write cost. Hybrid approach balances both."
20. Bottlenecks & Future Improvements
💬 Say:
"Finally, I'll discuss bottlenecks and improvements."
💬 Example:
"Hot users can cause load imbalance. Future improvements: ML-based ranking, smarter caching."