Data Compression vs Deduplication - System Design Interview Guide
Table of Contents
- 2. Data Compression (High Signal)
  - What it does
  - Key Insight
  - Architecture Signals (Use Compression when)
  - Problems
  - FAANG Q&A
  - Script
- 3. Data Deduplication (High Signal)
  - What it does
  - Key Insight
  - Architecture Signals (Use Deduplication when)
  - Problems
  - FAANG Q&A
  - Script
- 4. Key Differences (Interview Table)
  - Script
- 5. Real Architecture Decision (FAANG Level)
  - Use BOTH together (Very Important)
  - Why?
  - FAANG Q&A
  - Script
- 6. Strong Signals vs Weak Signals
  - Choose Compression (Strong Signals)
  - Choose Deduplication (Strong Signals)
  - Weak Signals
- 7. FAANG-Level Insights (Added)
  - 1. Chunking Strategy (Dedup)
  - 2. Content Addressable Storage
  - 3. Inline vs Post-process Dedup
  - 4. Compression Algorithms
- Final Ultra-Short Summary
  - Compression → Shrinks data inside a file
  - Deduplication → Removes duplicate data across files
Script
"Compression reduces size within data. Deduplication removes repeated data across the system."
2. Data Compression (High Signal)
What it does
- Encodes data using fewer bits
- Works within a single file/stream
- Types:
  - Lossless (exact recovery)
  - Lossy (some data removed)
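The lossless guarantee is easy to demonstrate with Python's standard `zlib` module (a minimal sketch; zlib stands in here for any DEFLATE-family codec such as gzip):

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog\n" * 100  # repetitive input

compressed = zlib.compress(text, level=6)  # lossless: same information, fewer bits
restored = zlib.decompress(compressed)     # exact recovery, by definition

assert restored == text                    # lossless round-trip guarantees identity
print(f"{len(text)} bytes -> {len(compressed)} bytes")
```

A lossy codec (JPEG, MP3) offers no such round-trip: the removed detail is gone for good, which is exactly the tradeoff the interviewer wants you to name.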
Key Insight
Optimizes storage + network bandwidth
Architecture Signals (Use Compression when)
- Need faster network transfer (CDN, APIs)
- Storing large files (images, videos, logs)
- Bandwidth is expensive
- Real-time systems need smaller payloads
Problems
- CPU overhead (compress/decompress)
- Lossy → quality degradation
FAANG Q&A
Q1: Why gzip APIs? → Reduce payload → faster response time.
Q2: When avoid compression? → Already compressed data (JPEG, MP4).
Q3: Tradeoff? → CPU vs bandwidth.
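Q2 can be verified directly. Random bytes stand in for already-compressed media below, since JPEG/MP4 payloads look statistically random to a second compressor (a sketch using stdlib `zlib` only):

```python
import os
import zlib

text_payload = b'{"user": "alice", "role": "admin"}\n' * 50  # compressible JSON-ish text
media_like = os.urandom(1750)  # stands in for JPEG/MP4 bytes: already high entropy

print(len(zlib.compress(text_payload)), "vs", len(text_payload))  # large win
print(len(zlib.compress(media_like)), "vs", len(media_like))      # no win: output is not smaller
```

Re-compressing incompressible data burns CPU and can even grow the payload slightly (container overhead), which is why gzip is usually disabled for image and video responses.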
Script
"I use compression to reduce storage and network cost. It works at the file level and is useful for transmission efficiency."
3. Data Deduplication (High Signal)
What it does
- Removes duplicate blocks/files
- Stores one copy + references
- Works across entire system
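The "one copy + references" idea fits in a few lines. `DedupStore` and its dict-backed tables are illustrative names for this sketch, not a real system's API:

```python
import hashlib

class DedupStore:
    """Store each unique blob once; duplicate writes become references."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes (one physical copy)
        self.files = {}  # filename -> content hash (a reference)

    def put(self, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # physically written only if new
        self.files[name] = digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.files[name]]

store = DedupStore()
store.put("a.txt", b"hello world")
store.put("b.txt", b"hello world")  # duplicate: adds a reference, not a blob
print(len(store.blobs), "physical copy,", len(store.files), "references")
```

The hash acts as the data's identity, so "do I already have this?" is a dictionary lookup rather than a byte-by-byte comparison across the whole system.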
Key Insight
Optimizes storage at scale
Architecture Signals (Use Deduplication when)
- Backup systems (daily snapshots)
- Cloud storage (same files repeated)
- Logs / documents with redundancy
- Large-scale storage (TB–PB)
Problems
- Needs hashing/indexing → CPU/memory cost
- Only works for identical data
FAANG Q&A
Q1: Why dedup in backups? → Same files repeated → huge storage savings.
Q2: File-level vs block-level dedup?
- File → simple, less efficient
- Block → complex, more savings
Q3: Tradeoff? → Storage saved vs compute overhead.
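Q2's distinction shows up directly in the backup scenario: when one block of a file changes, block-level dedup stores only that block, while file-level dedup must re-store the entire file because the whole-file hash changed. A sketch with fixed-size blocks and SHA-256 fingerprints:

```python
import hashlib

BLOCK = 4096

def block_hashes(data: bytes, size: int = BLOCK) -> list[str]:
    """Fingerprint each fixed-size block of a file."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

monday = b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK   # first backup
tuesday = b"A" * BLOCK + b"B" * BLOCK + b"X" * BLOCK  # only the last block changed

stored = set(block_hashes(monday))
new_blocks = [h for h in block_hashes(tuesday) if h not in stored]

# File-level view: monday != tuesday, so Tuesday would be stored in full.
# Block-level view: only 1 of Tuesday's 3 blocks is actually new.
print(len(new_blocks), "of", len(block_hashes(tuesday)), "blocks stored again")
```

This is why daily snapshots dedup so well: most blocks are unchanged day to day, and the index of fingerprints is the compute/memory cost the tradeoff question is asking about.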
Script
"I use deduplication in large-scale storage systems to eliminate redundant data and save space across datasets."
4. Key Differences (Interview Table)
| Aspect | Compression | Deduplication |
|---|---|---|
| Scope | Within file | Across system |
| Method | Encode efficiently | Remove duplicates |
| Use Case | Network + storage | Storage optimization |
| Data Type | Any data | Only identical data |
| Recovery | Decompress | Use references |
| CPU Cost | Medium | High (hashing/indexing) |
Script
"Compression reduces redundancy within data, while deduplication removes redundancy across data."
5. Real Architecture Decision (FAANG Level)
Use BOTH together (Very Important)
Upload → Compression (reduce size)
Storage → Deduplication (remove duplicates)
Why?
- Compression → saves bandwidth
- Dedup → saves storage
FAANG Q&A
Q: Which comes first? → Dedup BEFORE compression (important!)
Why? → Compression changes data → duplicates become unrecognizable.
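The ordering rule can be shown concretely: two files sharing a large common region dedup well in raw form, but once each is compressed independently the shared bytes are re-encoded in context-dependent ways and the overlap disappears. A sketch (the 64-byte chunk size is an arbitrary illustrative choice):

```python
import hashlib
import zlib

def chunk_set(data: bytes, size: int = 64) -> set:
    """Fingerprints of fixed-size chunks, as a block-level deduplicator sees them."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

shared = b"common header block " * 200       # region duplicated across both files
file_a = shared + b"payload A " * 100
file_b = shared + b"payload B " * 100

raw_overlap = len(chunk_set(file_a) & chunk_set(file_b))
compressed_overlap = len(chunk_set(zlib.compress(file_a)) &
                         chunk_set(zlib.compress(file_b)))

print("dedup before compression finds", raw_overlap, "shared chunk fingerprints")
print("dedup after compression finds", compressed_overlap)  # typically zero
```

Each compressed stream encodes the shared region relative to its own dictionary and symbol statistics, so byte-identical input regions no longer produce byte-identical output; dedup must therefore run on the raw data first.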
Script
"In real systems, I first deduplicate to remove duplicates, then compress to reduce size further."
6. Strong Signals vs Weak Signals
Choose Compression (Strong Signals)
- "Reduce API payload"
- "Improve latency"
- "Streaming / CDN"
- "Bandwidth optimization"
Choose Deduplication (Strong Signals)
- "Backup system"
- "Repeated files/data"
- "Storage cost problem"
- "Large-scale storage"
Weak Signals (avoid these leaps)
- "Big data → use compression only"
- "Storage issue → always dedup"
7. FAANG-Level Insights (Added)
1. Chunking Strategy (Dedup)
- Fixed-size vs variable-size chunks → Variable = better dedup ratio
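A minimal illustration of why variable-size (content-defined) chunking dedups better: inserting a few bytes near the start of a file shifts every fixed-size boundary after the edit, but boundaries chosen from the content itself realign once past it. The window size, hash, and mask below are toy illustrative values; production systems use Rabin fingerprints or Gear/FastCDC:

```python
import random

WINDOW, MASK = 8, 0x1F  # toy parameters: a boundary roughly every 32 bytes

def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut a chunk wherever a hash of the trailing WINDOW bytes matches MASK."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        h = 0
        for byte in data[i - WINDOW + 1:i + 1]:  # O(n * WINDOW): fine for a demo
            h = (h * 31 + byte) & 0xFFFFFFFF
        if (h & MASK) == MASK and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data: bytes, size: int = 32) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

random.seed(7)
original = bytes(random.randrange(256) for _ in range(4000))
edited = original[:100] + b"INSERTED" + original[100:]  # 8 bytes added near the front

fixed_shared = len(set(fixed_chunks(original)) & set(fixed_chunks(edited)))
cdc_shared = len(set(cdc_chunks(original)) & set(cdc_chunks(edited)))
print("fixed-size shared chunks:", fixed_shared)       # only chunks before the edit
print("content-defined shared chunks:", cdc_shared)    # most of the file still matches
```

Because each boundary depends only on the bytes in the trailing window, the chunk sequence after the edit is identical in both versions, which is exactly what gives variable-size chunking its better dedup ratio.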
2. Content Addressable Storage
- Hash → data identity (used in Git, S3-like systems)
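Git's object store is the classic example: a blob's identity is the SHA-1 of a small header plus its content, so identical content always gets the same name and is stored once regardless of filename or path. This mirrors what `git hash-object` computes:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Content-addressed identity, Git-style: sha1(b"blob <len>\\0" + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical content -> identical address; different content -> different address.
print(git_blob_id(b"hello world\n"))
print(git_blob_id(b"hello world\n") == git_blob_id(b"hello world\n"))
```

Addressing data by its hash makes deduplication automatic: a second write of the same bytes resolves to an object that already exists.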
3. Inline vs Post-process Dedup
- Inline → in the write path (adds write latency, saves space immediately)
- Post-process → runs later in the background (faster ingestion, temporary extra storage)
4. Compression Algorithms
- gzip (general purpose)
- Snappy (fast, lower compression ratio)
- LZ4 (very fast, real-time systems)
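Snappy and LZ4 bindings are third-party packages in Python, but the speed-vs-ratio spectrum they represent can be sketched with stdlib `zlib` alone: low levels approximate the fast-codec end (less compression, less CPU), high levels the max-ratio end:

```python
import time
import zlib

payload = b"2024-01-01T00:00:00Z INFO request served path=/api/users status=200\n" * 5000

for level in (1, 6, 9):  # 1 ~ "fast codec" end, 9 ~ max-ratio end of the spectrum
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(out):>6} bytes, {ms:.1f} ms")
```

The interview point: pick the codec (or level) by which resource is scarce, CPU in the hot path favors Snappy/LZ4-style choices, storage or bandwidth cost favors gzip-style ratios.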
Final Ultra-Short Summary
Golden Line
"Compression shrinks data. Deduplication eliminates repetition."