Data Compression vs Deduplication - System Design Interview Guide
Table of Contents
- 2. Data Compression (High Signal)
  - What it does
  - Key Insight
  - Architecture Signals (Use Compression when)
  - Problems
  - FAANG Q&A
  - Script
- 3. Data Deduplication (High Signal)
  - What it does
  - Key Insight
  - Architecture Signals (Use Deduplication when)
  - Problems
  - FAANG Q&A
  - Script
- 4. Key Differences (Interview Table)
  - Script
- 5. Real Architecture Decision (FAANG Level)
  - Use BOTH together (Very Important)
  - Why?
  - FAANG Q&A
  - Script
- 6. Strong Signals vs Weak Signals
  - Choose Compression (Strong Signals)
  - Choose Deduplication (Strong Signals)
  - Weak Signals
- 7. FAANG-Level Insights (Added)
  - 1. Chunking Strategy (Dedup)
  - 2. Content Addressable Storage
  - 3. Inline vs Post-process Dedup
  - 4. Compression Algorithms
- Final Ultra-Short Summary
  - Compression → Shrinks data inside a file
  - Deduplication → Removes duplicate data across files
Script
"Compression reduces size within data. Deduplication removes repeated data across the system."
2. Data Compression (High Signal)
What it does
- Encodes data using fewer bits
- Works within a single file/stream
- Types:
  - Lossless (exact recovery)
  - Lossy (some data removed)
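The lossless guarantee is easy to demonstrate with Python's standard `zlib` module (a minimal sketch; zlib stands in here for any DEFLATE-family codec such as gzip):

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog\n" * 100  # repetitive input

compressed = zlib.compress(text, level=6)  # lossless: same information, fewer bits
restored = zlib.decompress(compressed)     # exact recovery, by definition

assert restored == text                    # lossless round-trip guarantees identity
print(f"{len(text)} bytes -> {len(compressed)} bytes")
```

A lossy codec (JPEG, MP3) offers no such round-trip: the removed detail is gone for good, which is exactly the tradeoff the interviewer wants you to name.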
Key Insight
Optimizes storage + network bandwidth
Architecture Signals (Use Compression when)
- Need faster network transfer (CDN, APIs)
- Storing large files (images, videos, logs)
- Bandwidth is expensive
- Real-time systems need smaller payloads
Problems
- CPU overhead (compress/decompress)
- Lossy → quality degradation
FAANG Q&A
Q1: Why gzip APIs? → Reduce payload → faster response time.
Q2: When avoid compression? → Already compressed data (JPEG, MP4).
Q3: Tradeoff? → CPU vs bandwidth.
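Q2 can be verified directly. Random bytes stand in for already-compressed media below, since JPEG/MP4 payloads look statistically random to a second compressor (a sketch using stdlib `zlib` only):

```python
import os
import zlib

text_payload = b'{"user": "alice", "role": "admin"}\n' * 50  # compressible JSON-ish text
media_like = os.urandom(1750)  # stands in for JPEG/MP4 bytes: already high entropy

print(len(zlib.compress(text_payload)), "vs", len(text_payload))  # large win
print(len(zlib.compress(media_like)), "vs", len(media_like))      # no win: output is not smaller
```

Re-compressing incompressible data burns CPU and can even grow the payload slightly (container overhead), which is why gzip is usually disabled for image and video responses.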
Script
"I use compression to reduce storage and network cost. It works at the file level and is useful for transmission efficiency."
3. Data Deduplication (High Signal)
What it does
- Removes duplicate blocks/files
- Stores one copy + references
- Works across entire system
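The "one copy + references" idea fits in a few lines. `DedupStore` and its dict-backed tables are illustrative names for this sketch, not a real system's API:

```python
import hashlib

class DedupStore:
    """Store each unique blob once; duplicate writes become references."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes (one physical copy)
        self.files = {}  # filename -> content hash (a reference)

    def put(self, name: str, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # physically written only if new
        self.files[name] = digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.files[name]]

store = DedupStore()
store.put("a.txt", b"hello world")
store.put("b.txt", b"hello world")  # duplicate: adds a reference, not a blob
print(len(store.blobs), "physical copy,", len(store.files), "references")
```

The hash acts as the data's identity, so "do I already have this?" is a dictionary lookup rather than a byte-by-byte comparison across the whole system.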
Key Insight
Optimizes storage at scale
Architecture Signals (Use Deduplication when)
- Backup systems (daily snapshots)
- Cloud storage (same files repeated)
- Logs / documents with redundancy
- Large-scale storage (TB–PB)
Problems
- Needs hashing/indexing → CPU/memory cost
- Only works for identical data
FAANG Q&A
Q1: Why dedup in backups? → Same files repeated → huge storage savings.
Q2: File-level vs block-level dedup?
- File → simple, less efficient
- Block → complex, more savings
Q3: Tradeoff? → Storage saved vs compute overhead.
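Q2's distinction shows up directly in the backup scenario: when one block of a file changes, block-level dedup stores only that block, while file-level dedup must re-store the entire file because the whole-file hash changed. A sketch with fixed-size blocks and SHA-256 fingerprints:

```python
import hashlib

BLOCK = 4096

def block_hashes(data: bytes, size: int = BLOCK) -> list[str]:
    """Fingerprint each fixed-size block of a file."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

monday = b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK   # first backup
tuesday = b"A" * BLOCK + b"B" * BLOCK + b"X" * BLOCK  # only the last block changed

stored = set(block_hashes(monday))
new_blocks = [h for h in block_hashes(tuesday) if h not in stored]

# File-level view: monday != tuesday, so Tuesday would be stored in full.
# Block-level view: only 1 of Tuesday's 3 blocks is actually new.
print(len(new_blocks), "of", len(block_hashes(tuesday)), "blocks stored again")
```

This is why daily snapshots dedup so well: most blocks are unchanged day to day, and the index of fingerprints is the compute/memory cost the tradeoff question is asking about.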
Script
"I use deduplication in large-scale storage systems to eliminate redundant data and save space across datasets."
4. Key Differences (Interview Table)
| Aspect | Compression | Deduplication |
|---|---|---|
| Scope | Within file | Across system |
| Method | Encode efficiently | Remove duplicates |
| Use Case | Network + storage | Storage optimization |
| Data Type | Any data | Only identical data |
| Recovery | Decompress | Use references |
| CPU Cost | Medium | High (hashing/indexing) |
Script
"Compression reduces redundancy within data, while deduplication removes redundancy across data."
5. Real Architecture Decision (FAANG Level)
Use BOTH together (Very Important)
Upload → Compression (reduce size)
Storage → Deduplication (remove duplicates)
Why?
- Compression → saves bandwidth
- Dedup → saves storage
FAANG Q&A
Q: Which comes first? → Dedup BEFORE compression (important!)
Why? → Compression changes data → duplicates become unrecognizable.
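The ordering rule can be shown concretely: two files sharing a large common region dedup well in raw form, but once each is compressed independently the shared bytes are re-encoded in context-dependent ways and the overlap disappears. A sketch (the 64-byte chunk size is an arbitrary illustrative choice):

```python
import hashlib
import zlib

def chunk_set(data: bytes, size: int = 64) -> set:
    """Fingerprints of fixed-size chunks, as a block-level deduplicator sees them."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

shared = b"common header block " * 200       # region duplicated across both files
file_a = shared + b"payload A " * 100
file_b = shared + b"payload B " * 100

raw_overlap = len(chunk_set(file_a) & chunk_set(file_b))
compressed_overlap = len(chunk_set(zlib.compress(file_a)) &
                         chunk_set(zlib.compress(file_b)))

print("dedup before compression finds", raw_overlap, "shared chunk fingerprints")
print("dedup after compression finds", compressed_overlap)  # typically zero
```

Each compressed stream encodes the shared region relative to its own dictionary and symbol statistics, so byte-identical input regions no longer produce byte-identical output; dedup must therefore run on the raw data first.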
Script
"In real systems, I first deduplicate to remove duplicates, then compress to reduce size further."
6. Strong Signals vs Weak Signals
Choose Compression (Strong Signals)
- "Reduce API payload"
- "Improve latency"
- "Streaming / CDN"
- "Bandwidth optimization"
Choose Deduplication (Strong Signals)
- "Backup system"
- "Repeated files/data"
- "Storage cost problem"
- "Large-scale storage"
Weak Signals (avoid these leaps)
- "Big data → use compression only"
- "Storage issue → always dedup"
7. FAANG-Level Insights (Added)
1. Chunking Strategy (Dedup)
- Fixed-size vs variable-size chunks → Variable = better dedup ratio
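A minimal illustration of why variable-size (content-defined) chunking dedups better: inserting a few bytes near the start of a file shifts every fixed-size boundary after the edit, but boundaries chosen from the content itself realign once past it. The window size, hash, and mask below are toy illustrative values; production systems use Rabin fingerprints or Gear/FastCDC:

```python
import random

WINDOW, MASK = 8, 0x1F  # toy parameters: a boundary roughly every 32 bytes

def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut a chunk wherever a hash of the trailing WINDOW bytes matches MASK."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        h = 0
        for byte in data[i - WINDOW + 1:i + 1]:  # O(n * WINDOW): fine for a demo
            h = (h * 31 + byte) & 0xFFFFFFFF
        if (h & MASK) == MASK and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data: bytes, size: int = 32) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

random.seed(7)
original = bytes(random.randrange(256) for _ in range(4000))
edited = original[:100] + b"INSERTED" + original[100:]  # 8 bytes added near the front

fixed_shared = len(set(fixed_chunks(original)) & set(fixed_chunks(edited)))
cdc_shared = len(set(cdc_chunks(original)) & set(cdc_chunks(edited)))
print("fixed-size shared chunks:", fixed_shared)       # only chunks before the edit
print("content-defined shared chunks:", cdc_shared)    # most of the file still matches
```

Because each boundary depends only on the bytes in the trailing window, the chunk sequence after the edit is identical in both versions, which is exactly what gives variable-size chunking its better dedup ratio.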
2. Content Addressable Storage
- Hash → data identity (used in Git, S3-like systems)
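Git's object store is the classic example: a blob's identity is the SHA-1 of a small header plus its content, so identical content always gets the same name and is stored once regardless of filename or path. This mirrors what `git hash-object` computes:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Content-addressed identity, Git-style: sha1(b"blob <len>\\0" + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical content -> identical address; different content -> different address.
print(git_blob_id(b"hello world\n"))
print(git_blob_id(b"hello world\n") == git_blob_id(b"hello world\n"))
```

Addressing data by its hash makes deduplication automatic: a second write of the same bytes resolves to an object that already exists.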
3. Inline vs Post-process Dedup
- Inline → in the write path (adds write latency, saves space immediately)
- Post-process → runs later in the background (faster ingestion, temporary extra storage)
4. Compression Algorithms
- gzip (general purpose)
- Snappy (fast, lower compression ratio)
- LZ4 (very fast, real-time systems)
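Snappy and LZ4 bindings are third-party packages in Python, but the speed-vs-ratio spectrum they represent can be sketched with stdlib `zlib` alone: low levels approximate the fast-codec end (less compression, less CPU), high levels the max-ratio end:

```python
import time
import zlib

payload = b"2024-01-01T00:00:00Z INFO request served path=/api/users status=200\n" * 5000

for level in (1, 6, 9):  # 1 ~ "fast codec" end, 9 ~ max-ratio end of the spectrum
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(out):>6} bytes, {ms:.1f} ms")
```

The interview point: pick the codec (or level) by which resource is scarce, CPU in the hot path favors Snappy/LZ4-style choices, storage or bandwidth cost favors gzip-style ratios.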
Final Ultra-Short Summary
Golden Line
"Compression shrinks data. Deduplication eliminates repetition."