
Designing a Scalable Web Crawler

🔷 PHASE 1: UNDERSTAND, SCOPE & CONSTRAINTS

1. Clarify Ambiguity + Define Scope

Before designing the architecture, we must define what the crawler is actually responsible for.

💬 Say:

"To ensure the design is focused, I want to clarify the scope. Are we building a general-purpose crawler for all media types, or focusing on HTML? I'll assume we are focusing on HTML pages for now, but the design should be modular enough to support images, videos, and other MIME types in the future. Additionally, I'll assume we are using HTTP as the primary protocol."


2. Functional Requirements

The core logic revolves around the "crawl loop": discovering, fetching, and processing.

💬 Example:

  • Seed Processing: Start with a set of seed URLs.
  • URL Discovery: Extract links from downloaded HTML pages.
  • Fetching: Download pages via HTTP.
  • Robots.txt Compliance: Respect the robots.txt exclusion protocol.
  • Deduplication: Ensure the same URL or content is not processed multiple times.
  • Storage: Store the retrieved pages and metadata.

3. Non-Functional Requirements

These requirements drive the distributed nature of the system.

💬 Say:

"Because we aim to crawl a significant portion of the web, scalability and politeness are our primary non-functional drivers."

  • Scalability: Must scale to fetch billions of documents (15 billion at our target).
  • Extensibility: Modular design to support new document types or protocols.
  • Politeness: Avoid overloading target servers (an unthrottled crawler behaves like a DoS attack).
  • Fault Tolerance: Ability to resume from checkpoints after a crash.
  • Robustness: Handle "crawler traps" (infinite loops of dynamically generated URLs).

4. Constraints & Assumptions

💬 Example:

  • Target Scale: Crawl 15 billion unique web pages.
  • Timeframe: Complete the crawl within 4 weeks.
  • Page Size: Average HTML page size of 100KB.
  • Metadata: ~500 bytes of metadata per page.

🔷 PHASE 2: ESTIMATION & INTERFACE

5. Capacity Estimation

📈 Traffic & Throughput

To calculate the required pages per second:

  • Total Pages: 15 Billion
  • Time: 4 weeks ≈ 2,419,200 seconds
  • Calculation: 15B / 2,419,200 ≈ 6,200 pages/sec
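A quick sanity check of this figure in Python, using the scoped numbers (15 billion pages, 4 weeks):

```python
# Back-of-the-envelope check of the crawl-rate target.
total_pages = 15_000_000_000
seconds_in_4_weeks = 4 * 7 * 24 * 3600   # 2,419,200 seconds
pages_per_sec = total_pages / seconds_in_4_weeks
print(f"{pages_per_sec:,.0f} pages/sec")  # ≈ 6,200 pages/sec
```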

💾 Storage Estimation

  • Raw Data: 15B pages × (100 KB + 500 B) ≈ 1.5 Petabytes
  • Buffer for Capacity: Assuming a 70% utilization model to avoid disk saturation: 1.5 PB / 0.7 ≈ 2.14 Petabytes

🧠 Memory Estimation (Deduplication)

To avoid visiting the same URL twice, we store checksums (8 bytes each):

  • URL Checksums: 15B × 8 bytes = 120 GB
  • Content Checksums: 15B × 8 bytes = 120 GB
  • Insight: While 120GB fits in a large server's RAM, a distributed cache (LRU) with persistent backing storage is more reliable.
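The storage and memory figures follow directly from the stated assumptions (100 KB per page, 500 B of metadata, 8-byte checksums, 70% disk utilization); a quick check:

```python
# Verify the storage and dedupe-memory estimates from the stated assumptions.
pages = 15_000_000_000
raw_bytes = pages * (100_000 + 500)   # 100 KB page + 500 B metadata
provisioned_bytes = raw_bytes / 0.7   # plan for 70% disk utilization
checksum_bytes = pages * 8            # one 8-byte checksum per page

print(raw_bytes / 1e15)          # ≈ 1.5 PB raw
print(provisioned_bytes / 1e15)  # ≈ 2.15 PB (the ~2.14 PB above uses the rounded 1.5 PB figure)
print(checksum_bytes / 1e9)      # 120 GB per checksum set
```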

🔷 PHASE 3: HIGH-LEVEL DESIGN

6. The Core Components

The system follows a continuous loop: Frontier → Fetcher → Extractor → Deduplicator → Store.

Component Breakdown:

  1. URL Frontier: A prioritized queue of URLs to be visited.
  2. HTML Fetcher: Handles DNS resolution, robots.txt checks, and HTTP requests.
  3. Extractor: Parses HTML to find new outbound links.
  4. Duplicate Eliminator: Uses checksums to prevent redundant processing.
  5. Datastore: Persistent storage for the page content and metadata.

💬 Architecture Flow:

"The worker thread pulls a URL from the Frontier, resolves the IP via DNS, checks robots.txt, and fetches the page. The content is passed through a dedupe test; if unique, the extractor pulls new URLs, which are filtered and added back to the Frontier."
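As an illustration, this loop can be sketched in a few lines of Python. It is a single-threaded toy: `fetch` and `extract_links` are hypothetical stand-ins for the real fetcher (DNS, robots.txt, HTTP) and extractor, and a plain in-memory set plays the role of the distributed URL-seen test.

```python
from collections import deque

def crawl(seeds, fetch, extract_links, max_pages=1000):
    frontier = deque(seeds)        # URL Frontier (here: a simple FIFO)
    seen = set(seeds)              # URL-seen test (here: an in-memory set)
    store = {}                     # Datastore
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        page = fetch(url)          # DNS + robots.txt + HTTP in the real system
        if page is None:
            continue               # fetch failed or disallowed
        store[url] = page
        for link in extract_links(page):
            if link not in seen:   # Duplicate Eliminator
                seen.add(link)
                frontier.append(link)
    return store
```

Seeding it with a small fake link graph (pages represented as link lists) exercises the full discover-fetch-dedupe cycle without any network I/O.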


🔷 PHASE 4: DETAILED DESIGN

7. Distributed URL Frontier & Politeness

To prevent overloading a single server, we cannot simply use a global FIFO queue.

  • Hostname Mapping: Use a hash function to map a hostname to a specific worker thread/server.
  • Sub-queues: Each worker has its own FIFO queue. This ensures that only one thread is hitting a specific domain at a time.
  • Disk Backing: Since the frontier is too large for RAM, enqueue buffers are dumped to disk and dequeue buffers are cached in memory.
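A minimal sketch of the hostname-to-queue mapping, assuming an illustrative queue count and hash function (neither is prescribed by the design):

```python
import hashlib
from urllib.parse import urlparse

NUM_QUEUES = 64  # illustrative; sized to the number of worker threads

def queue_for(url: str) -> int:
    """Map every URL on the same host to the same politeness sub-queue."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_QUEUES
```

Because all URLs for a host collapse to one sub-queue, the single worker draining that queue naturally serializes requests to the host.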

8. Optimizing the Fetching Process

  • DNS Caching: DNS resolution is a bottleneck. We implement a local DNS cache to avoid repeated lookups for the same domain.
  • Document Input Stream (DIS): To allow multiple processing modules (e.g., link extractor, indexer) to read the same page without re-downloading, the content is cached in a DIS (in memory for files under 64 KB, on disk for larger files).
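A minimal TTL-based DNS cache along these lines (the blocking resolver call and the TTL value are illustrative; a production crawler would use an asynchronous resolver):

```python
import socket
import time

class DnsCache:
    """Cache hostname -> IP lookups with a fixed TTL to avoid repeated resolution."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries = {}  # hostname -> (ip, expiry timestamp)

    def resolve(self, host: str) -> str:
        entry = self._entries.get(host)
        if entry is not None and entry[1] > time.time():
            return entry[0]                      # cache hit
        ip = socket.gethostbyname(host)          # cache miss: real lookup
        self._entries[host] = (ip, time.time() + self.ttl)
        return ip
```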

9. Deduplication Strategies

  • Content Dedupe: Compute a checksum of the page content (e.g., an MD5/SHA digest, or a 64-bit fingerprint to save space).
  • URL Dedupe: Use checksums of canonical URLs.
  • Bloom Filters: To optimize the URL-seen test, we can use a Bloom Filter.
    • Trade-off: A false positive means we skip a page we've never seen. This is generally acceptable in web crawling to save massive amounts of storage/latency.
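To make the trade-off concrete, here is a toy Bloom filter for the URL-seen test (the bit-array size and hash scheme are illustrative and far too small for 15B URLs):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A false positive here makes the crawler believe a fresh URL was already seen and skip it, which is the acceptable loss described above.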

10. Reliability & Fault Tolerance

  • Consistent Hashing: Used to distribute URLs across crawling servers. If a node fails, only a fraction of the URLs need to be remapped.
  • Checkpointing: Periodically snapshot the state of the Frontier and processed URL sets to disk.
  • Crawler Traps: Implement limits on URL depth and detect cycles to avoid infinite loops generated by dynamic websites.
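A compact sketch of a consistent-hash ring for assigning hostnames to crawler nodes (the virtual-node count is an illustrative balancing choice):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Each physical node appears `vnodes` times on the ring for balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, host: str) -> str:
        # Walk clockwise to the first virtual node at or after the host's hash.
        idx = bisect.bisect(self._ring, (self._hash(host), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a node is removed, only the hosts that hashed to its virtual nodes move to a neighbor; the rest of the assignments are untouched, which is what limits remapping on failure.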