Design Web Crawler

Topics

  1. Politeness / Crawl Rate

  2. DNS query

  3. Distributed Crawling

  4. Priority Crawling

  5. Duplicate detection

Functional Requirements

  1. Crawling: The system should scrape the web, starting from a queue of seed URLs provided initially by the system admin. There are two ways to obtain them: (1) manually create them, (2) scan IP addresses for the presence of web servers.

  2. Storing: The system should extract and store the content of each URL in a blob store, making it processable by a search engine for indexing and ranking.

  3. Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to keep its blob store records up to date.

  4. Throttling crawl speed / politeness?

  5. Crawling based on priority?

  6. Deduplicating URLs.

  7. Do we need to support search?

Non-functional Requirements

  1. Scalability: the system should scale to fetch the billions of web documents estimated below.

  2. Consistency: with multiple crawling workers, data consistency among all of them is necessary.

  3. Performance: the crawler should self-throttle its crawl rate per domain.

Scale

  1. 5 B web pages.

  2. Text content per page: 2000 KB.

  3. Metadata per web page: 500 B.

Total storage: 5 B × (2000 KB + 500 B) ≈ 10 PB

Traversal time

Assuming it takes 60 ms to download one page:

total = 5 billion × 60 ms = 0.3 B seconds ≈ 9.5 years

One server would take about 3468 days, so we need roughly 3468 servers to finish the crawl in a day.
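
To sanity-check these numbers, here is the same arithmetic as a small Python sketch (the 60 ms per-page download time is the assumption stated above):

```python
# Back-of-envelope numbers from the Scale and Traversal sections.
PAGES = 5_000_000_000       # 5 B web pages
PAGE_BYTES = 2_000_000      # ~2000 KB of text content per page
META_BYTES = 500            # ~500 B of metadata per page
FETCH_SECONDS = 0.060       # assumed 60 ms to download one page

storage_pb = PAGES * (PAGE_BYTES + META_BYTES) / 1e15
print(f"total storage ~= {storage_pb:.1f} PB")       # -> ~10.0 PB

total_seconds = PAGES * FETCH_SECONDS                # 3e8 s ~= 9.5 years
days_one_server = total_seconds / 86_400             # ~3470 days
print(f"one server: {days_one_server:.0f} days")
print(f"servers to finish in one day: {days_one_server:.0f}")
```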

API

CRUD

POST /v1/crawl/seeds

```json
{
  "urls": [url1, url2, ...]
}
```

PUT /v1/crawl?url=xxx
Response: 201

GET /v1/search?query=hello+world

High Level Design

Scheduler: schedules crawling events on URLs that are stored in its database.

  • Priority queue (URL frontier): hosts URLs that are ready for crawling, ordered by priority and update frequency.

  • RDBMS: stores URLs along with their priority and update (recrawl) frequency.
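
A minimal sketch of this scheduler, assuming an in-memory dict in place of the RDBMS and a min-heap as the priority queue (the field names and default recrawl interval are illustrative, not from the original design):

```python
import heapq
import time

class Scheduler:
    """Feeds the URL frontier from per-URL records.

    A smaller priority number means the URL is crawled sooner.
    """

    def __init__(self):
        # In the real design these records live in the RDBMS; a dict stands in.
        self.records = {}    # url -> (priority, recrawl_interval_s)
        self.frontier = []   # min-heap of (priority, due_at, url)

    def add_url(self, url, priority=1, recrawl_interval_s=86_400):
        self.records[url] = (priority, recrawl_interval_s)
        heapq.heappush(self.frontier, (priority, time.time(), url))

    def next_url(self):
        """Pop the highest-priority URL that is due, rescheduling its recrawl."""
        if not self.frontier:
            return None
        priority, due_at, url = heapq.heappop(self.frontier)
        if due_at > time.time():
            # Nothing is due yet; put the entry back.
            heapq.heappush(self.frontier, (priority, due_at, url))
            return None
        _, interval = self.records[url]
        heapq.heappush(self.frontier, (priority, due_at + interval, url))
        return url
```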

E2E

  1. Assignment to a worker: The crawler (service host) initiates the process by loading a URL from the scheduler's priority queue and assigning it to an available worker.

  2. DNS resolution: The worker sends the incoming URL for DNS resolution. The DNS resolver checks its cache and returns the requested IP address if it's found; otherwise it determines the IP address, sends it back to the worker instance of the crawler, and stores the result in the cache.

  3. Communication: The HTML fetcher initiates communication between the crawler and the host server.

  4. Content extraction: Once the HTML fetcher gets the web page, the next step is to extract URLs and content from it. The extractor sends the extracted URLs and content, as a document input stream (DIS), to the duplicate eliminator. Once verified as non-duplicate, the extractor stores the content in blob storage and sends the URL to the URL frontier for recrawl.

  5. Dedup testing: The duplicate eliminator calculates checksums of both the URL and the document and compares them against the checksum values stored in the DB. It discards the incoming request on a match; if there is no match, it stores the newly calculated checksum in the DB and gives the go-ahead to the extractor (see the sketch after this list).

Note: calculating a checksum word by word over the whole document would be very time-consuming; hashing the document in a single pass is cheaper.

  6. Content storing: The extractor sends the newly discovered URLs to the scheduler, which saves them in the DB and sets their priority values.

  7. Recrawling: Once a cycle is complete, the crawler goes back to the first step and repeats the process until the URL frontier queue is empty.
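
A minimal sketch of the dedup test from step 5, assuming a single MD5 checksum per document and an in-memory set standing in for the checksum DB:

```python
import hashlib

seen_checksums = set()   # stands in for the checksum table in the DB

def is_duplicate(content: bytes) -> bool:
    """Return True if an identical document has already been stored."""
    checksum = hashlib.md5(content).hexdigest()
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False
```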

Deep dive

How do we get the seed URLs?

  1. Manually create them.

  2. Scan IP addresses for the presence of web servers.

Seed URLs must be of good quality.

Why we need a DNS resolver:

  1. We crawl using a library, not a browser, so there is no browser-side DNS cache; resolving and caching IPs ourselves avoids additional network hops.

  2. Standard DNS resolution is synchronous and doesn't work well with a multi-threaded worker architecture, so we need our own cached, concurrent resolver.
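
A minimal sketch of such a resolver, assuming a simple in-memory cache shared by worker threads (cache eviction and TTLs are omitted):

```python
import socket
import threading

class CachingDnsResolver:
    """Caches hostname -> IP lookups so worker threads skip repeat queries."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def resolve(self, hostname: str) -> str:
        with self._lock:
            if hostname in self._cache:
                return self._cache[hostname]
        # Do the (blocking) OS lookup outside the lock so other workers
        # are not held up by a slow resolution.
        ip = socket.gethostbyname(hostname)
        with self._lock:
            self._cache[hostname] = ip
        return ip
```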

Crawler Traps

A crawler trap is a set of URLs that cause a crawler to crawl indefinitely and exhaust its resources:

  1. URLs with query parameters

  2. URLs with internal links

  3. URLs with dynamic content generation

  4. URLs with repeated cyclic directories

We need to identify them:

  1. Analyze the URL structure: if a URL is poorly structured, skip it, e.g.

http://www.abc.com/first/second/first/second/...

  2. Analyze the total number of web pages against a domain, and set a cap.
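
A minimal sketch of the URL-structure check, assuming an illustrative length cap and repeat threshold:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048   # illustrative cap
MAX_REPEATS = 2         # how often one path segment may repeat

def looks_like_trap(url: str) -> bool:
    """Flag overly long URLs and repeated cyclic directories,
    e.g. http://www.abc.com/first/second/first/second/..."""
    if len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(segments.count(s) > MAX_REPEATS for s in set(segments))
```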

Robots.txt

This file lists the DOs and DON'Ts for a crawler, as specified by the site owner. This is called the robots exclusion protocol. We need to respect it and crawl only the allowed URLs.
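
Python's standard library ships a parser for the robots exclusion protocol; a check before fetching might look like this (the user-agent string and URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()   # fetch and parse the site's robots.txt

if rp.can_fetch("MyCrawlerBot", "https://example.com/some/page"):
    pass    # allowed: go ahead and download the page
```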

URL Frontier

Front queues handle priority; back queues handle politeness.

  1. Priority

How to set priority:

  • How frequently does the site change?

  2. Politeness

Put URLs from the same site onto the same back queue, to make sure we don't overwhelm that site (see the sketch after the table below).

One back queue is associated with one worker thread.

  • Instead of using a static crawl speed for every domain, we can adjust the crawl speed based on the domain's time to first byte (TTFB): the higher the value, the slower the server, so we should crawl it more slowly.

  3. Freshness

Partition metadata table

| Host           | Back queue ID |
| -------------- | ------------- |
| digit.com      | 5             |
| techcrunch.com | 1             |
| youtube.com    | 17            |
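
A minimal sketch of the back-queue mapping and a TTFB-adjusted delay, assuming a hash-based host-to-queue assignment and an illustrative delay formula:

```python
import time
from collections import deque
from urllib.parse import urlparse

NUM_BACK_QUEUES = 32                       # illustrative size
back_queues = [deque() for _ in range(NUM_BACK_QUEUES)]
host_to_queue = {}                         # the partition metadata table
next_fetch_at = [0.0] * NUM_BACK_QUEUES    # per-queue politeness timer

def enqueue(url: str) -> None:
    """All URLs of the same host land on the same back queue."""
    host = urlparse(url).netloc
    qid = host_to_queue.setdefault(host, hash(host) % NUM_BACK_QUEUES)
    back_queues[qid].append(url)

def record_ttfb(qid: int, ttfb_s: float, base_delay_s: float = 1.0) -> None:
    """Adaptive politeness: the higher a host's time to first byte,
    the longer we wait before polling its back queue again."""
    next_fetch_at[qid] = time.time() + base_delay_s + 10 * ttfb_s
```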

To detect near-duplicate documents, we can use minhash (a method of estimating how close two sets are) or simhash, together with fuzzy search (checking whether documents are roughly similar).
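
A minimal minhash sketch for estimating how close two documents' word sets are (the number of hash functions and the use of salted MD5 are illustrative choices):

```python
import hashlib

NUM_HASHES = 64   # illustrative signature length

def _h(seed: int, word: str) -> int:
    # A salted hash stands in for a family of independent hash functions.
    return int(hashlib.md5(f"{seed}:{word}".encode()).hexdigest(), 16)

def minhash_signature(words: set[str]) -> list[int]:
    """One minimum per salted hash; assumes a non-empty word set."""
    return [min(_h(seed, w) for w in words) for seed in range(NUM_HASHES)]

def estimated_jaccard(a: set[str], b: set[str]) -> float:
    """The fraction of matching slots approximates |a ∩ b| / |a ∪ b|."""
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / NUM_HASHES
```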
