👽
Software Engineer Interview Handbook
  • README
  • Behavioral
    • Useful Links
    • Dongze Li
  • Algorithm
    • Segment Tree
    • Array
      • Product Of Array Except Self
      • Merge Strings Alternately
      • Increasing Triplet Subsequence
      • String Compression
      • Greatest Common Divisor Strings
      • Max Product Of Three
      • Find Duplicate Num
      • Valid Palindrome Ii
      • Next Permutation
      • Rearrange Array By Sign
      • Removing Min Max Elements
      • Find Original Array From Doubled
      • Reverse Words Ii
    • Backtracking
      • Letter Combination Phone Number
      • Combination Sum Iii
      • N Queens
      • Permutations
      • Combination Sum
    • Binary Search
      • Koko Eating Bananas
      • Find Peak Element
      • Successful Pairs Of Spells Potions
    • Binary Search Tree
      • Delete Node In BST
      • Validate Bst
      • Range Sum Bst
    • Binary Tree
      • Maximum Depth
      • Leaf Similar Trees
      • Maximum Level Sum
      • Binary Tree Right Side
      • Lowest Common Ancestor
      • Longest Zigzag Path
      • Count Good Nodes
      • Path Sum III
      • Maximum Path Sum
      • Move Zero
      • Diameter Binary Tree
      • Sum Root Leaf Number
      • Traversal
      • Binary Tree Vertical Order
      • Height Tree Removal Queries
      • Count Nodes Avg Subtree
      • Distribute Coins
      • Binary Tree Max Path Sum
    • Bit
      • Min Flips
      • Single Number
      • Pow
      • Find Unique Binary Str
    • BFS
      • Rotten Oranges
      • Nearest Exist From Entrance
      • Minimum Knight Moves
      • Network Delay Time
      • Minimum Height Tree
      • Knight Probability In Board
    • Design
      • LRU Cache
      • Get Random
      • LFU Cache
      • Moving Average
      • Rle Iterator
      • Design Hashmap
    • DFS
      • Reorder Routes Lead City
      • Evaluate Division
      • Keys And Rooms
      • Number Of Provinces
      • Disconnected Path With One Flip
      • Course Schedule Ii
      • Robot Room Cleaner
      • Word Break Ii
      • Number Coins In Tree Nodes
      • Maximum Increasing Cells
      • Number Coins In Tree Nodes
      • Detonate Maximum Bombs
      • Find All Possible Recipes
      • Min Fuel Report Capital
      • Similar String Groups
    • DP
      • Domino And Tromino Tiling
      • House Robber
      • Longest Common Subsequence
      • Trade Stock With Transaction Fee
      • Buy And Sell Stock
      • Longest Non Decreasing Subarray
      • Number Of Good Binary Strings
      • Delete And Earn
      • Minimum Costs Using Train Line
      • Decode Ways
      • Trapping Rain Water
      • Count Fertile Pyramids
      • Minimum Time Finish Race
      • Knapsack
      • Count Unique Char Substrs
      • Count All Valid Pickup
    • Greedy
      • Dota2 Senate
      • Smallest Range Ii
      • Can Place Flowers
      • Meeting Rooms II
      • Guess the word
      • Minimum Replacement
      • Longest Palindrome Two Letter Words
      • Parentheses String Valid
      • Largest Palindromic Num
      • Find Missing Observations
      • Most Profit Assigning Work
    • Hashmap
      • Equal Row Column Pairs
      • Two Strings Close
      • Group Anagrams
      • Detect Squares
    • Heap
      • Maximum Subsequence Score
      • Smallest Number Infinite Set
      • Total Cost Hire Workers
      • Kth Largest Element
      • Meeting Rooms III
      • K Closest Points Origin
      • Merge K Sorted List
      • Top K Frequent Elements
      • Meeting Room III
      • Num Flowers Bloom
      • Find Median From Stream
    • Intervals
      • Non Overlapping Intervals
      • Min Arrows Burst Ballons
    • Linkedlist
      • Reverse Linked List
      • Delete Middle Node
      • Odd Even Linkedlist
      • Palindrome Linkedlist
    • Monotonic Stack
      • Daily Temperatures
      • Online Stock Span
    • Random
      • Random Pick With Weight
      • Random Pick Index
      • Shuffle An Array
    • Recursion
      • Difference Between Two Objs
    • Segment Fenwick
      • Longest Increasing Subsequence II
    • Stack
      • Removing Stars From String
      • Asteroid Collision
      • Evaluate Reverse Polish Notation
      • Building With Ocean View
      • Min Remove Parentheses
      • Basic Calculator Ii
      • Simplify Path
      • Min Add Parentheses
    • Prefix Sum
      • Find The Highest Altitude
      • Find Pivot Index
      • Subarray Sum K
      • Range Addition
    • Sliding Window
      • Max Vowels Substring
      • Max Consecutive Ones III
      • Longest Subarray Deleting Element
      • Minimum Window Substring
      • K Radius Subarray Averages
    • String
      • Valid Word Abbreviations
    • Two Pointers
      • Container With Most Water
      • Max Number K Sum Pairs
      • Is Subsequence
      • Num Substrings Contains Three Char
    • Trie
      • Prefix Tree
      • Search Suggestions System
      • Design File System
    • Union Find
      • Accounts Merge
    • Multithreading
      • Basics
      • Web Crawler
  • System Design
    • Operating System
    • Mocks
      • Design ChatGPT
      • Design Web Crawler
      • Distributed Search
      • News Feed Search
      • Top K / Ad Click Aggregation
      • Design Job Scheduler
      • Distributed Message Queue
      • Google Maps
      • Nearby Friends
      • Proximity Service
      • Metrics monitoring and alert system
      • Design Email
      • Design Gaming Leaderboard
      • Facebook New Feed Live Comments
      • Dog Sitting App
      • Design Chat App (WhatsApp)
      • Design Youtube/Netflix
      • Design Google Doc
      • Design Webhook
      • Validate Instacart Shopper Checkout
      • Design Inventory
      • Design donation app
      • Design Twitter
    • Deep-Dive
      • Back of Envelope
      • Message Queue
      • Redis Sorted Set
      • FAQ
      • Geohash
      • Quadtree
      • Redis Pub/Sub
      • Cassandra DB
      • Collaborative Concurrency Control
      • Websocket / Long Polling / SSE
    • DDIA
      • Chapter 2: Data Models and Query Languages
      • Chapter 5: Replication
      • Chapter 9: Consistency and Consensus
  • OOD
    • Overview
    • Design Parking
  • Company Tags
    • Meta
    • Citadel
      • C++ Fundamentals
      • 面经1
      • Fibonacci
      • Pi
      • Probability
    • DoorDash
      • Similar String Groups
      • Door And Gates
      • Max Job Profit
      • Design File System
      • Count All Valid Pickup
      • Most Profit Assigning Work
      • Swap
      • Binary Tree Max Path Sum
      • Nearest Cities
      • Exployee Free Time
      • Tree Add Removal
    • Lyft
      • Autocomplete
      • Job Scheduler
      • Read4
      • Kvstore
    • Amazon
      • Min Binary Str Val
    • AppLovin
      • TODO
      • Java Basic Questions
    • Google
      • Huffman Tree
      • Unique Elements
    • Instacart
      • Meeting Rooms II
      • Pw
      • Pw2
      • Pw3
      • Expression1
      • Expression2
      • Expression3
      • PW All
      • Expression All
      • Wildcard
      • Free forum tech discussion
    • OpenAI
      • Spreadsheet
      • Iterator
      • Kv Store
    • Rabbit
      • Scheduler
      • SchedulerC++
    • [Microsoft]
      • Min Moves Spread Stones
      • Inorder Successor
      • Largest Palindromic Num
      • Count Unique Char Substrs
      • Reverse Words Ii
      • Find Missing Observations
      • Min Fuel Report Capital
      • Design Hashmap
      • Find Original Array From Doubled
      • Num Flowers Bloom
      • Distribute Coins
      • Find Median From Stream
Powered by GitBook
On this page
  • Webhook from hookdeck
  • Retry from Uber
  • Topics
  • Qs:
  • Functional Requirement:
  • Non-functional Requirements:
  • Scale
  • API
  • Schema
  • High Level Diagram
  • E2E
  • Deep Dive
  • How to make sure request succeed?
  • How to handle failure?
  • How to do retry?
  • What's wrong with simple retries?
  • How to verify webhook url ownership?
  • How to make sure webhook request are secure?
  • What happen if retry doesn't work?
  • How to scale?
  • Why we need extra logging and monitoring?
  • Security issue?
  • Deliverability
  1. System Design
  2. Mocks

Design Webhook

PreviousDesign Google DocNextValidate Instacart Shopper Checkout

Last updated 1 year ago

Topics

  1. How to handle failure?

  2. How to do retry?

  3. How to scale?

  4. How to validate webhook url ownership?

  5. How to make sure webhook request are secure?

  6. Rate limiting and VPC before sending the request?

  7. Why to use Kafka vs SQS? Pros and Cons?

  8. How to filter different event types, can we use Topic from Kafka?

Qs:

  1. What delivery semantic do we provide?

    1. At least once

    2. At most once

    3. Exactly once

  2. If we resend webhook, could we assume the endpoint to be idempotent?

  3. Are we designing 1st party webhooks or 3rd party webhooks? This is critical since it determines the webhook triggering mechanism. If can be either triggered by us, or triggered by our client.

  4. Do we support canceling the webhook?

Functional Requirement:

  1. Customer could register multiple webhook

  2. Verify ownership of webhook url

  3. Security validation for webhook

  4. Filtering by event type.

  5. Rate limiting to consumers.

  6. UI Visibility - delivery status: event type, failed_reason, response.

Non-functional Requirements:

  1. Reliability - retry

  2. Durability - don't lose event.

  3. Security

  4. Highly available

Scale

read, write pattern? a lot of write.

How many users?

API

CRUD

POST /v1/webhook
Request
{
  user_id
  event_type,
  secret_token,
  destination_url,
  payload JSON,
}

Response:
201 status code
json {
  "id": "webhook_id"
}

GET 
/v1/webhook?user_id=xxx
Request {
  page_number,
  page_size,
}

Response {
  webhook_list: [
    {
      webhook_id,
      destination_url,
      status,
      created_at,
    },
    {
      webhook_id,
      destination_url,
      status,
      created_at,
    }
    ...
  ],
}


GET 
Response: {
  id,
  user_id,
  status,
}

DELETE /v1/webhook (cancel a webhook)

Schema

User Table
Event Table

Webhook Table
{
  webhook_id (partition key)
  owner_id
  event_type
  destination_url
  secret_token
  created_at (sort key)
  is_active: bool
  status: PENDING, SUCCEED, FAILED
}

User Webhook Table
{
  webhook_id
  owner_id + time bucket(partition key)
  event_type
  destination_url
  secret_token
  created_at (sort key)
  is_active: bool
  status: PENDING, SUCCEED, FAILED
}

WebhookTask Table
{
  task_id
  webhook_id
  payload: JSON
  status: PENDING, SUCCEED, FAILED
}

High Level Diagram

E2E

  1. Client register a webhook with a url

  2. Webhook service send a validation request back to client's provided url.

  3. After webhook url is verified, we store this information in DB.

  4. Depending on triggering mechanism, we send sign the url and send webhook request with its payload to destination.

Deep Dive

How to make sure request succeed?

Make a ack protocol between you and client:

To acknowledge receipt of a webhook, your endpoint should return a 2xx HTTP status code. Any other information returned in the request headers or request body is ignored. All response codes outside this range, including 3xx codes, will indicate to Stripe that you did not receive the webhook. This does mean that a URL redirection or a “Not Modified” response will be treated as a failure.

How to handle failure?

Consumer APIs may fail, may timeout, in order to make sure the webhook is sent successfully at least once, we introduces message queue in the middle to decouple the webhook delivery with webhook registration process.

Benefits:

  1. We have retry support, if consumer API timeout or fail, we can reprocess the task again later.

  2. Webhook registration service doesn't have to wait for the webhook task to be delivered in order to register next webhook.

  3. If there are traffic burst, message queue can help smoothen out the load such that workers are not overwhelmed, and we can scale up number of workers to catch up on consumption speed.

How to do retry?

Depending on which message queue we are using, there are different mechanisms.

For traditional message queue like SQS, if the task failed to process for whatever reason, the worker don't acknowledge the message within the visibility timeout period, SQS assumes the message processing failed, and other worker will be able to see this task again for retry.

For Kafka, the worker only commit offset after the task is successfully processed. So if consumer worker dead, some other consumer will be able to start processing from the previous commit offset which will ensure the retry.

We can handle errors with exponential backoff. Each request that results in a non-200 response code or time out will be re-attempted over the course of 10 minutes. Client can see the error in the console UI.

What's wrong with simple retries?

  • Clogged batch processing When we are required to process a large number of messages in real time, repeatedly failed messages can clog batch processing.

We can use separate retry queues to insert failed messages and a separate set of retry consumers can pick up and do the retry. We commit offset in the original topic to unblock the process. For retry queues, we can add several layers. When the handler of a particular topic returns an error response for a given message, it will publish that message onto next retry topic. We use DLQ as the end of line Kafka topic.

We can replay dead letter messages by publishing them back to first retry topic.

For each subsequent level of retry consumers, we can enforce a processing delay.

Pros:

  • Unblock batch processing

  • We can decouple message into granular steps and only retry some of the step. Say the message succeeded in step 1 and failed at step 2, we only publish step2 portion of the job onto retry topic.

  • Observability is better, we have a easy tracing of errored message's path. When and how many times the message has been retried. We can monitor the rate of production into original processing topic versus those of retry topic and DLQ to inform thresholds for automated alerts.

How to verify webhook url ownership?

Send a request to client webhook url served as a verification request.

The verification request will be a GET request with a challenge parameter, which is a random string.

https://www.example.com/webhook?challenge=xxxx

client app should echo back the challenge parameter as the body of its response. Once we receives a valid response, the endpoint is considered a valid webhook, we can start sending notifications to those urls.

client app have ten seconds to responde to the verification request. We will not perform automatic retry for verficiation requests.

How to make sure webhook request are secure?

How to make sure webhook request are secure?

Basic Authentication

Authorization: Basic {base64(username:password)
  1. The developer of the destination application submits their username and password to webhook provider.

  2. Provider first sends a request with no Authorization header. The request is rejected with 401 and destination point sends back an authentication challenge using WWW-Authenticate header.

  3. Producer combine username and password and send base64 version.

  4. Destination endpoint receives the authenticated request, verify the credentials and if valid, allows the webhook.

Signature Verification

  1. A secret key is known by both webhook producer and consumer.

  2. When sending webhook, producer uses this key and cryptographic algorithms like HMAC to create cryptographic hash of the webhook payload.

  3. The signature is sent in a custom header along with the webhook request. The type of algorithm used sometimes is also sent.

X-Hub-Signature-256. Using HMAC hex digest.
  1. When webhook arrives at webhook URL, the receiving application takes the webhook payload and uses the secret key and cryptographic algorithm the calculate the signature.

  2. The calculated signature is then compared with that sent by producer in the custom header. If there is a match then the request is valid.

Prevent Replay attack

A reply attack occurs when an attacker gets hold of an authenticated request and repeats it, thereby causing duplicated webhooks.

To prevent replay attacks, signature verification allows you to add a timestamp that can be used to expire webhook after a certain period of time, ex: 2 mins. This time can be adjusted based on security requirements.

When webhook hits the webhook URL, it's checked against current time to see if it's still valid for use. If timestamp is too old, webhook is rejected.

What happen if retry doesn't work?

If issue comes from our end?

In SQS, if the configured maximum receive count for a message is reached without being successfully processed, SQS will move the message into a dead letter queue. We can investigate and fix issue and playback the messages later.

If issue comes from client end?

If client webhook url returns more than a percentage of errors in the past 10 minutes, or 5% failure rate. We can disable client's webhook and notify them through email. They can re-enable their webhook in console UI.

How to scale?

  1. Do load test to figure out the bottleneck.

Potential bottlenecks:

  • API/Webhook delivery servers: deploy on k8s and use auto scale.

  • Cassandra DB: figure out partition key and sort key to do partition and increase on write throughput?

  • Kafka: Add more specific topics and more partitions within the topic.

Why we need extra logging and monitoring?

Web-hook is hard to know where it failed, client won't be able to know. You will have to know.

Security issue?

  1. Server side request forgery (SSRF) -> make sure customer don't abuse your internal network.

  2. Set fixed set of proxies in order to get authenticated by other big company's firewall.

Deliverability

Filter out event based on event types ana schema

Rate limit on outgoing webhooks.

Webhook from hookdeck
Retry from Uber
Initial Approach without MQ
Drawing
Drawing
Drawing