Design Webhook

Topics

How to handle failure?
How to do retry?
How to scale?
How to validate webhook url ownership?
How to make sure webhook requests are secure?
Rate limiting and VPC before sending the request?
Why use Kafka vs SQS? Pros and cons?
How to filter different event types? Can we use Kafka topics?

Qs:

What delivery semantics do we provide?
1. At least once
2. At most once
3. Exactly once
If we resend a webhook, can we assume the endpoint is idempotent?
Are we designing 1st party webhooks or 3rd party webhooks? This is critical since it determines the webhook triggering mechanism. It can be either triggered by us, or triggered by our client.
Do we support canceling a webhook?

Functional Requirement:

Customers can register multiple webhooks.
Verify ownership of the webhook URL.
Security validation for webhook
Filtering by event type.
Rate limiting to consumers.
UI Visibility - delivery status: event type, failed_reason, response.

Non-functional Requirements:

Reliability - retry
Durability - don't lose event.
Security
Highly available

Scale

Read/write pattern? Write-heavy.

How many users?

API

CRUD

POST /v1/webhook
Request
{
  user_id,
  event_type,
  secret_token,
  destination_url,
  payload JSON,
}

Response:
201 status code
json {
  "id": "webhook_id"
}

GET 
/v1/webhook?user_id=xxx
Request {
  page_number,
  page_size,
}

Response {
  webhook_list: [
    {
      webhook_id,
      destination_url,
      status,
      created_at,
    },
    {
      webhook_id,
      destination_url,
      status,
      created_at,
    }
    ...
  ],
}


GET 
Response: {
  id,
  user_id,
  status,
}

DELETE /v1/webhook (cancel a webhook)

Schema

User Table
Event Table

Webhook Table
{
  webhook_id (partition key)
  owner_id
  event_type
  destination_url
  secret_token
  created_at (sort key)
  is_active: bool
  status: PENDING, SUCCEED, FAILED
}

User Webhook Table
{
  webhook_id
  owner_id + time bucket(partition key)
  event_type
  destination_url
  secret_token
  created_at (sort key)
  is_active: bool
  status: PENDING, SUCCEED, FAILED
}

WebhookTask Table
{
  task_id
  webhook_id
  payload: JSON
  status: PENDING, SUCCEED, FAILED
}

High Level Diagram

E2E

The client registers a webhook with a URL.
The webhook service sends a validation request back to the client's provided URL.
After the webhook URL is verified, we store this information in the DB.
Depending on the triggering mechanism, we sign the request and send the webhook request with its payload to the destination.

Deep Dive

How to make sure the request succeeds?

Define an ack protocol between you and the client:

To acknowledge receipt of a webhook, your endpoint should return a 2xx HTTP status code. Any other information returned in the request headers or request body is ignored. All response codes outside this range, including 3xx codes, will indicate to Stripe that you did not receive the webhook. This does mean that a URL redirection or a “Not Modified” response will be treated as a failure.
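
The ack rule above can be sketched as a small check on the delivery worker side. This is only an illustration; the function name is an assumption and not part of any provider's API.

```python
def is_acknowledged(status_code: int) -> bool:
    """Return True only for 2xx responses.

    Redirects (3xx), including 304 Not Modified, and all 4xx/5xx codes
    are treated as failed deliveries, as described above.
    """
    return 200 <= status_code < 300
```

A delivery worker would call this on every response and schedule a retry whenever it returns False.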

How to handle failure?

Consumer APIs may fail or time out. To make sure the webhook is delivered successfully at least once, we introduce a message queue in the middle to decouple webhook delivery from the webhook registration process.

Benefits:

We get retry support: if the consumer API times out or fails, we can reprocess the task later.
The webhook registration service doesn't have to wait for the webhook task to be delivered before registering the next webhook.
If there is a traffic burst, the message queue smooths out the load so that workers are not overwhelmed, and we can scale up the number of workers to catch up on consumption.

How to do retry?

Depending on which message queue we are using, there are different mechanisms.

For a traditional message queue like SQS, if the task fails to process for whatever reason and the worker doesn't delete (acknowledge) the message within the visibility timeout, SQS assumes processing failed and the message becomes visible again, so another worker can pick it up for retry.

For Kafka, the worker only commits the offset after the task is successfully processed. So if a consumer worker dies, another consumer in the group resumes from the last committed offset, which ensures the retry.

We can handle errors with exponential backoff. Each request that results in a non-2xx response code or a timeout is re-attempted over the course of 10 minutes. The client can see the error in the console UI.
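
A minimal sketch of such a backoff schedule follows. The base delay of 1 second, the doubling factor, and the jitter helper are assumptions for illustration; only the 10 minute window comes from the text above.

```python
import random


def backoff_delays(base_seconds=1.0, factor=2.0, window_seconds=600.0):
    """Return exponentially growing retry delays whose sum fits in the retry window."""
    delays = []
    elapsed = 0.0
    delay = base_seconds
    while elapsed + delay <= window_seconds:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays


def with_jitter(delay, rng=random):
    """Pick a random wait in [0, delay] so that many failing webhooks don't retry in lockstep."""
    return rng.uniform(0.0, delay)
```

With these assumed defaults the schedule has nine attempts (1s, 2s, 4s, up to 256s), after which the task would be handed to whatever handles exhausted retries.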

What's wrong with simple retries?

Clogged batch processing: when we are required to process a large number of messages in real time, repeatedly failing messages can clog batch processing.

We can insert failed messages into separate retry queues, and a separate set of retry consumers picks them up and retries them. We commit the offset in the original topic to unblock processing. The retry queues can have several layers: when the handler of a particular topic returns an error for a given message, it publishes that message onto the next retry topic. The DLQ is the end-of-the-line Kafka topic.

We can replay dead letter messages by publishing them back to first retry topic.

For each subsequent level of retry consumers, we can enforce a processing delay.

Pros:

Unblock batch processing
We can decouple a message into granular steps and only retry some of the steps. Say the message succeeded in step 1 and failed at step 2: we only publish the step 2 portion of the job onto the retry topic.
Observability is better: we can easily trace an errored message's path, including when and how many times it has been retried. We can monitor the rate of production into the original processing topic versus the retry topics and the DLQ to inform thresholds for automated alerts.
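
The layered retry topics described above can be sketched as a simple routing table. The topic names and per-level delays here are hypothetical; a real deployment would choose its own.

```python
# Hypothetical topic names; the last entry acts as the dead letter queue.
RETRY_CHAIN = ["webhook.main", "webhook.retry.1", "webhook.retry.2", "webhook.dlq"]

# Assumed processing delay (seconds) enforced by each retry level's consumers.
DELAY_SECONDS = {"webhook.retry.1": 30, "webhook.retry.2": 300}


def next_topic(current: str) -> str:
    """Return the topic a failed message should be published to next."""
    index = RETRY_CHAIN.index(current)
    # A message failing at the last retry level (or already in the DLQ) ends in the DLQ.
    return RETRY_CHAIN[min(index + 1, len(RETRY_CHAIN) - 1)]


def replay_topic() -> str:
    """Dead letter messages are replayed from the first retry topic."""
    return RETRY_CHAIN[1]


def ready_at(topic: str, failed_at: float) -> float:
    """Earliest time a consumer of `topic` should process a message that failed at `failed_at`."""
    return failed_at + DELAY_SECONDS.get(topic, 0)
```

A retry consumer would compare `ready_at` with the current time and wait before processing, which is how each level enforces its delay.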

How to verify webhook URL ownership?

Send a verification request to the client's webhook URL.

The verification request will be a GET request with a challenge parameter, which is a random string.

https://www.example.com/webhook?challenge=xxxx

The client app should echo back the challenge parameter as the body of its response. Once we receive a valid response, the endpoint is considered a valid webhook and we can start sending notifications to that URL.

The client app has ten seconds to respond to the verification request. We will not perform automatic retries for verification requests.
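
A minimal sketch of the challenge handshake on the provider side, using only the standard library. The helper names are assumptions, and the HTTP call itself is left out; the sketch only covers building the verification URL and checking the echoed body.

```python
import hmac
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

VERIFY_TIMEOUT_SECONDS = 10  # the client has ten seconds to respond


def new_challenge() -> str:
    """Generate an unguessable random challenge string."""
    return secrets.token_urlsafe(32)


def build_verification_url(webhook_url: str, challenge: str) -> str:
    """Append the challenge as a query parameter, keeping any existing query string."""
    parts = urlsplit(webhook_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("challenge", challenge))
    return urlunsplit(parts._replace(query=urlencode(query)))


def is_valid_echo(challenge: str, response_body: str) -> bool:
    """The endpoint is verified only if the body echoes the challenge exactly."""
    return hmac.compare_digest(challenge.encode(), response_body.strip().encode())
```

The provider would send a GET to the built URL with a timeout of `VERIFY_TIMEOUT_SECONDS` and mark the webhook as verified only if `is_valid_echo` returns True.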

How to make sure webhook requests are secure?

Basic Authentication

Authorization: Basic {base64(username:password)}

The developer of the destination application submits their username and password to the webhook provider.
The provider first sends a request with no Authorization header. The request is rejected with 401 and the destination endpoint sends back an authentication challenge using the WWW-Authenticate header.
The provider combines the username and password and sends the base64-encoded version.
The destination endpoint receives the authenticated request, verifies the credentials and, if they are valid, accepts the webhook.
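
The header construction in the steps above can be sketched as follows. The function name is an assumption; the encoding itself is standard HTTP Basic authentication.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Note that base64 is an encoding, not encryption, so this only makes sense over HTTPS.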

Signature Verification

A secret key is known by both the webhook producer and the consumer.
When sending a webhook, the producer uses this key and a cryptographic algorithm like HMAC to create a cryptographic hash of the webhook payload.
The signature is sent in a custom header along with the webhook request. The algorithm used is sometimes sent as well.

Example: the X-Hub-Signature-256 header, which carries an HMAC-SHA256 hex digest.

When the webhook arrives at the webhook URL, the receiving application takes the webhook payload and uses the secret key and the cryptographic algorithm to calculate the signature.
The calculated signature is then compared with the one sent by the producer in the custom header. If they match, the request is valid.
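
A minimal sketch of signing and verifying with HMAC-SHA256 from the standard library. The "sha256=" prefix follows the convention used with the X-Hub-Signature-256 header; the function names are assumptions.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Producer side: compute the header value for the raw request body."""
    digest = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: bytes, payload: bytes, header_value: str) -> bool:
    """Consumer side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

The consumer should verify against the raw bytes of the body as received, since re-serializing parsed JSON can change the bytes and break the comparison.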

Prevent replay attacks

A replay attack occurs when an attacker gets hold of an authenticated request and repeats it, thereby causing duplicated webhooks.

To prevent replay attacks, the signature can also cover a timestamp, which is used to expire the webhook after a certain period of time, e.g. 2 minutes. This window can be adjusted based on security requirements.

When the webhook hits the webhook URL, its timestamp is checked against the current time to see if it's still valid for use. If the timestamp is too old, the webhook is rejected.
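
A minimal sketch of a timestamped signature check. The message layout (timestamp, a dot, then the payload) and the function names are assumptions for illustration; the 2 minute tolerance is the example value from the text.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 120  # 2 minutes, adjustable per security requirements


def sign_with_timestamp(secret: bytes, timestamp: int, payload: bytes) -> str:
    """Sign the timestamp together with the payload so neither can be altered alone."""
    message = f"{timestamp}.".encode("utf-8") + payload
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_fresh_and_valid(secret, timestamp, payload, signature, now=None):
    """Reject webhooks whose timestamp is outside the tolerance or whose signature is wrong."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    expected = sign_with_timestamp(secret, timestamp, payload)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Because the timestamp is part of the signed message, an attacker cannot refresh an old request by simply rewriting the timestamp header.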

What happens if retry doesn't work?

If the issue comes from our end?

In SQS, if a message reaches the configured maximum receive count without being successfully processed, SQS moves the message into a dead letter queue. We can investigate and fix the issue and replay the messages later.

If the issue comes from the client's end?

If the client's webhook URL fails more than a set percentage of requests over the past 10 minutes (for example, a 5% failure rate), we can disable the client's webhook and notify them by email. They can re-enable the webhook in the console UI.
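
A minimal sketch of the failure-rate check, kept in memory per webhook. The class name is hypothetical; the 10 minute window and 5% threshold are the example values from the text. A production version would likely also require a minimum number of samples before disabling.

```python
from collections import deque


class FailureWindow:
    """Track delivery outcomes over a sliding time window for one webhook."""

    def __init__(self, window_seconds=600, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first

    def _evict(self, now):
        # Drop outcomes that fell out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def record(self, timestamp, succeeded):
        self.events.append((timestamp, succeeded))
        self._evict(timestamp)

    def should_disable(self, now):
        """True when the failure rate inside the window exceeds the threshold."""
        self._evict(now)
        if not self.events:
            return False
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) > self.threshold
```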

How to scale?

Do load tests to figure out the bottleneck.

Potential bottlenecks:

API/Webhook delivery servers: deploy on k8s and use autoscaling.
Cassandra DB: figure out the partition key and sort key to partition the data and increase write throughput.
Kafka: add more specific topics and more partitions within each topic.

Why do we need extra logging and monitoring?

With webhooks it is hard to tell where a delivery failed. The client has no way to know, so the provider has to know.

Security issues?

Server-side request forgery (SSRF) -> make sure customers can't abuse your internal network through the destination URL.
Send from a fixed set of proxies so that other big companies' firewalls can allowlist your traffic.
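
A minimal sketch of an SSRF guard that rejects destinations resolving to internal addresses. The function names are assumptions, and requiring HTTPS is an added assumption rather than something stated above. The resolver is injectable so the check can be exercised without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    """Reject private, loopback, link-local, multicast, reserved and unspecified addresses."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return not (addr.is_private or addr.is_loopback or addr.is_link_local
                or addr.is_multicast or addr.is_reserved or addr.is_unspecified)


def resolve_host(host: str) -> list:
    """Resolve a hostname to the set of IP addresses it points at."""
    infos = socket.getaddrinfo(host, None)
    return sorted({info[4][0] for info in infos})


def is_safe_destination(url: str, resolve=resolve_host) -> bool:
    """Allow only https URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    try:
        addresses = resolve(parts.hostname)
    except OSError:
        return False
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

The check should run at delivery time as well as at registration, since a hostname can be re-pointed at an internal address after it was validated.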

Deliverability

Filter events based on event type and schema.
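
A minimal sketch of event-type filtering, assuming webhook records shaped like the Webhook Table above. The function name and the sample event type strings are hypothetical.

```python
def subscribers_for(event_type, webhooks):
    """Return destination URLs of active webhooks registered for this event type."""
    return [
        w["destination_url"]
        for w in webhooks
        if w["is_active"] and w["event_type"] == event_type
    ]


# Hypothetical sample records using the Webhook Table fields.
WEBHOOKS = [
    {"webhook_id": "w1", "event_type": "order.created", "destination_url": "https://a.example/hook", "is_active": True},
    {"webhook_id": "w2", "event_type": "order.created", "destination_url": "https://b.example/hook", "is_active": False},
    {"webhook_id": "w3", "event_type": "order.refunded", "destination_url": "https://c.example/hook", "is_active": True},
]
```

With one Kafka topic per event type, the same lookup would happen once per topic instead of once per event.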

Rate limit on outgoing webhooks.
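
Outgoing rate limiting can be sketched with a token bucket kept per destination. The class name and parameters are assumptions; time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allow bursts up to `capacity` and a sustained rate of `rate_per_second`."""

    def __init__(self, rate_per_second: float, capacity: float, now: float = 0.0):
        self.rate_per_second = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = now

    def allow(self, now: float) -> bool:
        """Return True and consume a token if a request may be sent at time `now`."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

When `allow` returns False, the delivery worker would delay the task (for example by re-queueing it) rather than drop it, so the at-least-once guarantee still holds.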

