Design Twitter
Topics
SQL vs NoSQL
How to do index on db schema?
How to do pagination on news feed?
As users, requests, and data size increase, how do we scale the system?
Celebrity accounts: their data-access patterns differ a lot from regular users. Do we need a separate message queue to fan out their tweets?
How to deal with thundering herd problem?
How to validate and limit malicious user behavior (e.g., sending 100 tweets in one minute)?
How to block sensitive words in tweet?
How to design hash tag?
Functional Requirements
Post tweets: a registered user can post one or more tweets.
View user or home timeline.
Delete tweets: a user can delete one or more of their own tweets.
Follow or unfollow.
Like/dislike
Reply to tweet.
Retweet.
Search tweet.
Hashtags.
Do we support media types like video or images?
Non-functional Requirements
Highly available
Low latency
Read/write ratio: 10,000:1
Eventual consistency, availability > consistency.
Scale
QPS
100M active users -> 500M tweets per day:
Each tweet averages a fanout of 10 deliveries -> 5B total tweets delivered on fanout each day.
10B read requests per day -> ~10^5 QPS (10^10 / 86,400 ≈ 115K).
10B search per month.
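A quick back-of-envelope check of the estimates above (the inputs are this section's numbers; variable names are illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_day = 500_000_000                 # 500M writes/day
fanout = 10
deliveries_per_day = tweets_per_day * fanout  # 5B fanout deliveries/day

reads_per_day = 10_000_000_000               # 10B reads/day
read_qps = reads_per_day / SECONDS_PER_DAY    # ~115K, i.e. ~10^5 QPS
write_qps = tweets_per_day / SECONDS_PER_DAY  # ~5.8K QPS

print(f"read QPS ~ {read_qps:,.0f}, write QPS ~ {write_qps:,.0f}")
```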
Data
API
Data Schema
High Level Diagram
The client sends a request, either to post a tweet or to view the home timeline.
The web server reads from and writes to the DB.
The response is returned to the client.
Single point of failures:
Web servers
DB
Posting a tweet:
The user sends a post-tweet request to the load balancer, which routes it to one of the web servers.
The web server writes the record to the DB.
Viewing a tweet:
The user sends a read request for the home timeline.
The web server reads tweets from the DB.
The same tweets might be read multiple times by different followers.
This puts more read load on the DB, so instead we can fan out during the write phase when posting a tweet:
Each user's follower list is stored in the cache.
We push the tweet into each follower's timeline inbox in the cache.
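The fanout-on-write idea above can be sketched as follows; plain in-memory structures stand in for the Redis follower cache and timeline inboxes, and all names are illustrative:

```python
from collections import defaultdict, deque

TIMELINE_CAP = 800  # keep only the newest few hundred tweets per inbox

followers = defaultdict(set)  # user_id -> set of follower ids (cached)
inboxes = defaultdict(lambda: deque(maxlen=TIMELINE_CAP))  # follower -> tweet ids

def post_tweet(user_id, tweet_id, db):
    db.append((user_id, tweet_id))  # 1. durable write to the DB
    for f in followers[user_id]:    # 2. fan out into each follower's inbox
        inboxes[f].appendleft(tweet_id)

db = []
followers["alice"] = {"bob", "carol"}
post_tweet("alice", "t1", db)
```

In real Redis the inbox would be a capped list (push newest, trim to a few hundred entries).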
Posting a tweet:
The user sends a post-tweet request to the load balancer, which routes it to one of the web servers.
The web server writes the record to the DB.
The tweet is fanned out into each follower's timeline inbox in the cache.
Viewing a tweet:
The user sends a read request for the home timeline.
Read from Redis first to see if there are enough tweets there.
If not enough, read the remaining tweets from the DB.
The same tweets might still be read multiple times by different followers.
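A minimal sketch of this read path, assuming the cache is a dict of precomputed timelines and the DB is a list of (author, tweet_id) rows, newest last (all names illustrative):

```python
def home_timeline(user_id, following, cache, db, limit=20):
    # Serve from the cached inbox when it has enough tweets.
    cached = cache.get(user_id, [])
    if len(cached) >= limit:
        return cached[:limit]
    # Cache miss (or not enough entries): rebuild newest-first from
    # the DB rows of followed authors, then repopulate the cache.
    rebuilt = [tid for author, tid in reversed(db)
               if author in following[user_id]][:limit]
    cache[user_id] = rebuilt
    return rebuilt
```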
Pros:
We reduce the read load on the DB by doing the fanout at write time.
Cons:
If the tweet poster is a celebrity, the fanout overhead is huge.
We need Redis to scale.
Celebrity case:
Instead of fanning out, we keep celebrities' recent tweets in a separate Redis cache and merge them into followers' home timelines at read time.
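A sketch of this hybrid read, assuming tweets are (timestamp, tweet_id) pairs and all the data structures are illustrative stand-ins for Redis:

```python
def merged_timeline(user_id, inboxes, followed_celebs, celeb_tweets, limit=20):
    # Start from the precomputed (fanned-out) inbox for regular follows.
    candidates = list(inboxes.get(user_id, []))
    # Pull celebrity tweets at read time instead of fanning them out.
    for celeb in followed_celebs.get(user_id, ()):
        candidates.extend(celeb_tweets.get(celeb, ()))
    candidates.sort(reverse=True)  # (timestamp, tweet_id): newest first
    return candidates[:limit]
```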
Deep Dive
How to deal with thundering herd problem?
Cascading failure: one server crashes and the remaining servers must absorb its load; they may not be able to handle it either, so they crash one by one.
Rate limiting: Token Bucket/ Leaky Bucket/Sliding Window
For each user we maintain a token count, in an in-memory map or in Redis, keyed by user ID. Each request from that user decrements the count by one; when it reaches zero we return a temporary error. Tokens are refilled over time at a fixed rate.
This might be space-intensive if there are too many users.
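A minimal in-memory token bucket sketch; a production version would keep this state per user in Redis, and the capacity/rate numbers below are illustrative:

```python
import time

class TokenBucket:
    """Token bucket: `capacity` is the burst size, `rate` is tokens/second."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# e.g. at most ~100 tweets per minute per user
bucket = TokenBucket(capacity=100, rate=100 / 60)
```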
Context: a celebrity's tweet can trend suddenly, making the load difficult to predict beforehand; auto-scaling might not react fast enough either.
Redis
Keep only several hundred tweets for each home timeline in the Memory Cache
Keep only active users' home timeline info in the Memory Cache
If a user was not previously active in the past 30 days, we could rebuild the timeline from the SQL Database
Query the User Graph Service to determine who the user is following
Get the tweets from the SQL Database and add them to the Memory Cache
How to update cache?
cache aside
The application is responsible for reading from and writing to storage; the cache is checked first and populated lazily on misses.
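A minimal cache-aside sketch, with dicts standing in for the cache and DB (names are illustrative):

```python
def get_user(user_id, cache, db):
    # Cache-aside: the application checks the cache first...
    user = cache.get(user_id)
    if user is None:
        user = db[user_id]     # miss: read from the source of truth
        cache[user_id] = user  # ...and populates the cache itself
    return user
```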
write-through
The application uses the cache as its main data store, reading and writing through it without talking to the DB directly; the cache synchronously writes every update to the DB.
Pros: reads are fast.
Cons:
Write-through is slow overall because every write also hits the DB synchronously.
Most data written might never be read; this can be mitigated with a TTL.
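A write-through sketch under the same assumptions (a dict stands in for the DB; the synchronous DB write is what makes writes slow):

```python
class WriteThroughCache:
    """The application talks only to the cache; the cache synchronously
    writes each update to the DB before acknowledging."""
    def __init__(self, db):
        self.db = db
        self.data = {}

    def set(self, key, value):
        self.db[key] = value   # synchronous DB write (the slow part)
        self.data[key] = value

    def get(self, key):
        return self.data[key]  # reads are fast: always served from cache
```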
write-behind (write-back): the application writes to the cache, which asynchronously flushes updates to the DB; writes are fast, but data can be lost if the cache fails before flushing.
refresh-ahead: the cache proactively refreshes recently accessed entries before they expire.
How to do censorship for tweets?
We can train a model, or use an existing moderation model, to classify tweets.
An LLM API call (e.g., GPT-3.5)?
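For the simpler "block sensitive words" question, a naive wordlist filter looks like this (the banned list is a placeholder; a large list would call for Aho-Corasick, and contextual moderation needs an ML model):

```python
import re

BANNED_WORDS = {"badword1", "badword2"}  # placeholder list

def contains_sensitive(text, banned=BANNED_WORDS):
    # Tokenize to lowercase alphanumeric words, then check the banned set.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return any(t in banned for t in tokens)
```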