Design YouTube/Netflix
Topics:
How to design the schema for video segments so that they can be partitioned?
CDN and Redis cache usage?
How to achieve low latency?
How to de-duplicate videos?
How to implement search?
Mention Vitess database abstraction layer.
Adaptive streaming?
How to do video recommendation?
Functional Requirement
Users can view videos.
Users can see a list of recommended videos on the homepage.
Users can search for videos by keyword.
Users can upload videos.
Optional:
Uploaded videos can be reviewed and censored.
Users can like and dislike videos.
Users can add comments to videos.
Non-functional Requirement:
Low latency
Highly available
Scalable
Fault tolerant
Availability is prioritized over consistency
Read-heavy: reads far outnumber writes
High Level Design
The client sends a GET request to the API Gateway to fetch videos.
The request is routed to the Video service.
The Video service determines the user ID for the client and fetches recommended videos from the DB.
The Video service sends back a response containing a list of videos.
The client displays the videos on the client page.
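The request flow above can be sketched in a few lines. This is only an illustration: the in-memory dictionaries stand in for a session store and the recommendations DB, and the names SESSIONS, RECOMMENDATIONS, api_gateway and video_service_get_recommended are hypothetical, not part of any real service.

```python
from dataclasses import dataclass


@dataclass
class Video:
    video_id: str
    title: str


# Hypothetical in-memory stand-ins for the session store and the DB.
SESSIONS = {"token-abc": "user-1"}
RECOMMENDATIONS = {
    "user-1": [Video("v1", "Intro to CDNs"), Video("v2", "Sharding 101")],
}


def video_service_get_recommended(session_token: str) -> dict:
    """Video service: resolve the user ID, then fetch recommended videos."""
    user_id = SESSIONS.get(session_token)
    if user_id is None:
        return {"status": 401, "videos": []}
    videos = RECOMMENDATIONS.get(user_id, [])
    return {
        "status": 200,
        "videos": [{"id": v.video_id, "title": v.title} for v in videos],
    }


def api_gateway(method: str, path: str, session_token: str) -> dict:
    """API Gateway: route GET /videos to the Video service."""
    if method == "GET" and path == "/videos":
        return video_service_get_recommended(session_token)
    return {"status": 404, "videos": []}


# The client sends the request and renders the returned list.
response = api_gateway("GET", "/videos", "token-abc")
```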
API
Data Schema
Scale
2B DAU. How many new videos per day?
QPS: 20B requests per day = 20*10^9 / 10^5 (a day is 86,400 s, rounded up to 10^5) = 2*10^5 = 200k QPS
Storage:
Avg video length: 5 mins
Size before compression: 600 MB
Size after compression: 30 MB
500 hrs of video are uploaded every minute
6 MB to store a minute of video (30 MB / 5 min)
Total = 6 MB * 500 hr * 60 min/hr = 180,000 MB per minute = 180 GB per minute; 180 GB * 60 * 24 = 259,200 GB ≈ 259 TB per day; 259.2 TB * 365 ≈ 94.6 PB per year.
Bandwidth
500 hr/min * 60 min/hr * 120 MB/min * 8 bits/byte / 60 s = 480 Gbps (120 MB/min is the uncompressed rate, 600 MB / 5 min)
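The back-of-the-envelope numbers in this section can be re-derived with a short script. It uses only the figures stated in these notes (20B requests per day, 500 hours uploaded per minute, 600 MB raw and 30 MB compressed for a 5-minute video); the variable names are illustrative.

```python
SECONDS_PER_DAY_ROUNDED = 10**5  # 86,400 rounded up for easy arithmetic

# QPS
requests_per_day = 20 * 10**9
qps = requests_per_day // SECONDS_PER_DAY_ROUNDED  # 200,000

# Per-minute sizes of one minute of video
compressed_mb_per_video_minute = 30 / 5   # 6 MB
raw_mb_per_video_minute = 600 / 5         # 120 MB

# Upload volume: 500 hours of video arrive every wall-clock minute
video_minutes_per_minute = 500 * 60       # 30,000 minutes of video

# Storage (decimal units: 1 GB = 1000 MB)
storage_gb_per_minute = video_minutes_per_minute * compressed_mb_per_video_minute / 1000
storage_tb_per_day = storage_gb_per_minute * 60 * 24 / 1000
storage_pb_per_year = storage_tb_per_day * 365 / 1000

# Ingest bandwidth for the uncompressed uploads, in Gbps
ingest_gbps = video_minutes_per_minute * raw_mb_per_video_minute * 8 / 60 / 1000

print(f"QPS: {qps:,}")
print(f"Storage: {storage_gb_per_minute:.0f} GB/min, "
      f"{storage_tb_per_day:.1f} TB/day, {storage_pb_per_year:.1f} PB/year")
print(f"Ingest bandwidth: {ingest_gbps:.0f} Gbps")
```

The yearly storage figure comes out in petabytes, not terabytes: about 259 TB per day adds up to roughly 94.6 PB per year.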
Use DynamoDB to serve the large read volume; data can be partitioned easily with a well-designed key schema.
DynamoDB limitations: 1,000 WCU/s and 3,000 RCU/s per partition.
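One possible key design for video segments, sketched under assumptions: all segments of one rendition of a video share a partition key, and a zero-padded segment index is the sort key so that segments can be read in playback order with a single range query. The attribute names (pk, sk, sk_between) and key formats are hypothetical, and the functions below only build dictionaries; they do not call DynamoDB.

```python
def segment_key(video_id: str, quality: str, segment_index: int) -> dict:
    """Build the (partition key, sort key) pair for one video segment."""
    return {
        "pk": f"VIDEO#{video_id}#{quality}",
        # Zero-padding makes lexicographic order match numeric order.
        "sk": f"SEG#{segment_index:08d}",
    }


def segment_range(video_id: str, quality: str, first: int, last: int) -> dict:
    """Describe a range read: one partition, contiguous sort keys."""
    return {
        "pk": f"VIDEO#{video_id}#{quality}",
        "sk_between": (f"SEG#{first:08d}", f"SEG#{last:08d}"),
    }


keys = [segment_key("v42", "720p", i) for i in (10, 2, 1)]
ordered = sorted(keys, key=lambda k: k["sk"])
```

A trade-off of this layout is that a very popular video concentrates reads on a few partitions, which is where the per-partition read limit starts to matter and a cache or CDN in front of the table helps.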
Use a Redis cache to store frequently read data, for example viral videos posted by big influencers.
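A common way to put a cache in front of the database is the cache-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache with an expiry. The sketch below uses a small in-memory class as a stand-in for Redis so that it runs without a server; TTLCache, load_video_metadata_from_db and the 60-second TTL are assumptions for illustration.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-memory stand-in for a Redis GET / SET-with-expiry pair."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entry behaves like a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)


DB_READS = {"count": 0}  # counts how often the (hypothetical) DB is hit


def load_video_metadata_from_db(video_id: str) -> str:
    DB_READS["count"] += 1
    return f"metadata-for-{video_id}"


def get_video_metadata(cache: TTLCache, video_id: str, ttl_seconds: float = 60.0) -> str:
    """Cache-aside read: cache first, DB on a miss, then fill the cache."""
    key = f"video:{video_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_video_metadata_from_db(video_id)
    cache.set(key, value, ttl_seconds)
    return value
```

With this pattern a viral video is read from the database roughly once per TTL window instead of once per request.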
How to deduplicate video?
Assume 50 out of every 500 hours of video uploaded to YouTube are duplicates. Given that one minute of video requires 6 MB of storage, the duplicated content takes up the following storage space:
50 hr * 60 min/hr * 6 MB/min = 18 GB per minute of uploads; 18 GB * 60 * 24 ≈ 25.9 TB per day; 25.9 TB * 365 ≈ 9.5 PB per year.
If we avoid video duplication, we can save up to 9.5 petabytes of storage space per year.
There is also the copyright issue: no content creator wants their content plagiarized.
Options:
Locality-sensitive hashing.
Block matching algorithms, phase correlation
AI
Example: Ateliere describes its proprietary FrameDNA™ AI/ML technology as fingerprinting each frame on ingest, which allows video files to be compared accurately. According to the vendor, this identifies duplicate content, detects alterations or tampering within video files, and supports storage management that keeps only relevant, original content, reducing unnecessary duplication.
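To make the locality-sensitive hashing option concrete, here is a minimal sketch of random-hyperplane LSH with banding. It assumes each video has already been reduced to a numeric feature vector (for example, averaged frame features); how that vector is produced is out of scope here. The vectors, NUM_BITS, BAND_SIZE and all function names are illustrative, and this is not the method of any product mentioned in these notes.

```python
import random
from collections import defaultdict

NUM_BITS = 64   # signature length (assumed value)
BAND_SIZE = 8   # bits per band (assumed value)


def make_hyperplanes(dim, num_bits=NUM_BITS, seed=42):
    """Random Gaussian hyperplanes; a fixed seed keeps signatures comparable."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(features, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(f * h for f, h in zip(features, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of differing bits; small distance suggests similar vectors."""
    return sum(x != y for x, y in zip(a, b))


def candidate_pairs(signatures, band_size=BAND_SIZE):
    """Return pairs of video IDs that share at least one band (LSH bucket)."""
    buckets = defaultdict(set)
    for video_id, sig in signatures.items():
        for start in range(0, len(sig), band_size):
            buckets[(start, sig[start:start + band_size])].add(video_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Synthetic feature vectors: a uniformly scaled copy points in the same
# direction as the original; the negated vector stands in for unrelated content.
original = [0.9, 0.1, 0.4, 0.7, 0.2, 0.8, 0.3, 0.5]
planes = make_hyperplanes(len(original))
sigs = {
    "original": signature(original, planes),
    "scaled_copy": signature([2 * x for x in original], planes),
    "unrelated": signature([-x for x in original], planes),
}
pairs = candidate_pairs(sigs)
```

Only videos that land in a shared bucket need the expensive frame-by-frame comparison, which is what keeps de-duplication tractable at upload volumes like the ones estimated above.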
Adaptive Streaming
While the content is being served, the bandwidth of the user is also monitored. Since the video is divided into chunks encoded at different qualities, each chunk can be served at a quality that matches the current network conditions.
The adaptive bitrate algorithm can be based on four parameters:
End-to-end available bandwidth (from a CDN/servers to a specific client)
The device capabilities of the user.
Encoding techniques used.
The buffer space at the client.
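A simplified selection rule using the four parameters above might look like the following. The encoding ladder, the 0.8 safety factor, and the 5-second low-buffer threshold are assumed values chosen for illustration; real players use more elaborate heuristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    height: int
    bitrate_kbps: int


# Hypothetical encoding ladder, ordered from lowest to highest quality.
LADDER = [
    Rendition("240p", 240, 400),
    Rendition("480p", 480, 1200),
    Rendition("720p", 720, 2800),
    Rendition("1080p", 1080, 5000),
]


def choose_rendition(bandwidth_kbps, device_max_height, buffer_seconds,
                     low_buffer_seconds=5.0, safety_factor=0.8):
    """Pick the best rendition the network, device, and buffer allow."""
    # Be more conservative when the client buffer is nearly empty.
    factor = safety_factor * (0.5 if buffer_seconds < low_buffer_seconds else 1.0)
    budget_kbps = bandwidth_kbps * factor
    best = LADDER[0]  # always fall back to the lowest quality
    for rendition in LADDER:
        if rendition.height <= device_max_height and rendition.bitrate_kbps <= budget_kbps:
            best = rendition
    return best
```

The client would call this before requesting each chunk, so the quality can step up or down as the measured bandwidth and buffer level change.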
Recommendation
YouTube recommends videos to users based on their profiles, taking into account factors such as their interests, view and search history, subscribed channels, topics related to content they have already viewed, and activity on content such as comments and likes.
YouTube filters videos in two phases:
Candidate generation: millions of YouTube videos are filtered down to hundreds based on the user's history and current context.
Ranking: The ranking phase scores videos based on their features and on the user's interests and history. Hundreds of videos are ranked and filtered down to a few dozen during this phase.
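The two-phase structure can be illustrated with a toy pipeline. This is not YouTube's model: candidate generation here is a plain topic-overlap filter and ranking is a hand-weighted score, and the field avg_watch_fraction and the 0.5/0.5 weights are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CatalogVideo:
    video_id: str
    topics: set
    avg_watch_fraction: float  # hypothetical quality signal in [0, 1]


def generate_candidates(catalog, user_topics, limit=100):
    """Phase 1: cheap filter that keeps videos sharing a topic with the user."""
    return [v for v in catalog if v.topics & user_topics][:limit]


def rank(candidates, user_topics, top_k=2):
    """Phase 2: score the smaller candidate set and keep the best top_k."""
    def score(v):
        overlap = len(v.topics & user_topics) / max(len(v.topics), 1)
        return 0.5 * overlap + 0.5 * v.avg_watch_fraction
    return sorted(candidates, key=score, reverse=True)[:top_k]


catalog = [
    CatalogVideo("a", {"cooking"}, 0.9),
    CatalogVideo("b", {"cooking", "travel"}, 0.6),
    CatalogVideo("c", {"python"}, 0.4),
    CatalogVideo("d", {"sports"}, 0.99),
]
user_topics = {"cooking", "python"}
candidates = generate_candidates(catalog, user_topics)
recommended = rank(candidates, user_topics)
```

The point of the split is cost: the first phase must be cheap enough to run over the whole catalog, while the second can afford a richer score because it only sees the survivors.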
Collaborative Filtering
A technique used in recommendation systems that predicts a user's interests based on the preferences of many users.
User-based collaborative filtering: this approach recommends items by finding similar users. For example, if user X likes items A, B, and C and user Y likes items A, B, and D, the system infers that X might also like item D because Y likes it.
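The X and Y example can be reproduced with a small user-based collaborative filtering sketch. It assumes Jaccard similarity over liked-item sets as the similarity measure, which is one simple choice among several; the function names are illustrative.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity of two users: shared likes divided by all likes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target: str, likes: dict, top_k: int = 5) -> list:
    """Suggest items liked by similar users that the target has not liked."""
    scores = {}
    for other, items in likes.items():
        if other == target:
            continue
        similarity = jaccard(likes[target], items)
        if similarity == 0:
            continue  # ignore users with nothing in common
        for item in items - likes[target]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_k]


likes = {
    "X": {"A", "B", "C"},
    "Y": {"A", "B", "D"},
    "Z": {"E"},
}
suggestions_for_x = recommend("X", likes)
```

Here X and Y share A and B, so Y's remaining item D is suggested to X, while Z has no overlap with X and contributes nothing.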
https://blog.hootsuite.com/how-the-youtube-algorithm-works/
2005-2011: Optimizing for clicks & views
2012: Optimizing for watch time
2015-2016: Optimizing for satisfaction: shares, likes and dislikes, and the "not interested" button.
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf