Design Job Scheduler
Design a job scheduler that runs jobs at a scheduled interval.
Topics:
RDBMS vs NoSQL?
SQS vs Kafka?
How to handle at-least-once delivery?
How to make sure no two workers work on the same job concurrently? Task idempotency?
Execution Cap?
How to do prioritization? Use different queues?
Functional Requirements
Submit tasks: allow users to submit their tasks for execution.
Allocate resources: allocate the required resources to each task.
Remove tasks: allow users to cancel submitted tasks.
Monitor task execution: task execution should be adequately monitored, and a task should be rescheduled if it fails to execute.
Show task status: users can view the status of an executed job.
Users can schedule a cron job with a recurrence schedule.
For scheduled jobs, users can limit the maximum concurrency.
Support different languages.
Non-functional Requirements
Durability: a submitted job must not be lost.
Availability.
Scalability: should be able to schedule and execute an ever-increasing number of tasks per day.
Reliability: failed tasks must be retried.
Efficient resource utilization.
Release resources: after executing a task, the system should reclaim the resources assigned to it.
High Level Diagram
E2E
The user submits or queries a job through the API Gateway.
The request is persisted in the DB, and an acknowledgement is sent back to the user (see the sketch after this list).
The job executor service continuously polls due jobs from the DB and inserts entries into the queue.
The job executor service executes the business logic, writes the final result to the file system, and updates the status to COMPLETED.
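A minimal sketch of the submit path, using SQLite for brevity and assuming the jobs table laid out in the schema below; submit_job and its column names are illustrative, not part of the original design:

```python
import sqlite3
import uuid

# Hypothetical submit handler: the job is durably persisted before the
# acknowledgement is returned, so a submitted job cannot be lost even if
# downstream components crash.
def submit_job(conn: sqlite3.Connection, user_id: str, script_path: str,
               scheduling_type: str = "once") -> str:
    task_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO jobs (task_id, user_id, scheduling_type, script_path, "
        "execution_status) VALUES (?, ?, ?, ?, 'PENDING')",
        (task_id, user_id, scheduling_type, script_path),
    )
    conn.commit()  # ack only after the write is durable
    return task_id  # returned to the user as the acknowledgement
```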
Data Schema
| Field | Type | Description |
| --- | --- | --- |
| TaskID | String | Uniquely identifies each task |
| UserID | String | UUID of the submitting user |
| SchedulingType | String | {once, daily, weekly, monthly, annually} |
| TotalAttempts | Integer | Maximum number of retries if a task execution fails |
| ResourceRequirements | String | {Basic, Regular, Premium} |
| ExecutionCap | Time | Maximum time allowed for task execution |
| DelayTolerance | Time | How much delay we can sustain before starting a task |
| ScriptPath | String | Path of the script to execute; the script is a file placed in a file system |
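One possible relational layout for the schema above, sketched in SQLite. The execution_status column comes from the executor flow described later; claimed_by and next_run_at are assumptions added so the later sketches hang together:

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS jobs (
    task_id           TEXT PRIMARY KEY,
    user_id           TEXT NOT NULL,
    scheduling_type   TEXT NOT NULL,           -- once/daily/weekly/monthly/annually
    total_attempts    INTEGER DEFAULT 3,
    resource_req      TEXT DEFAULT 'Basic',    -- Basic/Regular/Premium
    execution_cap_s   INTEGER,                 -- max execution time, in seconds
    delay_tolerance_s INTEGER,                 -- acceptable start delay, in seconds
    script_path       TEXT NOT NULL,
    execution_status  TEXT DEFAULT 'PENDING',  -- PENDING/CLAIMED/PROCESSING/COMPLETED
    claimed_by        TEXT,                    -- assumption: worker that claimed the job
    next_run_at       INTEGER                  -- assumption: UNIX time of next due run
);
-- The due-job poll scans by time and status, so index both.
CREATE INDEX IF NOT EXISTS idx_jobs_due ON jobs (next_run_at, execution_status);
"""

def init_db(path: str = "scheduler.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(DDL)
    return conn
```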
Deep Dive
Job Scheduling Flow
Every X minutes, the master node creates an authoritative UNIX timestamp and assigns a shard_id and schedule_job_execution_time to each worker.
Each worker node then executes its DB query and pushes due jobs into the Kafka queue for execution (see the polling sketch below).
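A sketch of one worker's polling cycle, assuming the kafka-python client, the jobs table above, and hash-based sharding by task_id; the topic name and sharding scheme are illustrative:

```python
import json
import sqlite3
import zlib

from kafka import KafkaProducer  # assumes the kafka-python client

def poll_and_enqueue(conn: sqlite3.Connection, producer: KafkaProducer,
                     shard_id: int, num_shards: int, cutoff_ts: int,
                     topic: str = "due-jobs") -> int:
    """Push this shard's due jobs onto Kafka; cutoff_ts is the master's
    authoritative schedule_job_execution_time."""
    rows = conn.execute(
        "SELECT task_id, script_path FROM jobs "
        "WHERE next_run_at <= ? AND execution_status = 'PENDING'",
        (cutoff_ts,),
    ).fetchall()
    pushed = 0
    for task_id, script_path in rows:
        # crc32 is stable across processes (unlike Python's salted hash()),
        # so every worker agrees on which shard owns which task.
        if zlib.crc32(task_id.encode()) % num_shards != shard_id:
            continue
        producer.send(topic, json.dumps(
            {"task_id": task_id, "script_path": script_path}).encode())
        pushed += 1
    producer.flush()  # ensure the batch reached the broker before returning
    return pushed
```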
Fault-tolerance
The master monitors worker health, knows which worker is dead, and re-assigns that worker's query to a live worker.
If the master dies, another worker node is promoted to master (automatic failover).
Introduce a local DB to track whether a worker has queried the DB and put the entry into the queue (a heartbeat sweep sketch follows this list).
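A minimal heartbeat sweep the master could run against the local DB; the heartbeats table and the staleness threshold are assumptions for illustration:

```python
import sqlite3
import time

STALE_AFTER_S = 30  # illustrative threshold, not from the original design

def find_dead_workers(local_db: sqlite3.Connection) -> list:
    """Return workers whose last heartbeat is stale; the master re-assigns
    their shard queries to live workers."""
    cutoff = int(time.time()) - STALE_AFTER_S
    rows = local_db.execute(
        "SELECT worker_id FROM heartbeats WHERE last_seen < ?", (cutoff,)
    ).fetchall()
    return [worker_id for (worker_id,) in rows]
```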
Job Executor Flow
When a job is picked up from the queue, the consumer's master updates the jobs DB attribute execution_status = CLAIMED.
When a worker process picks up the work, it updates execution_status = PROCESSING and continuously sends health checks to the local DB.
Upon completion of a job, the worker process pushes the result to S3, updates the jobs DB execution_status = COMPLETED, and records the status in the local DB.
Both the worker processes and the master write their health checks to the local database (see the claim sketch below).
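A sketch of the claim step, assuming the jobs table above; claimed_by is an assumed column. The conditional UPDATE is what keeps two consumers that receive the same message under Kafka's at-least-once delivery from both executing the job:

```python
import sqlite3

def try_claim(conn: sqlite3.Connection, task_id: str, worker_id: str) -> bool:
    """Atomically flip PENDING -> CLAIMED; exactly one claimer wins."""
    cur = conn.execute(
        "UPDATE jobs SET execution_status = 'CLAIMED', claimed_by = ? "
        "WHERE task_id = ? AND execution_status = 'PENDING'",
        (worker_id, task_id),
    )
    conn.commit()
    return cur.rowcount == 1  # losers see rowcount 0 and drop the message
```

Because the claim is keyed on task_id rather than on the queue message, a redelivered duplicate becomes a no-op, which is one way to get task idempotency.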