Linux Scheduler

An Event-Driven Perspective on Developing a New Linux Kernel Scheduler

Alex Jiakai XU, Ryan Wee

Last updated on Apr 8, 2025 4 min read

This guide explores an experimental Linux scheduler, that adopts a naive, non-preemptive Round-Robin approach to task management. Developed as part of a teaching exercise, the scheduler implements a simplified workflow where tasks are cycled through a runqueue without the typical preemptive interruptions found in production systems. In this setup, a task only yields the CPU upon the expiration of its allocated timeslice or through voluntary yielding, and the scheduler then reorders the tasks accordingly. The guide offers a detailed, event-driven perspective on how these operations are handled, making it a valuable resource for understanding core scheduling concepts in operating systems.

When a Task’s Timeslice Expires

The timer interrupt code calls scheduler_tick(), with HZ frequency.
This function subsequently calls the current task’s task_tick() function by curr->sched_class->task_tick().
In the implementation for the scheduler, this function decrements the task’s timeslice (using jiffies as the unit).
When the timeslice reaches zero, the following steps occur while the task remains on the CPU:
- The task is repositioned to the end of the runqueue.
- The task_tick() function calls resched_curr(), setting the TIF_NEED_RESCHED flag meaning the task needs to be rescheduled.
On subsequent interrupts or when returning to user space, the kernel checks for the TIF_NEED_RESCHED flag and, if set, invokes schedule(), which underlying invokes __schedule().
This huge if block is NOT executed.
- In particular, recall that the task_struct has a __state field. When a task is RUNNABLE, the state field is zero. When a task is not RUNNABLE, the state field is non-zero.
- In this case, the task is still RUNNABLE, i.e. prev_state is zero. So the if block is not executed.
__schedule() eventually calls pick_next_task() here.
Our implementation of pick_next_task() retrieves the task at the front of the runqueue.
Finally, the kernel performs a context switch via context_switch() here, transferring control to the newly selected task.

When a Task Voluntarily Yields While Remaining RUNNABLE

In scenarios where a task voluntarily yields the CPU but intends to remain in the runqueue, it signals its willingness to be moved to the back of the queue without fully relinquishing its eligibility for immediate rescheduling. This process unfolds as follows:

The task calls the sched_yield system call, which leads to the invocation of do_sched_yield().
do_sched_yield() then calls the task’s yield_task() function at current->sched_class->yield_task().
Within yield_task(), the task’s timeslice is set to zero.

On the next timer interrupt, the scheduler treats this as if the task’s timeslice has expired, preempting it accordingly and repositioning it in the runqueue.

When a Task Voluntarily Yields and Is Not RUNNABLE

When a task decides to yield the CPU and no longer wishes to remain in the runqueue—such as when entering a sleep state or upon completion of its execution—the procedure is slightly different:

The task alters its state to indicate that it is not RUNNABLE and calls __schedule() with a non-preemptive flag.
- In the case of task termination, the exit() system call is made, which leads to the invocation of do_exit() and subsequently do_task_dead(). The task state is updated accordingly before calling __schedule().
- For tasks entering a sleep state (e.g., via nanosleep() or similar functions), the task state is similarly updated here, and then invokes `schedule().
In both cases, the updated task state (non-zero, indicating non-RUNNABLE) triggers a branch in __schedule() that calls deactivate_task(). This function, in turn, invokes dequeue_task() implementation of the current scheduling class, which should remove the task from the runqueue.
Recall that we’re still within an earlier invocation of __schedule(). This invocation calls pick_next_task() and context_switch() as mentioned above.
With the task removed, the scheduler selects the next task from the runqueue to execute.

The code snippets and links in this post correspond to Linux v6.8.

I warmly welcome any bug reports or suggestions for improvement regarding this scheduler implementation. Your feedback is invaluable in refining the code and clarifying the underlying concepts. Additionally, I invite future operating systems course teaching teams to reference this page as a resource when preparing their own course materials. Collaboration and shared insights help us all enhance the educational experience.

Academic Teaching

Alex Jiakai XU

Computer Science Student

My research interests include computer systems, programming languages, software architecture, and cyberspace security.