Futuristic autonomous vehicle navigating a virtual cityscape, symbolizing GPU scheduling and virtualization.

Unlocking the Secrets of GPU Scheduling: How Virtualization is Revolutionizing Autonomous Driving

"Dive into the inner workings of NVIDIA's GPU scheduling on Drive PX platforms and explore how virtualization can enable real-time performance in autonomous vehicles."


The race to fully autonomous vehicles is fueled by advanced computing platforms, with Graphics Processing Units (GPUs) at the forefront. These GPUs offer the massively parallel processing power required for complex tasks such as real-time object detection, path planning, and sensor fusion. To ensure the safety and reliability of autonomous driving systems, it's critical to have GPU scheduling approaches that provide strong real-time guarantees. This means ensuring that critical tasks are completed within strict time constraints, regardless of other system activities.

Previous research has focused on reverse engineering the GPU ecosystem to understand and control GPU scheduling on NVIDIA platforms. However, this article offers an in-depth look at NVIDIA's standard approach to GPU application scheduling on a Drive PX platform, providing valuable insights into the inner workings of this complex system. Furthermore, we'll explore how a privileged scheduling server can be used to enforce custom scheduling policies in a virtualized environment, opening up new possibilities for real-time GPU performance.

Advanced Driver-Assistance Systems (ADAS) rely heavily on integrated GPUs, shared across various applications with different timing requirements. We'll examine NVIDIA's GPU scheduling approach for graphics and compute tasks on the Drive PX-2 'AutoCruise' platform. This board features a Tegra Parker SoC with a hexa-core CPU and an integrated GPU (gp10b) based on the Pascal architecture, with two Streaming Multiprocessors (SMs) of 128 CUDA cores each.

GPU Scheduling: A Deep Dive


The NVIDIA GPU scheduler uses a hardware controller embedded within the GPU, called the 'Host.' This component dispatches work to the GPU engines (Copy, Compute, Graphics) in a round-robin manner, asynchronously and in parallel with the CPU. The Host scheduler manages channels, which are independent streams of work belonging to user-space applications. Channels are transparent to programmers, who use APIs (CUDA, OpenGL) to specify GPU workloads.

Workloads consist of sequences of GPU commands inserted into a Command Push Buffer, a memory region written by the CPU and read by the GPU. Channels are linked to the applications' Command Push Buffers. Each channel has a timeslice value for timesharing the GPU. Context switches occur when a channel's work is done or its timeslice expires; the Host then dispatches work from the next channel on a list called the runlist. Each channel is characterized by the following parameters:
  • Timeslice Length: The duration a channel can execute before preemption.
  • Interleaving Level: The number of times a channel appears in the runlist.
  • Preemption Policy: Determines if a channel can be preempted.
  • Channel Establishment: Channels are established when the application launches.
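
To make the channel abstraction concrete, here is a minimal sketch of the state a channel carries. The structure and field names are hypothetical, chosen for illustration, and are not taken from the L4T driver sources:

    /* Hypothetical model of a GPU channel; names are illustrative only. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Command Push Buffer: a ring written by the CPU and read by the GPU. */
    struct push_buffer {
        uint64_t base;   /* GPU-visible address of the buffer        */
        uint32_t put;    /* CPU write pointer (last command inserted) */
        uint32_t get;    /* GPU read pointer (next command to fetch)  */
    };

    /* One channel: an independent stream of work for a user-space application. */
    struct channel {
        struct push_buffer pb;     /* commands submitted by the application  */
        uint32_t timeslice_us;     /* how long it may run before preemption  */
        uint32_t interleave_level; /* how many runlist entries it receives   */
        bool     preemptible;      /* whether it can be preempted at all     */
    };

    /* A channel has pending work when the CPU has written past the point
     * the GPU has already consumed in its Command Push Buffer. */
    static bool channel_has_work(const struct channel *ch)
    {
        return ch->pb.put != ch->pb.get;
    }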
The GPU Host uses a list-based scheduling policy, checking each channel for work as it browses the runlist. Each application has a number of runlist entries proportional to its interleaving level. For each entry, the scheduler checks whether the corresponding Command Push Buffer contains work. If it does, the channel is scheduled until completion or timeslice expiration, at which point it is preempted and later resumed. If it does not, the scheduler skips ahead to the next application's channels. An open-source version of the runlist construction algorithm is available in the NVIDIA kernel driver stack (L4T, Linux for Tegra).
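
The scan itself can be pictured as a simple software model. The following sketch reuses the channel structure above; build_runlist() and run_until() are illustrative stand-ins for the driver's runlist construction and the hardware's execution of a channel, not the actual L4T code:

    /* Build a runlist in which each channel appears as many times as its
     * interleaving level, so higher levels get more scheduling opportunities. */
    static int build_runlist(struct channel *chans, int n,
                             struct channel **runlist, int max_entries)
    {
        int entries = 0;
        for (int i = 0; i < n; i++)
            for (uint32_t k = 0; k < chans[i].interleave_level; k++)
                if (entries < max_entries)
                    runlist[entries++] = &chans[i];
        return entries;
    }

    enum run_result { WORK_DONE, TIMESLICE_EXPIRED };

    /* Stand-in for the hardware: run a channel's pending commands until they
     * finish or its timeslice expires. Here we simply mark the work consumed. */
    static enum run_result run_until(struct channel *ch, uint32_t timeslice_us)
    {
        (void)timeslice_us;
        ch->pb.get = ch->pb.put;
        return WORK_DONE;
    }

    /* The Host scans the runlist round-robin, forever: idle channels are
     * skipped; channels with work run until completion or timeslice expiry,
     * in which case they are preempted and resumed at a later entry. */
    static void host_scan(struct channel **runlist, int entries)
    {
        for (int i = 0; entries > 0; i = (i + 1) % entries) {
            struct channel *ch = runlist[i];
            if (channel_has_work(ch))
                run_until(ch, ch->timeslice_us);
        }
    }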

The Future of GPU Virtualization

NVIDIA's GPU virtualization technology enables multiple guests to run and access the GPU engines via a privileged hypervisor guest, the RunList Manager (RLM). Guests interact with the RLM server for channel allocation, scheduling, memory management, and runlist construction. Future work involves modifying the GPU-to-RLM communication so that the RLM can intercept command submissions, define software scheduling policies, and enforce them by constructing runlists containing the channels of the scheduled applications. This enables testing event-based approaches that offer stronger real-time guarantees than NVIDIA's interleaved scheduler. Preliminary results from an Earliest Deadline First with Constant Bandwidth Server (EDF+CBS) prototype show significant improvements in schedulability and Worst-Case Response Time (WCRT).
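
As a rough illustration of the kind of policy such a server could enforce, the sketch below combines an EDF pick with CBS budget replenishment. All structures and names are hypothetical and are not taken from the prototype mentioned above; budget accounting during execution is omitted for brevity:

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-application scheduling state as an RLM-like server might track it. */
    struct sched_entity {
        bool     has_work;      /* pending command submissions intercepted by the server */
        uint64_t deadline_us;   /* absolute CBS deadline                                  */
        int64_t  budget_us;     /* remaining CBS budget (consumed while running on GPU)   */
        uint64_t max_budget_us; /* CBS maximum budget Q                                   */
        uint64_t period_us;     /* CBS period T                                           */
    };

    /* CBS rule: when the budget is exhausted, replenish it and postpone the
     * deadline by one period, bounding each entity's GPU bandwidth to Q/T. */
    static void cbs_replenish(struct sched_entity *e, uint64_t now_us)
    {
        if (e->budget_us <= 0) {
            e->budget_us   = (int64_t)e->max_budget_us;
            e->deadline_us = now_us + e->period_us;
        }
    }

    /* EDF rule: among entities with pending work, pick the earliest absolute
     * deadline. The server would then build a runlist containing only the
     * channels of the selected application. */
    static struct sched_entity *edf_pick(struct sched_entity *es, int n, uint64_t now_us)
    {
        struct sched_entity *best = NULL;
        for (int i = 0; i < n; i++) {
            cbs_replenish(&es[i], now_us);
            if (es[i].has_work && (!best || es[i].deadline_us < best->deadline_us))
                best = &es[i];
        }
        return best;
    }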
