| Title | : | Parallel Architectures For Priority Scheduling In Programmable Packet Schedulers |
| Speaker | : | Nikhil Shinde (IIT Madras) |
| Details | : | Thu, 29 Jan, 2026 10:00 AM @ SSB334 |
| Abstract: | : | Programmable data-plane (PDP) switches enable flexible packet processing but remain constrained in scheduling capabilities, particularly for priority-based policies that require strict ordering of packets according to user-defined ranks. Push-In First-Out (PIFO) provides an ideal abstraction for programmable packet scheduling; however, its requirement for line-rate packet sorting makes it impractical at high network speeds (100G and beyond). PIFO is a priority queue that enables ordering enqueued packets based on their priority. Alternate accurate schedulers like SIFTER, involve complex packet movement between the scheduling queues. Approximate schedulers such as SP-PIFO and AIFO, RIFO reduce complexity but introduce priority inversions, degrading latency, fairness, and flow completion times (FCTs) as they rely on First-In First-Out (FIFO) queues or use Active Queue Management (AQM) as a substitute to scheduling. To address the scalability limitations of accurate schedulers while avoiding the correctness issues of approximate schedulers, this work proposes two architectures -- called P3PO and PPIFO -- that are scalable, inversion-free and parallelize the sorting workload across multiple priority queues. P3PO and PPIFO decompose a large global priority queue into multiple smaller, bounded-capacity priority queues that operate in a parallel manner. P3PO uses a cascaded set of priority queues with priority-based bounds to ensure that packets are inserted into the earliest queue capable of maintaining ordering; overflow is handled by cascading evictions into lower-priority queues. Under-utilization of a queue triggers promotion of packets from lower priority queues to higher priority queues to maintain availability for timely dequeue. PPIFO distributes packets across multiple independent priority queues based on occupancy, using a de-multiplexer for balanced enqueue and a multiplexer for globally selecting the highest-priority packet at dequeue. Both designs ensure bounded sorting complexity, preserve strict priority ordering, and eliminate priority inversions. The architectures are implemented in the NetBench packet-level simulator, and evaluated using empirical datacenter workloads (Web-search and Data Mining) under in-cast and all-to-all traffic patterns. Experiments are conducted on star and leaf-spine topologies using Start-Time Fair Queueing (STFQ) and pFabric-style ranking. Results show that P3PO achieves fairness and flow completion time (FCT) performance comparable to ideal PIFO and performs better than approximate schedulers. We show that P3PO and PPIFO have less packet comparisons as compared to ideal PIFO, thus reducing the sorting latency. For flows under 2 MB, P3PO improves FCTs by 14–62%, and for flows exceeding 2 MB by 47–60% as compared to approximate schedulers. Packet drops are reduced by 66–95% due to improved queue utilization and elimination of inversion-driven tail latencies against approximate schedulers. |
