Rui Ning 「宁锐」

Email: rning [at] smail.nju.edu.cn

I am currently an undergraduate in the Software Institute at Nanjing University, working at the intersection between datacenter network and machine learning system.

I was fortunate to be advised by Prof. Chen Tian. My research has received recognition through one Best Student Paper Award at top-tier system conferences.

Research Highlights

My research mainly focuses on building datacenter network system and machine learning system for large-scale machine learning workloads, with a recent emphasis on large language models (LLMs). In particular, I'm attempting to bridge the performance gap between emerging machine learning applications (e.g., GenAI) and datacenter network system.

^★ indicates projects where I am the project lead

Datacenter Network Architecture:

PortFC [ICS’25]^★ delivers deadlock-free, egress-port based flow control algorithm that mitigates head-of-line (HoL) blocking.
PrioPlus [EuroSys’25, 🏆 Best Student Paper] pioneers a drop-in enhancement to congestion control algorithm, enabling virtual priority in datacenter networks.

LLM Inference Acceleration:

FlashForge [arXiv’25]^★ proposes a prefix-aware CUDA attention kernel to coalesce memory accesses arising from shared prefixes in the decoding stage.

News

[Jun 2025] Travel: I will be attending ACM ICS’25 June 8-11 in Salt Lake City, Utah. Feel free to reach out if you want to chat!
[Apr 2025] Paper: Our PortFC paper has been accepted to ACM ICS’25!
[Apr 2025] Award: Our PrioPlus paper has received the Best Student Paper Award at ACM EuroSys'25.
[Jan 2025] Paper: Our PrioPlus paper has been accepted to ACM EuroSys’25!

Publications

Designing High-performance Deadlock-free BCube Networks
Peirui Cao, Rui Ning, Hongwei Yang, Zhaochen Zhang, Chang Liu, Rui Li, Yongqi Yang, Yunzhuo Liu, Chengyuan Huang, Tao Sun, Xiaodong Duan, Guihai Chen, Chen Tian
ACM ICSACM International Conference on Supercomputing, 2025 | [abs] | [bib]

                    BCube is a modular data center network. Compared with other topologies, BCube has natural advantages, such as lower deployment costs and stronger failure recovery capabilities. However, RDMA technology used in BCube still faces challenges, including high retransmission overhead, Head-of-Line Blocking (HoLB) and deadlock problems. Existing solutions for traditional data centers cannot simultaneously address these issues due to the unique topology and server transmission characteristics of BCube. In this paper, we propose a per-port flow control named PortFC for BCube. PortFC addresses the above problems through the designs of a Pause/Resume control signal, a per-port queue allocation method, an egress-detecting per-port flow control mechanism, and a server-aware queue scheduling method. Our evaluation shows that PortFC is free from retransmission, capable of eliminating HoLB and avoiding deadlocks. PortFC achieves 1.7-8.0 times higher throughput and reduces latency by 11.7%-87.7% compared to the state-of-the-art lossy RDMA based on IRN and the lossless RDMA method based on PFC.

                    @inproceedings{10.1145/3721145.3725749,
					author = {Cao, Peirui and Ning, Rui and Yang, Hongwei and Zhang, Zhaochen and Liu, Chang and Li, Rui and Yang, Yongqi and Liu, Yunzhuo and Huang, Chengyuan and Sun, Tao and Duan, Xiaodong and Chen, Guihai and Tian, Chen},
					title = {PortFC: Designing High-performance Deadlock-free BCube Networks},
					year = {2025},
					isbn = {9798400715372},
					publisher = {Association for Computing Machinery},
					address = {New York, NY, USA},
					url = {https://doi.org/10.1145/3721145.3725749},
					doi = {10.1145/3721145.3725749},
					booktitle = {Proceedings of the 39th ACM International Conference on Supercomputing},
					pages = {1052–1063},
					numpages = {12},
					keywords = {Data Center Networks, Flow Control, Modular Data Center},
					series = {ICS '25}
					}

🏆 Enabling Virtual Priority in Data Center Congestion Control
Zhaochen Zhang, Feiyang Xue, Keqiang He, Zhimeng Yin, Gianni Antichi, Jiaqi Gao, Yizhi Wang, Rui Ning, Haixin Nan, Xu Zhang, Peirui Cao, Xiaoliang Wang, Wanchun Dou, Guihai Chen, Chen Tian
ACM EuroSysACM European Conference on Computer Systems, 2025 [Best Student Paper Award] | [abs] | [bib] |

In data center networks, various types of traffic with strict performance requirements operate simultaneously, necessitating effective isolation and scheduling through priority queues. However, most switches support only around ten priority queues. Virtual priority can address this limitation by emulating multi-priority queues on a single physical queue, but existing solutions often require complex switch-level scheduling and hardware changes. Our key insight is that virtual priority can be achieved by carefully managing bandwidth contention in a physical queue, which is traditionally handled by congestion control (CC) algorithms. Hence, the virtual priority mechanism needs to be tightly coupled with CC. In this paper, we propose PrioPlus, a CC enhancement algorithm that can be integrated with existing congestion control schemes to enable virtual priority transmission. PrioPlus assigns specific delay ranges to different priority levels, ensuring that flows transmit only when the delay is within the assigned range, effectively meeting virtual priority requirements. Compared to Swift CC with physical priority queues, PrioPlus provides strict priority for high-priority flows without impacting performance sensibly. Meanwhile, it benefits low-priority flows from 25% to 41% as its priority-aware design enhances CC's ability to fully utilize available bandwidth once higher-priority traffic completes. As a result, in coflow and model training scenarios, PrioPlus improves job completion times by 21% and 33%, respectively, compared to Swift with physical priority queues.

                    @inproceedings{10.1145/3689031.3717463,
					author = {Zhang, Zhaochen and Xue, Feiyang and He, Keqiang and Yin, Zhimeng and Antichi, Gianni and Gao, Jiaqi and Wang, Yizhi and Ning, Rui and Nan, Haixin and Zhang, Xu and Cao, Peirui and Wang, Xiaoliang and Dou, Wanchun and Chen, Guihai and Tian, Chen},
					title = {Enabling Virtual Priority in Data Center Congestion Control},
					year = {2025},
					isbn = {9798400711961},
					publisher = {Association for Computing Machinery},
					address = {New York, NY, USA},
					url = {https://doi.org/10.1145/3689031.3717463},
					doi = {10.1145/3689031.3717463},
					booktitle = {Proceedings of the Twentieth European Conference on Computer Systems},
					pages = {396–412},
					numpages = {17},
					keywords = {Congestion control algorithm, Data center network, In-network priority},
					location = {Rotterdam, Netherlands},
					series = {EuroSys '25}
					}

Preprints

Ultra-Efficient Prefix-Aware Attention for LLM Decoding
Zhibin Wang∗, Rui Ning∗, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian
arXiv:2505.17694, 2025 | [abs] | [bib]

                    Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix, while attention computation in the decode stage, which becomes a critical bottleneck with increasing context lengths, is a memory-intensive process requiring heavy memory access on the key-value (KV) cache of the prefixes. Therefore, in this paper, we explore the potential of prefix-sharing in the attention computation of the decode stage. However, the tree structure of the prefix-sharing mechanism presents significant challenges for attention computation in efficiently processing shared KV cache access patterns while managing complex dependencies and balancing irregular workloads. To address the above challenges, we propose a dedicated attention kernel to combine the memory access of shared prefixes in the decoding stage, namely FlashForge. FlashForge delivers two key innovations: a novel shared-prefix attention kernel that optimizes memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload balancing mechanism that efficiently estimates cost, divides tasks, and schedules execution. Experimental results show that FlashForge achieves an average 1.9x speedup and 120.9x memory access reduction compared to the state-of-the-art FlashDecoding kernel regarding attention computation in the decode stage and 3.8x end-to-end time per output token compared to the vLLM.

                    @misc{wang2025flashforgeultraefficientprefixawareattention,
					title={FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding},
					author={Zhibin Wang and Rui Ning and Chao Fang and Zhonghui Zhang and Xi Lin and Shaobo Ma and Mo Zhou and Xue Li and Zhongfeng Wang and Chengying Huan and Rong Gu and Kun Yang and Guihai Chen and Sheng Zhong and Chen Tian},
					year={2025},
					eprint={2505.17694},
					archivePrefix={arXiv},
					primaryClass={cs.LG},
					url={https://arxiv.org/abs/2505.17694},
					}

Education

Nanjing University, China
B.E. in Software Engineering

Overall GPA: 4.7/5.0 (Major GPA: 4.7/5.0)
Ranking: 2/216

Sep. 2022 - Jun. 2026

Teaching

C Programming Language
Fall 2024 & Fall 2025, TA, Nanjing Universiy

Awards & Honors

EuroSys’25 Best Student Paper, ACM EuroSys, 2025
ASC Student Supercomputer Challenge First Prize, ASC Committee, 2025
Chinese Mathematics Competition First Prize, Chinese Mathematical Society, 2023

Scholarship

High-Value Scholarship × 3 (Top 1%), Nanjing University, 2023-2025

Talks

Performance Optimization: Theory and Practice
- C Programming Language @ Nanjing University, Nanjing, China, Nov. 2025
PortFC: Designing High-performance Deadlock-free BCube Networks
- ACM ICS'25, Salt Lake City, Utah, Jun. 2025
Efficient Code Debugging: System Skills in C and Unix
- C Programming Language @ Nanjing University, Nanjing, China, Nov. 2024