Intel/VMware Crossroads 3D-FPGA Academic Research Center

Publications and Downloads

You can watch available presentations of our publications on our YouTube playlist! Also check out our GitHub for open-source IPs.

2024

BBQ: A Fast and Scalable Integer Priority Queue for Hardware Packet Scheduling [abstract]
Atre, N., Sadok, H., and Sherry, J. (2024). BBQ: A Fast and Scalable Integer Priority Queue for Hardware Packet Scheduling. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI). Santa Clara, CA, USA: USENIX Association. [bibtex]

Abstract:
The need for fairness, strong isolation, and fine-grained control over network traffic in multi-tenant cloud settings has engendered a rich literature on packet scheduling in switches and programmable hardware. Recent proposals for hardware scheduling primitives (e.g., PIFO, PIEO, BMW-Tree) have enabled run-time programmable packet schedulers, considerably expanding the suite of scheduling policies that can be applied to network traffic. However, no existing solution can be practically deployed on modern switches and NICs because they either do not scale to the number of elements required by these devices or fail to deliver good throughput, thus requiring an impractical number of replicas. In this work, we ask: is it possible to achieve priority packet scheduling at line-rate while supporting a large number of flows? Our key insight is to leverage a scheduling primitive used previously in software (called Hierarchical Find First Set) and port this to a highly pipeline-parallel hardware design. We present the architecture and implementation of the Bitmapped Bucket Queue (BBQ), a hardware-based integer priority queue that supports a wide range of scheduling policies (via a PIFO-like abstraction). BBQ, for the first time, supports hundreds of thousands of concurrent flows while guaranteeing 100 Gbps line rate (148.8 Mpps) on FPGAs and 1 Tbps (1,488 Mpps) line rate on ASICs. We demonstrate this by implementing BBQ on a commodity FPGA where it is capable of supporting over 100K flows and 32K priorities at 300 MHz, 3× the packet rate of similar hardware priority queue designs. On ASIC, we can synthesize 100K elements at 3.1 GHz using a 7nm process.
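The Hierarchical Find First Set idea mentioned in the abstract can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware design: the class name, the `word_bits` and `levels` parameters, and the convention that a lower integer means higher priority are all assumptions made for illustration. Each level of bitmaps summarizes which words of the level below are non-empty, so finding the minimum priority takes one find-first-set operation per level.

```python
from collections import deque


class HierarchicalBitmapQueue:
    """Illustrative integer priority queue over priorities 0..word_bits**levels - 1.

    Hypothetical sketch: lower value = served first, FIFO within a priority.
    """

    def __init__(self, word_bits=8, levels=2):
        self.word_bits = word_bits
        self.levels = levels
        self.num_priorities = word_bits ** levels
        # bitmaps[0] is the single root word; bitmaps[levels - 1] has one bit
        # per priority bucket.
        self.bitmaps = [[0] * (word_bits ** lvl) for lvl in range(levels)]
        self.buckets = [deque() for _ in range(self.num_priorities)]

    def push(self, priority, item):
        self.buckets[priority].append(item)
        idx = priority
        # Set the bucket's bit at the leaf level, then mark each ancestor word.
        for level in reversed(range(self.levels)):
            idx, bit = divmod(idx, self.word_bits)
            self.bitmaps[level][idx] |= 1 << bit

    def pop_min(self):
        if self.bitmaps[0][0] == 0:
            raise IndexError("pop from empty queue")
        idx = 0
        # Walk from the root down, taking the lowest set bit at each level.
        for level in range(self.levels):
            word = self.bitmaps[level][idx]
            bit = (word & -word).bit_length() - 1  # find first set
            idx = idx * self.word_bits + bit
        priority = idx
        item = self.buckets[priority].popleft()
        if not self.buckets[priority]:
            # Clear bits bottom-up, stopping once a word is still non-empty.
            idx = priority
            for level in reversed(range(self.levels)):
                idx, bit = divmod(idx, self.word_bits)
                self.bitmaps[level][idx] &= ~(1 << bit)
                if self.bitmaps[level][idx] != 0:
                    break
        return priority, item


queue = HierarchicalBitmapQueue(word_bits=8, levels=2)  # 64 priorities
for prio, flow in [(5, "a"), (3, "b"), (60, "c"), (3, "d")]:
    queue.push(prio, flow)
drained = [queue.pop_min() for _ in range(4)]
print(drained)
```

With `word_bits=8` and `levels=2` the sketch covers 64 priorities; adding a level multiplies the range by `word_bits` at the cost of one more lookup per operation.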
BibTeX:
```
@inproceedings {bbq,
  author = {Atre, Nirav and Sadok, Hugo and Sherry, Justine},
  title = {{BBQ}: A Fast and Scalable Integer Priority Queue for Hardware Packet Scheduling},
  booktitle = {21st {USENIX} Symposium on Networked Systems Design and Implementation},
  year = {2024},
  address = {Santa Clara, CA},
  publisher = {{USENIX} Association},
  month = apr,
  series = {{NSDI}~'24}
}
```

2023

Of Apples and Oranges: Fair Comparisons in Heterogenous Systems Evaluation [abstract] [pdf] [slides]
Sadok, H., Panda, A., and Sherry, J. (2023). Of Apples and Oranges: Fair Comparisons in Heterogenous Systems Evaluation. In Proceedings of the 22nd Workshop on Hot Topics in Networks (HotNets). Boston, MA, USA: Association for Computing Machinery. [bibtex]

Abstract:
Accelerators, such as GPUs, SmartNICs and FPGAs, are common components of research systems today. This paper focuses on the question of how to fairly compare these systems. This is challenging because it requires comparing systems that use different hardware, e.g., two systems that use two different types of accelerators, or comparing a system that uses an accelerator with one that does not. We argue that fair evaluation in this case requires reporting not just performance, but also the cost of competing systems. We discuss what cost metrics should be used, and propose general principles for incorporating cost in research evaluations.
BibTeX:
```
@inproceedings{apples_oranges,
author = {Sadok, Hugo and Panda, Aurojit and Sherry, Justine},
title = {Of Apples and Oranges: Fair Comparisons in Heterogenous Systems Evaluation},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626111.3628186},
doi = {10.1145/3626111.3628186},
booktitle = {Proceedings of the 22nd Workshop on Hot Topics in Networks},
pages = {1--8},
location = {Boston, Massachusetts},
month = nov,
series = {{HotNets}~'23}
}
```
Ensō: A Streaming Interface for NIC-Application Communication [abstract] [pdf] [slides] [video] [opensource]
Sadok, H., Atre, N., Zhao, Z., Berger, D. S., Hoe, J., Panda, A., Sherry, J., and Wang, R. (2023). Ensō: A Streaming Interface for NIC-Application Communication. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Boston, MA, USA: USENIX Association. [bibtex]

Abstract:
Today, most communication between the NIC and software involves exchanging fixed-size packet buffers. This packetized interface was designed for an era when NICs implemented few offloads and software implemented the logic for translating between application data and packets. However, both NICs and networked software have evolved: modern NICs implement hardware offloads, e.g., TSO, LRO, and serialization offloads that can more efficiently translate between application data and packets. Furthermore, modern software increasingly batches network I/O to reduce overheads. These changes have led to a mismatch between the packetized interface, which assumes that the NIC and software exchange fixed-size buffers, and the features provided by modern NICs and used by modern software. This incongruence between interface and data adds software complexity and I/O overheads, which in turn limits communication performance. This paper proposes Ensō, a new streaming NIC-to-software interface designed to better support how NICs and software interact today. At its core, Ensō eschews fixed-size buffers, and instead structures communication as a stream that can be used to send arbitrary data sizes. We show that this change reduces software overheads, reduces PCIe bandwidth requirements, and leads to fewer cache misses. These improvements allow an Ensō-based NIC to saturate a 100 Gbps link with minimum-sized packets (forwarding at 148.8 Mpps) using a single core, improve throughput for high-performance network applications by 1.5-6x, and reduce latency by up to 43%.
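The 148.8 Mpps figure quoted for a 100 Gbps link follows from standard Ethernet framing: a minimum-sized frame is 64 bytes, and each frame carries 20 more bytes on the wire (preamble, start-of-frame delimiter, and inter-frame gap). A short Python sketch of the arithmetic, with a hypothetical helper name:

```python
def line_rate_mpps(link_gbps, frame_bytes=64, overhead_bytes=20):
    """Packets per second (in millions) that saturate an Ethernet link.

    overhead_bytes covers the 8-byte preamble/SFD and the 12-byte
    inter-frame gap that accompany every frame on the wire.
    """
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_packet / 1e6


print(round(line_rate_mpps(100), 1))  # prints 148.8
```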
BibTeX:
```
@inproceedings {enso,
author = {Sadok, Hugo and Atre, Nirav and Zhao, Zhipeng and Berger, Daniel S. and Hoe, James C. and Panda, Aurojit and Sherry, Justine and Wang, Ren},
title = {{Ensō}: A Streaming Interface for {NIC}-Application Communication},
booktitle = {17th {USENIX} Symposium on Operating Systems Design and Implementation},
year = {2023},
isbn = {978-1-939133-34-2},
address = {Boston, MA},
pages = {1005--1025},
publisher = {{USENIX} Association},
month = jul,
series = {{OSDI}~'23}
}
```
DREAMPlaceFPGA-PL: An Open-Source GPU-Accelerated Packer-Legalizer for Heterogeneous FPGAs [abstract]
R. S. Rajarathnam, Z. Jiang, M. A. Iyer, and D. Z. Pan, DREAMPlaceFPGA-PL: An Open-Source GPU-Accelerated Packer-Legalizer for Heterogeneous FPGAs. In Proceedings of the International Symposium on Physical Design (ISPD). [bibtex]

Abstract:
Placement plays a pivotal and strategic role in the FPGA implementation flow to allocate the physical locations of the heterogeneous instances in the design. Among the placement stages, the packing or clustering stage groups logic instances like look-up tables (LUTs) and flip-flops (FFs) that could be placed on the same site. The legalization stage determines all instances’ physical site locations. With advances in FPGA architecture and technology nodes, designs contain millions of logic instances, and placement algorithms must scale accordingly. While other placement stages, namely global placement and detailed placement, have been accelerated using GPUs, the acceleration of packing and legalization stages on a GPU remains largely unexplored. This work presents DREAMPlaceFPGA-PL, an open-source packer-legalizer for heterogeneous FPGAs that employs a GPU for acceleration. We revise the existing consensus-based parallel algorithms employed for packing and legalizing a flat placement to obtain further speedup on a GPU. Our experiments on the ISPD’2016 benchmarks demonstrate more than 2× acceleration.
BibTeX:
```
@inproceedings{dreamplacefpga-pl-ispd2023,
author={Rajarathnam, Rachel Selina and Jiang, Zixuan and Iyer, Mahesh A. and Pan, David Z.},
booktitle={International Symposium on Physical Design (ISPD)}, 
title={DREAMPlaceFPGA-PL: An Open-Source GPU-Accelerated Packer-Legalizer for Heterogeneous FPGAs}, 
year={2023},
volume={},
number={},
pages={},
numpages = {},
doi={},
month = mar
}
```

2022

SurgeProtector: Mitigating Temporal Algorithmic Complexity Attacks using Adversarial Scheduling [pdf]
Atre, N., Sadok, H., Chiang, E., Wang, W., and Sherry, J. SurgeProtector: Mitigating Temporal Algorithmic Complexity Attacks using Adversarial Scheduling. In Proceedings of the 2022 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM).
RLPlace: Using Reinforcement Learning and Smart Perturbations to Optimize FPGA Placement
Elgammal, M., Murray, K., and Betz, V. RLPlace: Using Reinforcement Learning and Smart Perturbations to Optimize FPGA Placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
RAD-Sim: Rapid Architecture Exploration for Novel Reconfigurable Acceleration Devices
Boutros, A., Nurvitadhi, E., and Betz, V. RAD-Sim: Rapid Architecture Exploration for Novel Reconfigurable Acceleration Devices. In Proceedings of Int. Conf. on Field-Programmable Logic and Applications.
Exploiting the Common Case when Accelerating Input-Dependent Stream Processing by FPGA
Zhao, Z., Melber, J., Sahay, S., Obla, S., Nurvitadhi, E., and Hoe, J. Exploiting the Common Case when Accelerating Input-Dependent Stream Processing by FPGA. To appear in IEEE Transactions on Computers.
High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-design
Anupreetham, A., Mohamed, I., Hall, M., Boutros, A., Kuzhivley, A., Mohanty, A., Nurvitadhi, E., Betz, V., Cao, Y., and Seo, J. High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-design. Submitted to IEEE Transactions on VLSI.
Architecture and Application Co-Design for Beyond-FPGA Reconfigurable Acceleration Devices
Boutros, A., Nurvitadhi, E., and Betz, V. Architecture and Application Co-Design for Beyond-FPGA Reconfigurable Acceleration Devices. Submitted to IEEE Access.
DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit [abstract]
R. S. Rajarathnam, M. B. Alawieh, Z. Jiang, M. Iyer and D. Z. Pan, DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit. In Proceedings of the 27th Asia and South Pacific Design Automation Conference (ASP-DAC). [bibtex]

Abstract:
Modern Field Programmable Gate Arrays (FPGAs) are large-scale heterogeneous programmable devices that enable high performance and energy efficiency. Placement is a crucial and computationally intensive step in the FPGA design flow that determines the physical locations of various heterogeneous instances in the design. Several works have employed GPUs and FPGAs to accelerate FPGA placement and have obtained significant runtime improvement. However, with these approaches, it is a non-trivial effort to develop optimized and algorithmic-specific kernels for GPU and FPGA to realize the best acceleration performance. In this work, we present DREAMPlaceFPGA, an open-source deep-learning toolkit-based accelerated placement framework for large-scale heterogeneous FPGAs. Notably, we develop new operators in our framework to handle heterogeneous resources and FPGA architecture-specific legality constraints. The proposed framework requires low development cost and provides an extensible framework to employ different placement optimizations. Our experimental results on the ISPD'2016 benchmarks show very promising results compared to prior approaches.
BibTeX:
```
@inproceedings{dreamplacefpga-aspdac2022,
author={Rajarathnam, Rachel Selina and Alawieh, Mohamed Baker and Jiang, Zixuan and Iyer, Mahesh and Pan, David Z.},
booktitle={27th Asia and South Pacific Design Automation Conference (ASP-DAC)}, 
title={DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit}, 
year={2022},
volume={},
number={},
pages={300--306},
numpages = {7},
doi={10.1109/ASP-DAC52403.2022.9712562},
month = jan
}
```

2021

Specializing for Efficiency: Customizing AI Inference Processors on FPGAs
Boutros, A., Nurvitadhi, E., and Betz, V. Specializing for Efficiency: Customizing AI Inference Processors on FPGAs. In Proceedings of IEEE Int. Conf. on Microelectronics (ICM).
End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression [abstract]
A. Na, M. Ibrahim, M. Hall, A. Boutros, A. Mohanty, E. Nurvitadhi, V. Betz, Y. Cao and J. Seo. (2021). End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression. In Proceedings of the Int. Conf. on Field Programmable Logic and Applications (FPL). [bibtex]

Abstract:
Object detection is an important computer vision task, with many applications in autonomous driving, smart surveillance, robotics, and other domains. Single-shot detectors (SSD) coupled with a convolutional neural network (CNN) for feature extraction can efficiently detect, classify and localize various objects in an input image with very high accuracy. In such systems, the convolution layers extract features and predict the bounding box locations for the detected objects as well as their confidence scores. Then, a non-maximum suppression (NMS) algorithm eliminates partially overlapping boxes and selects the bounding box with the highest score per class. However, these two components are strictly sequential; a conventional NMS algorithm needs to wait for all box predictions to be produced before processing them. This prohibits any overlap between the execution of the convolutional layers and NMS, resulting in significant latency overhead and throughput degradation. In this paper, we present a novel NMS algorithm that alleviates this bottleneck and enables a fully-pipelined hardware implementation. We also implement an end-to-end system for low-latency SSD-MobileNet-V1 object detection, which combines a state-of-the-art deeply-pipelined CNN accelerator with a custom hardware implementation of our novel NMS algorithm. As a result of our new algorithm, the NMS module adds a minimal latency overhead of only 0.13 microseconds to the SSD-MobileNet-V1 convolution layers. Our end-to-end object detection system implemented on an Intel Stratix 10 FPGA runs at a maximum operating frequency of 350 MHz, with a throughput of 609 frames-per-second and an end-to-end batch-1 latency of 2.4 ms. Our system achieves 1.5x higher throughput and 4.4x lower latency compared to the current state-of-the-art SSD-based object detection systems on FPGAs.
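For context, the conventional NMS step that the abstract describes as a bottleneck can be sketched as follows. This is a minimal Python illustration of the standard greedy baseline for a single class, not the paper's pipelined algorithm. The function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 IoU threshold are assumptions chosen for illustration. The initial sort by score is what forces the algorithm to wait for every box prediction before it can start.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept by conventional greedy NMS (one class)."""
    # Sorting needs all predictions up front: the sequential dependency.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)  # prints [0, 2]
```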
BibTeX:
```
@inproceedings{hpipe-nms-fpl21,
author = {Na, A. and Ibrahim, M. and Hall, M. and Boutros, A. and Mohanty, A. and Nurvitadhi, E. and Betz, V. and Cao, Y. and Seo, J.},
title = {End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression},
year = {2021},
isbn = {},
booktitle = {Proceedings of the International Conference on Field-Programmable Logic and Applications},
pages = {1--8},
numpages = {8},
month = aug
}
```
DO-GPU: Domain Optimizable Soft GPUs
R. Ma, J. Hsu, T. Tan, E. Nurvitadhi, R. Vivekanandham, A. Dasu, M. Langhammer, and D. Chiou. (2021). DO-GPU: Domain Optimizable Soft GPUs. In Proceedings of International Conference on Field-Programmable Logic and Applications (FPL).
Pigasus: Efficient Handling of Input-Dependent Streaming on FPGAs [pdf]
Z. Zhao. (2021). Pigasus: Efficient Handling of Input-Dependent Streaming on FPGAs. PhD Thesis, ECE, Carnegie Mellon University.
Fluid: Raising the Level of Abstraction for FPGA Accelerator Development Without Compromising Performance [pdf]
J. Melber. (2021). Fluid: Raising the Level of Abstraction for FPGA Accelerator Development Without Compromising Performance. PhD Thesis, ECE, Carnegie Mellon University.
We Need Kernel Interposition over the Network Dataplane [abstract] [pdf]
Sadok, H., Zhao, Z., Choung, V., Atre, N., Berger, D. S., Hoe, J. C., Panda, A., and Sherry, J. (2021). We Need Kernel Interposition over the Network Dataplane. In Proceedings of the Workshop on Hot Topics in Operating Systems. [bibtex]

Abstract:
Kernel-bypass networking, which allows applications to circumvent the kernel and interface directly with NIC hardware, is one of the main tools for improving application network performance. However, allowing applications to circumvent the kernel makes it impossible to use tools (e.g., tcpdump) or impose policies (e.g., QoS and filters) that need to interpose on traffic sent by different applications running on a host. This makes maintainability and manageability a challenge for kernel-bypass applications. In response, we propose Kernel On-Path Interposition (KOPI), in which traditional kernel dataplane functionality is retained but implemented in a fully programmable SmartNIC. We hypothesize that KOPI can support the same tools and policies as the kernel stack while retaining the performance benefits of kernel bypass.
BibTeX:
```
@inproceedings{Sadok2021,
author = {Sadok, Hugo and Zhao, Zhipeng and Choung, Valerie and Atre, Nirav and Berger, Daniel S. and Hoe, James C. and Panda, Aurojit and Sherry, Justine},
title = {We Need Kernel Interposition over the Network Dataplane},
year = {2021},
isbn = {},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the Workshop on Hot Topics in Operating Systems},
pages = {1--6},
numpages = {6},
month = jun,
series = {{HotOS}~'21}
}
```
FlexScore: Quantifying Flexibility
T. Tan, E. Nurvitadhi, A. Dasu, M. Langhammer, and D. Chiou. (2021). FlexScore: Quantifying Flexibility. IEEE Computer Architecture Letters.
System Level Tradeoffs Between ASIC and FPGA Accelerators
T. Tan. (2021). System Level Tradeoffs Between ASIC and FPGA Accelerators. PhD Thesis, ECE, University of Texas at Austin.

2020

From TensorFlow Graphs to LUTs and Wires: Automated Sparse and Physically Aware CNN Hardware Generation
Betz, V. and Hall, M. From TensorFlow Graphs to LUTs and Wires: Automated Sparse and Physically Aware CNN Hardware Generation. In Proceedings of IEEE Conf. on Field-Programmable Technology.
Beyond Peak Performance: Comparing The Real Performance of AI-Optimized FPGAs and GPUs
Boutros, A., Nurvitadhi, E., Ma, R., Gribok, S., Langhammer, M., Zhao, Z., Hoe, J., and Betz, V. Beyond Peak Performance: Comparing The Real Performance of AI-Optimized FPGAs and GPUs. In Proceedings of IEEE Conf. on Field-Programmable Technology.
Achieving 100Gbps Intrusion Prevention on a Single Server [abstract] [pdf] [slides] [video] [opensource]
Zhao, Z., Sadok, H., Atre, N., Hoe, J., Sekar, V., and Sherry, J. (2020). Achieving 100Gbps Intrusion Prevention on a Single Server. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Berkeley, CA, USA: USENIX Association. [bibtex]

Abstract:
Pigasus is a 100 Gbps Intrusion Detection and Prevention System that inspects network traffic against 10K+ rules while supporting 100K+ concurrent connections. Pigasus is implemented on a single server using one FPGA-based SmartNIC and a few CPU cores, saving hundreds of cores compared with a CPU-only approach. The GitHub repository contains the FPGA RTL code, the CPU full matcher code, and scripts for RTL simulation, synthesis builds, and on-board hardware tests.
BibTeX:
```
@inproceedings {258923,
author = {Zhipeng Zhao and Hugo Sadok and Nirav Atre and James C. Hoe and Vyas Sekar and Justine Sherry},
title = {Achieving 100Gbps Intrusion Prevention on a Single Server},
booktitle = {14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {1083--1100},
url = {https://www.usenix.org/conference/osdi20/presentation/zhao-zhipeng},
publisher = {{USENIX} Association},
month = nov,
}
```
VTR 8: High Performance CAD and Customizable FPGA Architecture Modeling
Murray, K., Petelin, O., Zhong, S., Wang, J., Eldafrawy, M., Legault, J., Sha, E., Graham, A., Wu, J., Walker, M., Zeng, H., Patros, P., Luu, J., Kent, K., and Betz, V. VTR 8: High Performance CAD and Customizable FPGA Architecture Modeling. ACM Trans. on Reconfigurable Technology and Systems.