FPGA Technology at Crossroads

Field Programmable Gate Arrays (FPGAs) have been undergoing rapid and dramatic changes fueled by their expanding use in datacenter computing. Rather than serving as a compromise or alternative to ASICs, FPGA ‘programmable logic’ is emerging as a third paradigm of compute that stands apart from traditional hardware vs. software archetypes. A multi-university, multi-disciplinary team has formed around the question:

What should be the future role of FPGAs as a central function in datacenter servers?

Guided by both the demands of modern networked, data-centric computing and the new capabilities from 3D integration, the Intel/VMware Crossroads 3D-FPGA Academic Research Center will investigate a new programmable hardware data-nexus lying at the heart of the server and operating over data ‘on the move’ between network, traditional compute, and storage elements.

The Intel/VMware Crossroads 3D-FPGA Academic Research Center is jointly supported by Intel and VMware. The center is committed to the public and free dissemination of its research outcomes.


You can find an overview presentation on the center’s YouTube channel. Please contact any of the Crossroads PIs in your research area if you have any questions or interest.

If you are looking for an introductory overview on FPGAs, you may find the first 4 lectures from this course useful. Please see FPGA Architecture: Principles and Progression by Boutros and Betz for a technical overview article. You can find a wide range of FPGA topics presented to different skill levels on this Intel YouTube Channel.


Latest News

February 2022 | Intel’s Corporate Research Council recognizes Crossroads Center PIs Sherry, Sekar and Hoe with 2021 Outstanding Researcher Awards for their work on the Pigasus FPGA-Accelerated Intrusion Detection and Prevention System. Pigasus inspects 100k+ concurrent connections against 10k+ SNORT rules at 100 Gbps in a single server form factor by handling common-case processing in an Intel FPGA SmartNIC. Pigasus was developed by former CMU PhD student Dr. Zhipeng Zhao in his dissertation on efficient acceleration of irregular, data-dependent stream processing. Today, Pigasus is a focus application driver for many technologies under research by the Crossroads Center. Pigasus has gained broad interest as an open-sourced project with a growing academic and industrial user and developer community.


December 2021 | Mohamed Ibrahim successfully defended his MASc thesis at the University of Toronto. His thesis detailed enhancements to the HPIPE FPGA-based CNN accelerator to perform object detection and to span multiple FPGAs for higher performance. Mohamed developed an automatic partitioning algorithm that allows HPIPE accelerators to achieve higher parallelism by spanning multiple FPGAs. Both performance models and deployment on a multi-Stratix-10 system in James Hoe’s group at CMU showed near-linear speedup as the FPGA count increased. Mohamed will join Intel’s Deep Learning Accelerator team in February.


December 2021 | The paper “Specializing for Efficiency: Customizing AI Inference Processors on FPGAs,” by Andrew Boutros, Vaughn Betz (University of Toronto) and Eriko Nurvitadhi (Intel) received the “Third Paper Award” from the IEEE International Conference on Microelectronics. This work showed that specializing NPU accelerators to workload classes improves performance by 9% to 35% while simultaneously reducing resource usage by 23% to 44%. Andrew is currently augmenting the SystemC NPU model to investigate dividing the NPU into modular latency-insensitive components; this will enable investigation of Crossroads FPGA architecture ideas that include linking components and accelerators with a (latency-insensitive) NoC.


[Find all News here]


Recent Publications

  • DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit [abstract]

    R. S. Rajarathnam, M. B. Alawieh, Z. Jiang, M. Iyer and D. Z. Pan, "DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit," 27th Asia and South Pacific Design Automation Conference (ASP-DAC), 2022. [bibtex]

    Abstract:
    Modern Field Programmable Gate Arrays (FPGAs) are large-scale heterogeneous programmable devices that enable high performance and energy efficiency. Placement is a crucial and computationally intensive step in the FPGA design flow that determines the physical locations of various heterogeneous instances in the design. Several works have employed GPUs and FPGAs to accelerate FPGA placement and have obtained significant runtime improvement. However, with these approaches, it is a non-trivial effort to develop optimized, algorithm-specific kernels for GPU and FPGA to realize the best acceleration performance. In this work, we present DREAMPlaceFPGA, an open-source deep-learning toolkit-based accelerated placement framework for large-scale heterogeneous FPGAs. Notably, we develop new operators in our framework to handle heterogeneous resources and FPGA architecture-specific legality constraints. The proposed framework incurs low development cost and is extensible to different placement optimizations. Our experimental results on the ISPD'2016 benchmarks show very promising results compared to prior approaches.
    BibTeX:
    @inproceedings{dreamplacefpga-aspdac2022,
    author={Rajarathnam, Rachel Selina and Alawieh, Mohamed Baker and Jiang, Zixuan and Iyer, Mahesh and Pan, David Z.},
    booktitle={27th Asia and South Pacific Design Automation Conference (ASP-DAC)}, 
    title={DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit}, 
    year={2022},
    volume={},
    number={},
    pages={300-306},
    numpages = {7},
    doi={10.1109/ASP-DAC52403.2022.9712562},
    month = jan
    }
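
    The abstract frames placement as an optimization problem that a deep-learning toolkit's gradient machinery can solve. As a rough illustration of that core idea only (not DREAMPlaceFPGA's actual operators, objective, or constraint handling), the toy placer below minimizes total squared wirelength over two-pin nets by plain gradient descent; the cell names, net format, and hyperparameters are our own illustrative assumptions:

```python
# Toy sketch of analytical placement as differentiable optimization
# (NOT DREAMPlaceFPGA code): minimize total squared wirelength of
# two-pin nets by gradient descent, with fixed I/O pads anchoring
# the movable cells. Names and hyperparameters are illustrative.

def place(nets, fixed, movable, lr=0.1, iters=500):
    """Gradient descent on the sum of squared net lengths.

    nets:    list of (cell_a, cell_b) two-pin nets
    fixed:   {name: (x, y)} immovable pads
    movable: {name: (x, y)} initial positions of movable cells
    """
    pos = dict(fixed)
    pos.update(movable)
    for _ in range(iters):
        grad = {c: [0.0, 0.0] for c in movable}
        for a, b in nets:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            # gradient of (dx^2 + dy^2) w.r.t. each movable endpoint
            if a in grad:
                grad[a][0] += 2 * dx
                grad[a][1] += 2 * dy
            if b in grad:
                grad[b][0] -= 2 * dx
                grad[b][1] -= 2 * dy
        for c in movable:
            pos[c] = (pos[c][0] - lr * grad[c][0],
                      pos[c][1] - lr * grad[c][1])
    return {c: pos[c] for c in movable}

# A cell wired to pads at x=0 and x=10 settles at the midpoint:
result = place([("m", "p0"), ("m", "p1")],
               {"p0": (0.0, 0.0), "p1": (10.0, 0.0)},
               {"m": (0.0, 0.0)})
```

    Real analytical placers use smoothed half-perimeter wirelength and add density and legality terms; this sketch shows only the differentiable-objective core that lets toolkit optimizers and accelerators drive placement.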
    
  • End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression [abstract]

    A. Na, M. Ibrahim, M. Hall, A. Boutros, A. Mohanty, E. Nurvitadhi, V. Betz, Y. Cao and J. Seo, "End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression," International Conference on Field-Programmable Logic and Applications (FPL), 2021. [bibtex]

    Abstract:
    Object detection is an important computer vision task, with many applications in autonomous driving, smart surveillance, robotics, and other domains. Single-shot detectors (SSD) coupled with a convolutional neural network (CNN) for feature extraction can efficiently detect, classify and localize various objects in an input image with very high accuracy. In such systems, the convolution layers extract features and predict the bounding box locations for the detected objects as well as their confidence scores. Then, a non-maximum suppression (NMS) algorithm eliminates partially overlapping boxes and selects the bounding box with the highest score per class. However, these two components are strictly sequential; a conventional NMS algorithm needs to wait for all box predictions to be produced before processing them. This prohibits any overlap between the execution of the convolutional layers and NMS, resulting in significant latency overhead and throughput degradation. In this paper, we present a novel NMS algorithm that alleviates this bottleneck and enables a fully-pipelined hardware implementation. We also implement an end-to-end system for low-latency SSD-MobileNet-V1 object detection, which combines a state-of-the-art deeply-pipelined CNN accelerator with a custom hardware implementation of our novel NMS algorithm. As a result of our new algorithm, the NMS module adds a minimal latency overhead of only 0.13 microseconds to the SSD-MobileNet-V1 convolution layers. Our end-to-end object detection system implemented on an Intel Stratix 10 FPGA runs at a maximum operating frequency of 350 MHz, with a throughput of 609 frames-per-second and an end-to-end batch-1 latency of 2.4 ms. Our system achieves 1.5x higher throughput and 4.4x lower latency compared to the current state-of-the-art SSD-based object detection systems on FPGAs.
    BibTeX:
    @inproceedings{hpipe-nms-fpl21,
    author = {Na, A. and Ibrahim, M. and Hall, M. and Boutros, A. and Mohanty, A. and Nurvitadhi, E. and Betz, V. and Cao, Y. and Seo, J.},
    title = {End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression},
    year = {2021},
    isbn = {},
    booktitle = {Proceedings of the International Conference on Field-Programmable Logic and Applications},
    pages = {1–8},
    numpages = {8},
    month = aug
    }
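
    The bottleneck the abstract describes is easiest to see in the conventional algorithm itself. Below is a minimal Python sketch of batch NMS (the box format, scores, and threshold are our own illustrative assumptions, not details from the paper): the global sort over all scores means no output can be emitted until every prediction has arrived, which is exactly the serialization the paper's pipelined NMS removes.

```python
# Minimal sketch of conventional (batch) non-maximum suppression.
# Box format (x1, y1, x2, y2) and the IoU threshold are illustrative
# assumptions, not taken from the paper.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def batch_nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box in each group of overlapping boxes.
    The sort below needs the complete prediction list, so conventional
    NMS cannot overlap with the CNN that produces the boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

    For two heavily overlapping boxes and one disjoint box, only the top-scoring box of the overlapping pair and the disjoint box survive.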
    
  • DO-GPU: Domain Optimizable Soft GPUs

    R. Ma, J. Hsu, T. Tan, E. Nurvitadhi, R. Vivekanandham, A. Dasu, M. Langhammer and D. Chiou, "DO-GPU: Domain Optimizable Soft GPUs," International Conference on Field-Programmable Logic and Applications (FPL), 2021.

[Find all Publications and Downloads here]