Crossroads Seminars

The Crossroads seminar series is offered regularly on Fridays 2~3pm (US eastern). The seminar series will feature the latest research results by the center's PIs and students, as well as a diverse range of talks including informal work-in-progress and invited outside speakers.

Upcoming Seminars

Past Seminars

Portrait of Moein Khazraee
Friday, November 17, 2023 | 2pm~3pm ET

Lowering the barrier to entry to use customized hardware for cloud systems
Moein Khazraee, Nvidia

Abstract: With the increasing demand in cloud computing, there is a vital need to efficiently transfer the data and process them on high-performance and scalable systems within the data center. However, network bandwidth is outpacing our ability to process packets in software, forcing cloud providers to resort to specialized hardware. Unfortunately, hardware development is an inherently intricate, laborious, and costly procedure. Furthermore, integrating specialized hardware into a networked application requires hardware-software co-design, exacerbating the situation as developers with markedly different specializations have to collaborate.

In this talk, I will discuss how we can build frameworks to systematically tackle these challenges and lower the barrier to entry of hardware customization in cloud systems. Then, I will focus on two frameworks: Rosebud for wired, and SparSDR for wireless networks in base stations. The Rosebud framework brings software-like control and debugging to FPGA-based middleboxes, which enabled us to port the state-of-the-art intrusion detector in less than a month and double its throughput to 200 Gbps. The SparSDR framework makes the backhaul and computation of Software-Defined Radios more efficient while maintaining their universality. It enables backhauling a 100 MHz frequency band over only 224 Mbps instead of 3.2 Gbps, and decoding BLE packets in real-time on a low-end processor.
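As a rough check on the SparSDR backhaul figures quoted above, the following minimal sketch computes the implied reduction factor and the bits carried per hertz of captured band. The variable names are illustrative only, and reading the 3.2 Gbps baseline as 32 bits per complex sample is an interpretation, not something stated in the abstract.

```python
# Figures quoted in the abstract.
BANDWIDTH_HZ = 100e6        # captured frequency band: 100 MHz
FULL_RATE_BPS = 3.2e9       # conventional backhaul rate: 3.2 Gbps
SPARSE_RATE_BPS = 224e6     # SparSDR backhaul rate: 224 Mbps

# How many times less backhaul bandwidth the quoted SparSDR rate needs.
reduction_factor = FULL_RATE_BPS / SPARSE_RATE_BPS

# Fraction of the conventional backhaul rate that is saved.
savings_fraction = 1.0 - SPARSE_RATE_BPS / FULL_RATE_BPS

# Bits sent per hertz of captured band. The baseline works out to 32,
# which would match 16-bit I plus 16-bit Q samples (an assumption).
full_bits_per_hz = FULL_RATE_BPS / BANDWIDTH_HZ
sparse_bits_per_hz = SPARSE_RATE_BPS / BANDWIDTH_HZ

print(f"reduction: {reduction_factor:.1f}x, saved: {savings_fraction:.0%}")
print(f"bits/Hz: {full_bits_per_hz:.2f} -> {sparse_bits_per_hz:.2f}")
```

The quoted rates therefore correspond to roughly a 14x reduction in backhaul bandwidth for the same 100 MHz band.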
Bio: Moein Khazraee is a Senior Architect at NVIDIA, focusing on applied research in networking for large scale systems, such as high performance computing and machine learning. Previously, he was a postdoctoral research associate in MIT Computer Science & Artificial Intelligence Laboratory, where he focused on network optimizations for machine learning, as well as benefiting from the rising Silicon Photonics technology to scale performance beyond single-chip limitations. He received his PhD in Computer Science and Engineering from UC San Diego.

His research interests lie primarily in the intersection of network systems and computer architecture. He has worked on bringing the hardware customization to different parts of the cloud infrastructure, such as building data-centers from ASICs, co-optimizing network topology and ML parallelization strategy, simplifying FPGA development for high-bandwidth network middleboxes, and developing backhaul and compute-efficient software-defined radios for mobile base stations.

Portrait of Pierre-Emmanuel Gaillardon
Friday, November 3, 2023 | 2pm~3pm ET

Under the Hood of OpenFPGA
Pierre-Emmanuel Gaillardon, The University of Utah

Abstract: In this talk, we will introduce the OpenFPGA framework, whose aim is to generate highly customizable Field Programmable Gate Array (FPGA) fabrics and their supporting EDA flows. Following in the footsteps of the RISC-V initiative, OpenFPGA brings reconfigurable logic into the open-source community and closes the performance gap with commercial products. OpenFPGA strongly incorporates physical design automation in its core and enables 100k+ look-up table FPGA fabric generation from specification to layout in less than 24h with a single engineer's effort.
Bio: Pierre-Emmanuel Gaillardon (S’10–M’11–SM’16) is an Associate Professor and the Associate Chair for Academics and Strategic Initiatives in the Electrical and Computer Engineering (ECE) department and an adjunct Associate Professor in the School of Computing at The University of Utah, Salt Lake City, UT, where he leads the Laboratory for NanoIntegrated Systems (LNIS). He holds an Electrical Engineer M.Sc. degree from CPE-Lyon, France (2008), a M.Sc. degree in Electrical Engineering from INSA Lyon, France (2008) and a Ph.D. degree in Electrical Engineering from CEA-LETI, Grenoble, France and the University of Lyon, France (2011).

Prior to joining the University of Utah, he was a research associate at the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland within the Laboratory of Integrated Systems (Prof. De Micheli) and a visiting research associate at Stanford University, Palo Alto, CA, USA. Previously, he was a research assistant at CEA-LETI, Grenoble, France. Prof. Gaillardon is a recipient of the C-Innov 2011 best thesis award, the Nanoarch 2012 best paper award, the BSF 2017 Prof. Pazy Memorial Research Award, the 2017 NSF CAREER award, the 2018 IEEE CEDA Pederson Award, the 2018 ChemE Education William H. Corcoran best paper award, the 2019 DARPA Young Faculty Award, the 2019 IEEE CEDA Ernest S. Kuh Early Career Award, the 2020 ACM SIGDA Outstanding New Faculty Award, the 2022 ECE Department Research Faculty Award and the 2023 DARPA under-40 Innovators Award. He has served as a TPC member for many conferences, including DATE, DAC, ICCAD, and Nanoarch. He is an associate editor of IEEE TNANO and a reviewer for several journals and funding agencies. He served as Topic co-chair "Emerging Technologies for Future Memories" for DATE'17-19. He is a senior member of the IEEE.

The research activities and interests of Prof. Gaillardon are currently focused on the development of novel computing systems exploiting emerging device technologies and novel EDA techniques.

Portrait of Eriko Nurvitadhi
Thursday, November 2, 2023 | 11am~12pm ET (Note special day and time)

Evading Datacenter Tax Using MangoBoost’s Customizable FPGA Data Processing Units
Eriko Nurvitadhi, MangoBoost

Abstract: Modern datacenter servers rely on an increasing number of devices to improve efficiency in data-centric tasks, such as data storage (SSDs), movement (NICs), and processing (GPU, NPU, etc.). Moreover, they offer advanced infrastructure for users to access resources easily and flexibly (e.g., via virtual machines, containers) and to deploy and scale applications (e.g., via web/microservices). Managing such servers with a rich set of devices, while running sophisticated infrastructure software, imposes a growing burden on CPUs and introduces substantial system overheads, often known as “Datacenter Tax.” MangoBoost’s novel patent-pending technologies offer comprehensive accelerator building blocks to offload a myriad of datacenter taxes, with a customizable framework to produce specialized data processing units (DPUs) that dramatically improve server system performance, scalability, and cost. These custom DPUs are optimized for desired system targets and readily deployable on production-qualified off-the-shelf FPGAs.

This talk will discuss datacenter trends, the problem with datacenter tax, and MangoBoost’s DPU solutions to address this problem. To highlight the effectiveness of MangoBoost’s solutions, the talk will present case studies in networked storage and AI systems. For example, applying MangoBoost DPU to Samsung’s Petabyte SSD storage system enables 400GbE NVMe/TCP remote storage access at 90%+ of 400GbE line rate, achieving 3x higher throughput over existing solutions (which fall significantly short of line rate), while reducing host CPU usage by up to 95%, resulting in an estimated 20% savings in total cost of ownership (TCO). For AI, applying MangoBoost DPU to an AI training system that accesses datasets via remote storage leads to a major improvement in system performance for MLPerf Storage Benchmark. Finally, the talk will discuss opportunities for researchers to collaborate and use MangoBoost DPUs for research.
Bio: Dr. Nurvitadhi is a co-founder and the Chief Product Officer of MangoBoost, Inc., which offers novel customizable data processing unit (DPU) solutions to boost server system performance and efficiency. MangoBoost, Inc. is a well-funded start-up that has received $65M in seed and Series-A funds to grow (and is actively hiring). Dr. Nurvitadhi’s interests are in hardware accelerator architectures, systems, and software for key application domains (e.g., AI, analytics). Previously, he was a Principal Engineer at Intel, focused on FPGAs, accelerators, and AI technologies. He has 70+ peer-reviewed publications and 120+ patents granted/pending, with an H-index of 33. In 2020, he was recognized as a top 30 inventor by Intel Patent Group and received a Mahboob Khan Outstanding Liaison Award from SRC. He has served on program committees of IEEE/ACM conferences, and as the Technical Program Chair for FCCM 2022. He received a PhD in ECE from Carnegie Mellon University, and an MBA from Oregon State University.

Portrait of Andrew Boutros
Friday, October 13, 2023 | 2pm~3pm ET

CAD and Architecture Exploration Tools for Next-Generation Reconfigurable Acceleration Devices
Andrew Boutros, University of Toronto

Abstract: Field-programmable gate arrays (FPGAs) have evolved beyond a fabric of soft logic and hard blocks surrounded by programmable routing to also incorporate high-performance networks-on-chip (NoCs), general-purpose processor cores and application-specific accelerators. These new reconfigurable acceleration devices (RADs) open up a myriad of architecture research questions that require enhancing existing FPGA computer-aided design (CAD) tools and building new architecture evaluation tools for such complex devices. In the first part of this talk, we will present our new NoC-aware versatile place and route (VPR) flow that co-optimizes traditional circuit implementation metrics (e.g. wirelength, critical path delay) and NoC performance metrics (e.g. congestion, bandwidth utilization, latency) when mapping an application design with NoC-attached modules to a RAD’s NoC-enhanced FPGA fabric. Then, we will give an overview of our RAD architecture exploration and evaluation flow that consists of a system simulator (RAD-Sim) for evaluating performance of a candidate RAD, and a system implementation tool (RAD-Gen) for estimating the silicon implementation area and speed of key RAD architecture components. We showcase the tools using a case study on deep learning recommendation models (DLRMs), showing that a RAD with balanced NoC and hard matrix-vector multiplication engines can achieve up to 20x higher performance than current FPGAs.
Bio: Andrew Boutros is a PhD student in the ECE department at the University of Toronto under the supervision of Prof. Vaughn Betz. His research interests are at the intersection of FPGA architecture/CAD and AI acceleration. He is a post-graduate affiliate of the Intel/VMware Crossroads 3D-FPGA Academic Research Center and the International Center for Spatial Computational Learning. He is also a machine learning systems engineer at MangoBoost. From 2018 to 2022, he was a research scientist at Intel Labs and Intel's Programmable Solutions Group CTO Office. He received his MASc in Computer Engineering from the University of Toronto in 2018, and his BSc in Electronics Engineering from the German University in Cairo in 2016.

Portrait of James C. Hoe
Friday, September 29, 2023 | 2pm~3pm ET

How hard is it to use an FPGA for compute acceleration in 2023?
James C. Hoe, Carnegie Mellon University

Abstract: In this talk I want to explore the question: how hard is it to use an FPGA in a computer system in 2023? Secondarily, there is the question: what application domain would most profit from FPGA acceleration if the historical programmability and usability challenges are removed? With advances in new single-source heterogeneous programming languages and high-level synthesis, I will argue that using an FPGA for compute acceleration is no harder than using a GPU through CUDA/OpenCL (which is by no means easy). Applying Intel DPC++/oneAPI to an interesting design example in high-throughput, low-latency streaming data analytics (i.e., aggregation), I will show that (1) we shouldn't automatically expect a loss of quality when using HLS to design for FPGAs; and, more importantly (2) the resulting HLS IPs can be much more maintainable and reusable, as well as being more accessible to application-level experts to assemble new streaming application pipelines. With an improved systematic programming flow, data stream processing---including online and offline transformation, inspection, and analytics---is one of the prime candidates to leverage FPGAs’ advantage over CPU/GPU/ASIC options in a computer system.
Bio: James C. Hoe is a Professor of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in EECS from Massachusetts Institute of Technology in 2000 (S.M., 1994). He received his B.S. in EECS from UC Berkeley in 1992. He is interested in many aspects of computer architecture and digital hardware design, including the specific areas of FPGA architecture for computing; digital signal processing hardware; and high-level hardware design and synthesis. He is a Fellow of IEEE. For more information, please visit

Portrait of Derek Chiou
Friday, September 15, 2023 | 2pm~3pm ET

Terminus: Moving the Center of Cloud Servers from Cores to SmartNICs and Beyond
Derek Chiou, The University of Texas at Austin

Abstract: Since the start of computing, server design has been core-centric. As infrastructure functionality, such as network virtualization, storage virtualization, encryption, etc. consumes more and more computational power, the center of the server is moving from cores/processors to SmartNICs. This talk motivates this move, describes how SmartNICs work today, and discusses future directions and challenges.
Bio: Derek Chiou is a professor in the Electrical and Computer Engineering Department at The University of Texas at Austin and a Partner Architect at Microsoft responsible for future infrastructure offload system architecture. He is a co-founder of the Microsoft Azure SmartNIC effort and led the Bing FPGA team to first deployment of Bing ranking on FPGAs. Until 2016, when he joined Microsoft, he was an associate professor at UT. Before UT, Dr. Chiou was a system architect at Avici Systems, a manufacturer of terabit core routers. Dr. Chiou received his Ph.D., S.M., and S.B. degrees in Electrical Engineering and Computer Science from MIT.

Portrait of Justine Sherry
Friday, April 21, 2023 | 2pm~3pm ET

Re-envisioning generic server architectures for I/O-driven compute
Justine Sherry, Carnegie Mellon University

Abstract: In this talk, I will explore how traditional server architectures are "CPU-driven" rather than "I/O-driven", and why this architecture is a poor fit for a wide range of networked applications. I will highlight three Crossroads RV1 projects targeting I/O-driven compute, and discuss how the Crossroads 3D FPGA will help server architectures better match the needs of datacenter applications.
Bio: Justine Sherry is an assistant professor at Carnegie Mellon University. Her interests are in software and hardware networked systems; her work includes middleboxes, FPGA packet processing, measurement, cloud computing, and congestion control. Dr. Sherry received her PhD (2016) and MS (2012) from UC Berkeley, and her BS and BA (2010) from the University of Washington. Her research has been awarded the VMware Systems Research Award, the Applied Networking Research Prize, a Google Faculty Research Award, the SIGCOMM doctoral dissertation award, the David J. Sakrison prize, and paper awards at USENIX NSDI and ACM SIGCOMM. She is a member of the DARPA ISAT Study Group and the SIGCOMM CARES Committee. Most importantly, she is always on the lookout for a great cappuccino.

Portrait of Kimia Talaei
Friday, April 14, 2023 | 2pm~3pm ET

Capturing Realistic Architectures for Field Programmable Gate Array Optimization
Kimia Talaei, University of Toronto

Abstract: In this talk, we will present a VPR-compatible architecture description of Intel's Stratix 10 device. This capture enables benchmarking and optimization of FPGA CAD flows on a more complex architecture, and serves as a baseline architecture on which researchers can evaluate architectural enhancements.

We will describe how the primitives, functional blocks, routing architecture and timing of Stratix 10 were captured and where approximations were made because of missing information or limitations in the architecture description format of VPR. We will discuss the performance of VPR in terms of quality of results, runtime, and resource utilization, as compared to Quartus Prime 21.2. Our results show a significant gap in packing and placement runtime, but a smaller overall runtime and smaller memory footprint for VPR. We also found that VPR trails Quartus in terms of wirelength and timing optimization. We investigate the major causes for VPR's slow packing runtime and propose potential future directions for improvements.
Bio: Kimia Talaei recently received her M.A.Sc. in Electrical and Computer Engineering at the University of Toronto under the supervision of Prof. Vaughn Betz. Prior to the University of Toronto, she received her B.Sc. degree in Computer Engineering from Sharif University of Technology, Iran. Her research interests include open-source FPGA CAD tools and potential applications of machine learning in this area.

Portrait of Peipei Zhou
(Invited) Friday, April 7, 2023 | 2pm~3pm ET

CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture
Peipei Zhou, University of Pittsburgh

Abstract: Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? In this talk, we will discuss the CHARM framework, which composes multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, and MLP, respectively, which correspond to 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator.
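To put the CHARM numbers in context, the minimal sketch below relates the quoted results to the quoted 6.4 TFLOPs theoretical peak. The per-engine FLOPs-per-cycle value is derived here from the abstract's figures rather than stated in it, and all variable names are illustrative.

```python
# Figures quoted in the abstract.
NUM_ENGINES = 400      # AI Engine processors in the array
CLOCK_HZ = 1e9         # 1 GHz
PEAK_TFLOPS = 6.4      # theoretical fp32 peak

# fp32 FLOPs per engine per cycle implied by the quoted peak
# (derived from the numbers above, not stated in the abstract).
flops_per_engine_cycle = PEAK_TFLOPS * 1e12 / (NUM_ENGINES * CLOCK_HZ)

# "Less than 5% of peak" for small BERT layers on a monolithic design.
small_layer_ceiling_tflops = 0.05 * PEAK_TFLOPS

# Reported end-to-end inference throughput per application, in TFLOPs.
achieved_tflops = {"BERT": 1.46, "ViT": 1.61, "NCF": 1.74, "MLP": 2.94}

# Fraction of the theoretical peak reached by each application.
peak_fraction = {name: t / PEAK_TFLOPS for name, t in achieved_tflops.items()}

print(f"implied FLOPs/engine/cycle: {flops_per_engine_cycle:.0f}")
print(f"5% of peak: {small_layer_ceiling_tflops:.2f} TFLOPs")
for name, frac in peak_fraction.items():
    print(f"{name}: {frac:.1%} of peak")
```

Under these figures, the reported throughputs land between roughly a quarter and a half of the theoretical peak, compared with under 5% for the small layers on a single monolithic accelerator.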
Bio: Peipei Zhou is an assistant professor in the Electrical and Computer Engineering (ECE) department at the University of Pittsburgh. She has over 10 years of experience in hardware and software co-design. She has published 20+ papers in top-tier IEEE/ACM computer system and design automation conferences and journals including FPGA, FCCM, DAC, ICCAD, ISPASS, TCAD, TECS, TODAES, and IEEE Micro. The algorithm and tool proposed in her FCCM’18 paper have been realized in the commercial Vitis HLS (high-level synthesis) compiler from Xilinx (acquired by AMD in Feb 2022). Her work in FPGA acceleration for deep learning won the 2019 Donald O. Pederson Best Paper Award from the IEEE Council for Design Automation (CEDA). Her work in cloud-based application optimization was a Best Paper Nominee at the 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), and her work in FPGA acceleration for computer vision was a Best Paper Nominee at the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). Before joining Pitt, she worked as a full-time staff software engineer in a start-up company and led a team of 6 members to develop CNN and MM kernels in the deep learning libraries for two generations of AI training application-specific integrated circuit (ASIC) chip products.

Portrait of Jiaqi Gao
(Invited) Friday, March 31, 2023 | 2pm~3pm ET

Vela: Host-Side Uniform Programming Platform for Network Processing
Jiaqi Gao, Alibaba Group US

Abstract: With the growing performance requirements on networked applications, there is a new trend of offloading applications to SmartNICs. However, today, programmers have to make tedious efforts to understand the instruction sets of SmartNICs and optimize their performance manually. Moreover, they have to either manually partition the application between the NIC and the host for the best performance or rely on heuristics such as offloading as much as they can without considering the overhead of splitting states and flows. In this paper, we propose Vela, a language and compiler that enables automatic partitioning of programs to the NIC and hosts for high performance. Our evaluation shows that Vela for Netronome Agilio and BlueField2 SmartNICs can achieve high accuracy in performance prediction and propose partitioning plans with significant CPU savings.
Bio: Jiaqi Gao is a senior software engineer at Alibaba Group US. He received his Ph.D. from Harvard University. His advisor is Prof. Minlan Yu. His research interests include data center networks, distributed systems, and programmable devices.

Portrait of John Shalf
(Invited) Friday, March 24, 2023 | 2pm~3pm ET

The Future of Computing Beyond Moore’s Law
John Shalf, Lawrence Berkeley National Laboratory

Abstract: Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power, and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore’s Law for 50 years is failing and is anticipated to flatten by 2025. This presentation provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this presentation covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers.
Bio: John Shalf is Department Head for Computer Science at Lawrence Berkeley National Laboratory, and recently was deputy director of Hardware Technology for the DOE Exascale Computing Project. Shalf is a coauthor of over 80 publications in the field of parallel computing software and HPC technology, including three best papers and the widely cited report "The Landscape of Parallel Computing Research: A View from Berkeley" (with David Patterson and others). He also coauthored the 2008 "ExaScale Software Study: Software Challenges in Extreme Scale Systems," which set the Defense Advanced Research Projects Agency’s (DARPA's) information technology research investment strategy. Prior to coming to Berkeley Laboratory, John worked at the National Center for Supercomputing Applications and the Max Planck Institute for Gravitational Physics/Albert Einstein Institute (AEI), where he was co-creator of the Cactus Computational Toolkit.

Portrait of Andrew Bitar
(Invited) Friday, December 9, 2022 | 2pm~3pm ET

Groq’s Software-Defined Hardware for Dataflow Compute
Andrew Bitar, Groq

Abstract: With the end of Dennard Scaling and explosion of data-flow compute in the domains of AI and HPC, there has been a new renaissance in domain specific architectures (DSAs) to help meet today’s compute demands. A large swath of these architectures are spatial in nature, where compute is unrolled in space to expose more parallelism for data-flow-heavy workloads. With these spatial architectures comes the challenge of effectively mapping workloads to the available compute units. Parallelizing compilers are often touted as the means to this goal, but their effectiveness is largely limited by the abstraction exposed by hardware to software. Here we explore the inherent challenges faced by some existing spatial architectures, such as GPUs, and explain how focusing on deterministic compute can alleviate these challenges. We do this by diving deep into Groq’s Tensor Streaming Processor (TSP), exploring how the architecture empowers software to efficiently map data-flow workloads to the chip’s massive amounts of compute. We demonstrate how this “software-defined hardware” approach is well-suited for data-flow compute, showcasing >5x improvements compared to current state-of-the-art on LSTM and Transformer-based models. We also explore how the compiler and architecture allow for powerful hardware-software co-design capabilities.
Bio: Andrew Bitar leads a team working on Groq’s novel Tensor Streaming Processor compiler, focused on ML and HPC workloads. Before joining Groq, he was a Technical Lead at Intel developing an FPGA Deep Learning Accelerator and compiler. Andrew received his MASc from the University of Toronto, where his research focus was on spatial architectures and applications for FPGAs.

Portrait of Marius Stan
Friday, December 2, 2022 | 2pm~3pm ET

HPIPE-NX: Leveraging Tensor Blocks for High-Performance CNN Inference Acceleration on FPGAs
Marius Stan, University of Toronto

Abstract: HPIPE is a state-of-the-art sparse-aware CNN accelerator for FPGAs. Through building deeply pipelined, customized hardware for every layer in the CNN and modelling the physical device characteristics, HPIPE can achieve very high compute density while maintaining high operating frequency. HPIPE also leverages sparsity, allowing it to skip multiplications with weights that are close to or equal to 0. HPIPE requires all parameters to fit on-chip, making it memory bound for larger networks like Resnet. Recent work has allowed HPIPE to be split over multiple chips, reducing this memory bottleneck; however, there are opportunities to reduce it further by using lower data precisions. Smaller networks, such as Mobilenets-v1 to v3, have traditionally been compute bound due to the limited number of multipliers on-chip (DSPs each have 2 INT18 multipliers). The AI-optimized Stratix 10 NX can solve this issue with its novel tensor blocks (30 INT8 multipliers each). However, tensor blocks are harder to exploit, as one set of values must be pre-loaded into ping-pong registers while the other can be broadcast afterwards.

In this talk, I will first briefly present HPIPE and how it produces highly efficient accelerators for CNNs. I will then discuss the work that was done to enhance HPIPE for the Stratix 10 NX and present simulation and FPGA results detailing the performance improvements. The tradeoffs of the different tensor block modes will be discussed, including their effects on sparsity support, memory utilization, and Fmax. We will also explore current bottlenecks on the FPGA implementation and potential ways of removing them in the future. The consequences of switching to a lower data precision will also be explored, with an analysis on using block floating-point mode on the tensor block to improve accuracy.
Bio: Marius is a second-year M.A.Sc. student in Computer Engineering at the University of Toronto advised by Prof. Vaughn Betz. His research focuses on developing machine learning accelerators for FPGAs, and exploring how FPGA architecture can be improved to better suit them. He recently finished an internship as a part of Intel’s programmable solutions group (PSG). Prior to doing his Master’s, he also completed his B.A.Sc. at the University of Toronto where he did an internship at AMD as part of their hardware video encoding team. After graduation he will be joining Microsoft on their AI FPGA team.

Portrait of Ang Li
(Invited) Friday, November 11, 2022 | 2pm~3pm ET

Efficient, Programmable, and Manufacturable Hardware: The Case for Synthesizable FPGAs
Ang Li, Princeton University

Abstract: Field Programmable Gate Arrays (FPGAs) are being used in a fast-growing range of scenarios like cloud-scale AI engines and reconfigurable accelerators, while heterogeneous CPU-FPGA systems are being tapped as a possible way to mitigate the challenges posed by the end of Moore’s Law. This growth in diverse use cases has fueled the need to customize FPGA architectures for particular applications or application domains. If FPGAs are to become a universal computing fabric like general-purpose processors, they must be technology-agnostic, flexible in architecture, and adaptable to physical design constraints.

This talk will give an overview of Princeton Reconfigurable Gate Array (PRGA), an open-source framework for building customized, synthesizable FPGAs with bespoke, RTL-to-bitstream toolchains. I will present three prototype system-on-chip (SoC) tape-outs that each integrate a different PRGA instance and pose a unique challenge to architecture-VLSI co-design. We will also briefly discuss several novel uses of FPGAs, including fine-grained, manycore-eFPGA integration and the idea of using reconfigurable fabric as the substrate for domain-specialized hardware.
Bio: Ang Li is a Ph.D. candidate in the department of Electrical and Computer Engineering at Princeton University, advised by Prof. David Wentzlaff. He received his B.Sc. in Electronic Engineering from Tsinghua University in 2016 and his M.A. in Electrical Engineering from Princeton University in 2018. He is interested in all aspects of computer architecture and VLSI design, especially heterogeneous and reconfigurable architectures. He is an experienced chip builder and an active contributor to multiple open-source projects. He is on the academic job market at the time of the talk.

Portrait of Tony Nowatzki
(Invited) Friday, November 4, 2022 | 2pm~3pm ET

OverGen: Improving FPGA Usability through Domain-specific Overlay Generation
Tony Nowatzki, University of California, Los Angeles

Abstract: The mainstream programming approach for FPGAs is high level synthesis (HLS). Unfortunately, HLS leaves a significant programmability gap in terms of reconfigurability, customization and versatility: 1. FPGA physical design can take hours, 2. FPGA reconfiguration time limits the applicability of HLS to workloads with little dynamic behavior, and 3. HLS tools do not reason about cross-workload flexibility. Overlay architectures mitigate the above by mapping a programmable design (e.g. CPU, GPU, etc.) on top of FPGAs. However, the abstraction gap between overlay and FPGA leads to low efficiency/utilization.

Our essential idea is to develop a hardware generation framework targeting a highly-customizable overlay, so that the abstraction gap can be lowered by tuning the design instance to applications of interest. We leverage and extend prior work on customizable spatial architectures, SoC generation, accelerator compilers, and design space exploration (DSE) to create an end-to-end FPGA acceleration system called OverGen. OverGen can compete in performance with state-of-the-art HLS techniques, while requiring 10,000x less compile time and reconfiguration time.
Bio: Tony Nowatzki is an associate professor in the Computer Science Department at the University of California, Los Angeles, where he leads the PolyArch Research Group. He joined UCLA in 2017 after completing his PhD at the University of Wisconsin - Madison. While at UCLA he served as a consultant for Simple Machines Inc., an AI hardware startup that used several of his patents in fabricated chips. Recognition of his work includes four IEEE Micro Top Picks awards, a CACM Research Highlights, best paper nominations at MICRO and HPCA, and a PLDI Distinguished Paper Award.

Portrait of Bhushan Chitlur
(Invited) Friday, October 28, 2022 | 2pm~3pm ET

Open FPGA Stack (OFS)
Bhushan Chitlur, Intel

Abstract: The OFS (Open FPGA Stack) aims to reduce the burden of building E2E solutions by leveraging a library of configurable RTL building blocks that are wrapped up as a HW/SW workflow. The OFS workflow decomposes the platform functionality into its primary ingredients, i.e., PCIe, HSSI, Mem, and Manageability, and provides configurable subsystems that can be used by customers to build their stacks. Unlike the prior PAC architecture, the OFS workflow comprehends custom boards and custom requirements.
Bio: Bhushan Chitlur is a Senior Principal Engineer in the Datacenter and AI group at Intel Corp focused on the next generation of heterogeneous datacenter accelerators and memory solutions using FPGAs and custom ASICs. This includes close-coupled accelerator architectures using CXL for emerging workloads, edge/cloud, and infrastructure acceleration. He was the lead architect of the industry's first Xeon+FPGA MCP, and drives a portfolio of technologies required to deploy FPGAs in the datacenter. He has 18 publications, 20 issued patents, and 10+ patents pending.

Portrait of Andrew Boutros
Friday, October 14, 2022 | 2pm~3pm ET

RAD-Sim: Rapid Architecture Exploration for Novel Reconfigurable Acceleration Devices
Andrew Boutros, University of Toronto

Abstract: To improve the efficiency of FPGAs for new datacenter use cases and data-intensive applications, a new class of reconfigurable acceleration devices (RADs) is emerging. In these devices, the FPGA fine-grained reconfigurable fabric is a component of a bigger monolithic or multi-die system-in-package that can incorporate general-purpose software-programmable cores, domain-specialized accelerator blocks, and high-performance NoCs for efficient communication between these system components. The integration of all these components in a RAD results in a huge design space and requires re-thinking the implementation of applications that need to be migrated from conventional FPGAs to these novel devices. In this talk, I will present RAD-Sim, an architecture simulator that allows rapid design space exploration for RADs and facilitates the study of complex interactions between their various components. I will also go through a case study that highlights the utility of RAD-Sim in re-designing applications for these novel RADs by mapping the neural processing unit (NPU) AI inference overlay to different RAD instances.
Bio: Andrew Boutros is a Ph.D. candidate in the University of Toronto ECE department under the supervision of Prof. Vaughn Betz. He received his B.Sc. degree in electronics engineering from the German University in Cairo in 2016, and his M.A.Sc. degree in computer engineering from the University of Toronto in 2018. His research interests include FPGA architecture and CAD, deep learning acceleration, and domain-specific architectures.

Portrait of Lina Sawalha
(Invited) Friday, August 5, 2022 | 2pm~3pm ET

FPGA-Predict: Performance and Power Prediction of FPGAs Using Machine Learning and Application Characteristics
Lina Sawalha, Western Michigan University

Abstract: Recent developments in high-level synthesis (HLS) tools have resulted in wider use of Field Programmable Gate Arrays (FPGAs) across different application domains. While HLS tools allow non-Hardware Description Language (HDL) experts to explore FPGAs to accelerate their applications, they are still challenging to use. Users need to learn how to use HLS tools, add pragmas, and modify and optimize their applications to explore the benefits of FPGAs. Existing FPGA performance estimation models either depend on HLS reports, which are time-consuming to generate, or are not generalizable.

This talk presents a fast and accurate machine-learning (ML) based technique to predict the performance and power consumption of FPGAs. Our ML prediction model does not require HLS reports; it relies on CPU code only. It uses static source code features, intermediate representation, and CPU execution dynamic features to predict the power and performance of FPGAs. Our feature selection model chooses the best set of features to cluster applications and predict their execution time and power consumption on FPGAs. We used an ensemble ML design that avoids overfitting and results in an accurate, robust, and generalizable method. We validated our methodology using K-fold cross-validation and used different FPGA devices.
Bio: Lina Sawalha is an Associate Professor in the Electrical and Computer Engineering Department at Western Michigan University. She has been a visiting faculty member at Carnegie Mellon University this past year. Her research interests include compiler and high-level synthesis optimization, heterogeneous architectures and systems, computer architecture, and performance analysis. She received her Ph.D. and M.Sc. degrees from the University of Oklahoma in 2012 and 2009, respectively.

Portrait of Scott Weber
(Invited) Friday, July 29, 2022 | 2pm~3pm ET

Soft NOC: Leveraging HyperFlex and Long Wires to Construct High-Performance Pipelined Busses
Scott Weber, Intel

Abstract: We construct high-performance pipelined busses using the long wires and HyperFlex registers of Agilex. These pipelined busses are used as the underlying components for point-to-point connections which can form the basis of more complex soft NOC constructions. We demonstrate that these soft logic pipelined busses can be included in hierarchical design flows including partial reconfiguration with minimal impact to the logic they are traversing over.
Bio: Scott Weber is a Principal Engineer at Intel PSG. Since joining Altera/Intel in 2014, he has led the development of the partial reconfiguration architecture of S10, Agilex and future architectures. On Agilex-M, he was an architect on the NOC. He continues to lead pathfinding and development of future architectures. Scott leads RV5 (From “Field Programmable” to “Programmable”) from the Intel side for the Crossroads 3D-FPGA Academic Research Center. Prior to joining Intel, Scott was a software engineer at Tabula working on synthesis, 3D analytical placement and on-device debugging techniques like SignalTap. Scott has co-authored 19 granted patents. He received his Ph.D. in Electrical and Computer Engineering from the University of California, Berkeley, and B.S. degrees in Electrical and Computer Engineering and Computer Science from Carnegie Mellon University.

Portrait of Prashanth Mohan
(Invited) Friday, June 24, 2022 | 2pm~3pm ET

Soft Embedded FPGA Fabrics: Top-down Physical Design and Applications
Prashanth Mohan, Carnegie Mellon University

Abstract: Embedded FPGA (eFPGA) fabrics are increasingly used in modern System-on-Chip (SoC) designs as their programmability can be leveraged to accelerate a variety of workloads and enable upgradeability, feature addition, and security. As technology scales down to sub-5nm nodes, designing eFPGA fabrics using custom layout techniques requires extensive design time (many months), suffers from poor process portability, and is not compatible with demanding SoC design schedules. On the other hand, soft eFPGA fabrics described in RTL and designed using standard cells provide effortless process portability and have the potential to reduce the eFPGA physical design cycle from months to less than a day. Conventional design methodologies for implementing standard-cell-based eFPGAs employ a bottom-up approach wherein individual tiles are synthesized in isolation and later stitched together to generate the large FPGA fabric. However, the bottom-up approach significantly deviates from push-button ASIC flows and requires a manual floorplanning and buffering strategy for each FPGA architecture and process technology.

This work proposes a top-down design methodology fully compatible with standard ASIC design flows to facilitate the agile physical design of soft eFPGAs just like any other digital block, without the manual effort required in the bottom-up flow. We developed a soft eFPGA fabric generator using CHISEL and used it to tape out proof-of-concept homogeneous and heterogeneous fabrics with BRAM and DSP tiles on 16nm and 22nm industrial CMOS FinFET process nodes. The true potential of soft eFPGA comes to light when it is integrated with other designs to enable new applications that were previously difficult to realize. We present two such applications: hardware redaction and a reconfigurable co-processor for the RISC-V CPU. First, the idea of hardware redaction, a hardware obfuscation approach, is proposed to allow designers to substitute security-critical IP blocks within a design with a synthesizable eFPGA. eFPGA redaction was demonstrated by obfuscating the control path of a RISC-V CPU. Second, a heterogeneous soft eFPGA fabric was integrated as a RISC-V co-processor to support custom RISC-V instructions on a 22nm SoC test chip.
Bio: Prashanth Mohan is a Ph.D. student at Carnegie Mellon University. Prior to that, he completed his Masters in Electronics Design at the Indian Institute of Science, Bangalore, and then worked as a physical design engineer at Nvidia for two and a half years. His research interests include VLSI design, FPGAs, and hardware security.

Portrait of Mohamed S. Abdelfattah
(Invited) Friday, April 22, 2022 | 2pm~3pm ET

FPGAs are (not) Good at Deep Learning
Mohamed S. Abdelfattah, Cornell University

Abstract: There have been many attempts to use FPGAs to accelerate deep neural networks (DNNs), including many by the speaker of this talk. Some of these attempts ended up facing direct competition from GPUs and ASICs that are hyper-tuned for DNNs–inevitably, FPGAs often lose in that competition. However, there are many promising research directions in which FPGAs are indeed the best platform to accelerate parts of a deep learning workload. This talk will discuss several emerging paradigms in which FPGA strengths can be successfully leveraged for accelerating deep learning workloads. I will focus on (1) Automated DNN-HW codesign, (2) Using FPGA lookup tables as DNN building blocks and (3) The role of embedded networks on-chip in FPGA-powered datacenters.
Bio: Mohamed Abdelfattah is an Assistant Professor at Cornell Tech and in the Electrical and Computer Engineering Department at Cornell University. His research interests include deep learning systems, automated machine learning, hardware-software codesign, reconfigurable computing, and FPGA architecture. Mohamed’s goal is to design the next generation of machine-learning-centric computer systems for both datacenters and mobile devices.

Mohamed received his BSc from the German University in Cairo, his MSc from the University of Stuttgart, and his PhD from the University of Toronto. His PhD was supported by the Vanier Canada Graduate Scholarship and he received three best paper awards for his work on embedded networks-on-chip for FPGAs. His PhD work garnered much industrial interest and has since been adopted by multiple semiconductor companies in their latest FPGAs. After his PhD, Mohamed spent time at Intel’s programmable solutions group, and most recently at Samsung where he led a research team focused on hardware-aware automated machine learning.

Portrait of Sang-Woo Jun
(Invited) Friday, March 25, 2022 | 2pm~3pm ET

Near-Storage Acceleration in Practice: Opportunities and Challenges
Sang-Woo Jun, University of California, Irvine

Abstract: Modern high-density, high-performance storage devices coupled with power-efficient accelerators such as FPGAs have demonstrated extremely good cost and power efficiency on various applications, compared to conventional computer systems. Many off-the-shelf commercial offerings such as the Samsung SmartSSD already exist, putting such benefits within reach for real-world applications. However, extracting the most benefits from near-storage acceleration requires drastic changes to the role of the storage device as well as how the rest of the software interacts with it, which is a daunting process due to the differences in the abstraction level, programming model, and performance characteristics of near-storage acceleration compared to conventional storage.

In this talk, we present some of the more prominent challenges and the design patterns we discovered that help overcome them. We base our discoveries on multiple important applications including graph analytics and relational database queries, as well as unstructured log analytics explored in collaboration with VMware, targeting the Samsung SmartSSD platform. Our near-storage log analytics accelerator design is efficient enough to make the best possible use of underlying storage bandwidth, resulting in an order of magnitude throughput improvement compared to pure software such as Splunk and MonetDB, equipped with comparable system resources. These experiences show that it is feasible to augment cloud software with near-storage acceleration, resulting in a dramatically lower cost of operating cloud deployments.
Bio: Sang-Woo Jun is an assistant professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. in 2018 from the Massachusetts Institute of Technology, for his work on near-storage accelerators for graph analytics. His current research continues the topic of reconfigurable hardware accelerators coupled with fast storage devices for the purpose of making large-scale data analytics more affordable, targeting a wide array of scientific and enterprise applications.

Portrait of Mohamed Ibrahim
Friday, February 25, 2022 | 2pm~3pm ET

High Performance CNN Inference Acceleration on FPGAs
Mohamed Ibrahim, University of Toronto

Abstract: Field Programmable Gate Arrays (FPGAs) are programmable devices that can implement any digital circuit. FPGAs have gained popularity in accelerating CNN computations due to their programmability, energy efficiency, customized operand precisions, and low time to market. HPIPE is a sparsity-aware deeply-pipelined CNN inference accelerator that converts a TensorFlow graph of a CNN into specialized hardware units that implement each layer in a CNN. HPIPE outperforms all CNN inference accelerators on FPGAs; moreover, its performance surpasses a V100 GPU on ResNet-50 at a batch size of one. CNNs are used extensively in image classification, but that is not their only use. Object detection is another critical application that incorporates CNNs. We integrate HPIPE with a hardware-friendly unit to accelerate object detection. In order to accelerate large CNNs and further enhance HPIPE's performance, we develop an end-to-end flow to partition CNNs across multiple FPGAs.
Bio: Mohamed Ibrahim is currently an M.A.Sc. student in the Department of Electrical and Computer Engineering (ECE) at the University of Toronto. He holds a BSc degree in Electronics and Communications Engineering from the American University in Cairo. The main focus of his M.A.Sc. is scaling machine learning accelerators to systems of FPGA clusters. His work is done in collaboration with Intel PSG CTO.

Portrait of Akshitha Sriraman
(Invited) Friday, February 18, 2022 | 2pm~3pm ET

Re-thinking Data Center Hardware Architectures from the Ground-up
Akshitha Sriraman, Carnegie Mellon University

Abstract: Current hardware and software systems were conceived at a time when we had scarce compute and memory resources, limited data and application functionality, and easy hardware performance scaling due to Moore's Law. These assumptions are not true today. Today, modern data centers must manage a rapid growth in data, users, and application functionality, while also dealing with a decline in hardware performance scaling. However, modern server hardware has not sufficiently grown to meet these new data center application requirements. In fact, the fundamental architecture of a modern server still dates back to the compute-centric desktop PCs of the 1980s, managing memory at hardware speeds but accessing I/O through legacy software stacks and peripheral interfaces.

In this talk, I will focus on meeting modern web application requirements by fundamentally re-thinking data center hardware architectures from the ground-up. Specifically, I will detail my efforts towards answering the question of: How should we build data center hardware for emerging software paradigms in the post-Moore era? I will then conclude by describing my ongoing and future research on moving from a compute-centric to a data-centric hardware architecture to meet modern web application requirements.
Bio: Akshitha Sriraman is an Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. Her research interests are in the area of bridging computer architecture and systems software, with a focus on making hyperscale data centers more efficient (via solutions that span the systems stack). The central theme of her work is to design software that is aware of new hardware constraints/possibilities and architect hardware that efficiently supports new hyperscale software requirements.

Sriraman's research has been recognized with an IEEE Micro Top Picks distinction and the 2021 David J. Kuck Dissertation Prize. She was awarded a Facebook Fellowship, a Rackham Merit Ph.D. Fellowship, and a CIS Full-Tuition Scholarship. She was also named a 2019 Rising Star in EECS. Sriraman completed her Ph.D. in Computer Science and Engineering at the University of Michigan.

Portrait of Hugo Sadok
Friday, February 11, 2022 | 2pm~3pm ET

Redesigning NIC Interfaces for Direct Application Access
Hugo Sadok, Carnegie Mellon University

Abstract: The increasing gap between network throughput and CPU performance has shifted the way network-intensive applications transfer data. These applications can no longer afford the overheads of the kernel network stack and instead communicate directly with the Network Interface Card (NIC). While this significantly improves performance, it is still challenging for applications to achieve the line rates offered by latest NICs. One of the factors that precludes applications from reaching the full potential of the communication hardware is that NICs expose a packet-level interface that places each individual packet in a separate fixed-sized memory buffer. Unfortunately, dedicating a buffer per packet imposes buffer management overhead and scattered memory accesses that interact poorly with the CPU cache.

In this talk I will present a new NIC interface design that eliminates most of the overheads imposed by the traditional packet-level interface. This new design builds upon two novel techniques: contiguous data buffers and reactive descriptors. Contiguous data buffers eliminate the need for buffer management while increasing L1d cache hits by making memory accesses sequential. Reactive descriptors pace arrival notifications according to how fast applications consume the data, avoiding overwhelming applications with unnecessary notifications. I will show how this design lets us achieve beyond 140 Mpps (94 Gbps with min-sized packets) with a single CPU core compared to 64 Mpps (43 Gbps with min-sized packets) using a state-of-the-art NIC.
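The packet-rate and bandwidth figures quoted above can be cross-checked with a short calculation. This is only a sketch: it assumes that "min-sized" means a 64-byte Ethernet frame and that each frame carries 20 bytes of additional on-the-wire overhead (preamble, start-of-frame delimiter, and inter-frame gap). Those values and the function name are our assumptions, not details stated in the abstract.

```python
# Assumptions (not stated in the abstract): a min-sized packet is a 64-byte
# Ethernet frame, and each frame costs 20 extra bytes on the wire
# (8 B preamble/SFD + 12 B inter-frame gap).
MIN_FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20


def mpps_to_gbps(mpps: float, frame_bytes: int = MIN_FRAME_BYTES) -> float:
    """Convert a packet rate in Mpps to on-the-wire throughput in Gbps."""
    bits_per_packet = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return mpps * 1e6 * bits_per_packet / 1e9


if __name__ == "__main__":
    for rate in (140, 64):
        print(f"{rate} Mpps is roughly {mpps_to_gbps(rate):.1f} Gbps")
```

Under these assumptions, 140 Mpps and 64 Mpps come out at about 94 Gbps and 43 Gbps respectively, which is consistent with the numbers in the abstract.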
Bio: Hugo Sadok is a second-year PhD student in Computer Science at CMU advised by Prof. Justine Sherry and part of the SNAP Lab. His research interests are broadly in computer networks and computer systems. Prior to CMU, he received a BS in Electronic and Computer Engineering and an MS in Electrical Engineering, both from UFRJ.

Portrait of Huaicheng Li
(Invited) Friday, February 4, 2022 | 2pm~3pm ET

Towards Predictable and Efficient Datacenter Storage
Huaicheng Li, Carnegie Mellon University

Abstract: The increasing complexity in storage software and hardware brings new challenges to achieving predictable performance and efficiency. On the one hand, emerging hardware breaks long-held system design principles and is held back by aged and inflexible system interfaces and usage models, requiring a radical rethinking of the software stack to leverage new hardware capabilities for optimal performance. On the other hand, the computing landscape is becoming increasingly heterogeneous and complex, demanding explicit systems-level support to manage hardware-associated complexity and idiosyncrasy, which is unfortunately still largely missing.

In this talk, I will discuss my efforts to build low-latency and cost-efficient datacenter storage systems. By revisiting existing storage interface/abstraction designs and software/hardware responsibility divisions, I will present holistic storage stack designs for cloud datacenters, which deliver orders of magnitude of latency improvement and significantly improved cost-efficiency.
Bio: Huaicheng is a postdoc at CMU in the Parallel Data Lab (PDL). He received his Ph.D. from the University of Chicago. His interests are mainly in Operating Systems and Storage Systems, with a focus on building high-performance and cost-efficient storage infrastructure for datacenters. His research has been recognized by two best paper nominations at FAST (2017 and 2018) and has also made real impact, with production deployment in datacenters, code integration into Linux, and a storage research platform widely used by the research community.

Portrait of Daehyeok Kim
(Invited) Friday, January 28, 2022 | 2pm~3pm ET

Unleashing the Potential of In-Network Computing
Daehyeok Kim, Microsoft

Abstract: Recent advances in programmable networking hardware create a new computing paradigm called in-network computing. This new paradigm allows functionality that has been served by commodity servers, ranging from network middleboxes to components of distributed systems, to be performed in the network. I argue that to fully unleash its potential, we need resource elasticity and fault resiliency via higher-level abstractions.

In this talk, I demonstrate that in-network computing can be elastic and resilient by designing high-level abstractions and runtime systems that enable us to effectively leverage compute and memory resources available outside of a single type of device -- e.g., programmable switches -- while hiding the complexities of dealing with device heterogeneity. I begin by introducing TEA, a framework that provides elastic memory by enabling memory-intensive in-switch applications, such as cloud-scale load balancers, to leverage DRAM on remote servers via virtual table abstraction. Then I present ExoPlane and RedPlane, frameworks that support evolving in-network computing workloads and requirements -- i.e., serving multiple concurrent applications and making them fault-tolerant -- via infinite switch resource and one big fault-tolerant switch abstractions. Several systems in the industry are now adopting some of the technologies presented in this talk.
Bio: Daehyeok Kim is a senior researcher at Microsoft. He recently completed his Ph.D. in the computer science department at Carnegie Mellon University. His research interests lie in the intersection of computer systems and networking with a focus on building new abstractions and runtime systems for in-network computing. He is a recipient of the Microsoft Research Ph.D. Fellowship.

Portrait of Nirav Atre
Friday, November 19, 2021 | 2pm~3pm ET

SurgeProtector: Mitigating Algorithmic Complexity Attacks using Adversarial Scheduling
Nirav Atre, Carnegie Mellon University

Abstract: Algorithmic complexity attacks (ACAs) are a class of Denial-of-Service (DoS) attacks where an attacker uses a small amount of adversarial traffic to induce a large amount of work in the target system, pushing the system into overload and causing it to drop packets from innocent users. ACAs are particularly dangerous because, unlike volumetric DoS attacks, ACAs don't require a significant network bandwidth investment from the attacker. Today, network functions (NFs) on the Internet must be painstakingly designed and engineered on a case-by-case basis to mitigate the debilitating impact of ACAs. Further, the resulting designs tend to be overly conservative in their attack mitigation strategy, limiting the innocent traffic that the NF can serve during common-case operation.

In this talk, I will present a general framework we designed to make any NF more resilient to ACAs without the limitations of prior approaches. Our framework, SurgeProtector, uses the NF's scheduler to mitigate the impact of ACAs using a very traditional scheduling algorithm: Weighted Shortest Job First (WSJF). To evaluate SurgeProtector, we propose a new metric of DoS vulnerability called the Displacement Factor (DF), which quantifies the maximum "harm per unit effort" an adversary can inflict on the system. Using novel insights from adversarial scheduling theory, we show that any system using WSJF has a worst-case DF of only a small constant (unity), where traditional schedulers would place no upper bound on the adversary's DF. Illustrating that SurgeProtector is not only theoretically but also practically robust, we integrate SurgeProtector into an open source Intrusion Detection System (IDS). Under simulated attack, the SurgeProtector-augmented IDS suffers 90-99% lower innocent traffic loss than the original system.
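To make the scheduling idea concrete, here is a minimal sketch of a WSJF-style queue. It assumes that each packet's weight is its size in bytes and that the scheduler always serves the packet with the smallest ratio of estimated job size to packet size. That weighting, the `WSJFQueue` class, and all parameter names are hypothetical illustrations, not the SurgeProtector implementation.

```python
import heapq
import itertools


class WSJFQueue:
    """Toy Weighted Shortest Job First queue (illustrative sketch only).

    Assumption: priority = estimated job size / packet size in bytes, so a
    small packet that triggers a lot of work is served last.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, packet_id, job_size, packet_bytes):
        if packet_bytes <= 0:
            raise ValueError("packet_bytes must be positive")
        priority = job_size / packet_bytes
        heapq.heappush(self._heap, (priority, next(self._counter), packet_id))

    def pop(self):
        """Return the id of the packet with the smallest job/size ratio."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    q = WSJFQueue()
    q.push("attack", job_size=5000, packet_bytes=64)   # tiny packet, heavy job
    q.push("bulk", job_size=10, packet_bytes=1000)     # large packet, light job
    q.push("normal", job_size=100, packet_bytes=500)
    print([q.pop() for _ in range(len(q))])
```

In this toy run the heavy-job, small-packet entry is served after the two ordinary packets, which reflects the intuition that an adversary gains little by sending cheap packets that induce expensive work.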

Bio: Nirav is a fourth-year Ph.D. student in Computer Science at Carnegie Mellon University (CMU) advised by Prof. Justine Sherry. His research interests broadly lie at the intersection of networking and performance modeling. Prior to starting graduate school, Nirav completed his BASc in Computer Engineering at the University of Toronto, Canada, in 2018.

Portrait of Jing Li
(Invited) Friday, November 5, 2021 | 2pm~3pm ET

It is All About Abstraction: Virtualizing FPGAs in the Cloud
Jing Li, University of Pennsylvania

Abstract: We have seen growing interest and benefits in exploiting FPGAs as a first-class citizen in cloud computing. Cloud vendors such as Amazon and Microsoft have begun to support on-demand FPGA acceleration in various forms of cloud service. Nonetheless, system support for cloud FPGAs is still in its infancy. The lack of efficient virtualization support makes it challenging to fully unleash the benefits of integrating FPGAs into the cloud infrastructure, leading to low elasticity and resource utilization. There are many historical and practical reasons for that: traditional FPGAs and the associated compilation tools are not designed and optimized for the multi-tenant and resource-sharing cloud computing environment. And there is no widely adopted simple hardware/software interface for spatial architectures, i.e., FPGAs, compared to temporal architectures such as CPUs.

In this talk, I will present our exploratory efforts to address these limitations. I will first present the key requirements that we identified for virtualizing spatial architectures and present a generic virtualization stack that satisfies the requirements for heterogeneous FPGA clusters. Specifically, I will introduce a two-level system abstraction that can decouple compilation from resource allocation and thus enables fine-grained resource management with low compilation overhead. I will present how we modify the existing compilation flow and runtime management to leverage the proposed abstraction to achieve efficient virtualization. Finally, I will discuss further optimization opportunities through two case studies.

Bio: Jing (Jane) Li is the Eduardo D. Glandt Faculty Fellow and Associate Professor of Electrical and Systems Engineering and of Computer and Information Science at the University of Pennsylvania. She is broadly interested in developing fundamental methods for workload optimized systems. To validate the research ideas, her research puts a strong emphasis on real system prototyping both at chip level and system level. She is the recipient of DARPA's Young Faculty Award, NSF Career Award, IBM Research Division Outstanding Technical Achievement Award for successfully achieving CEO milestone, multiple invention achievement awards and high value patent application awards from IBM. Previously she was the Dugald C. Jackson Assistant Professor at the University of Wisconsin–Madison and a faculty affiliate with the UW-Madison Computer Architecture group and Machine Learning group. She is one of the PIs in SRC JUMP center – Center for Research on Intelligent Storage and Processing-In-Memory (CRISP). She spent her early career at IBM T. J. Watson Research Center as a Research Staff Member after obtaining her PhD degree from Purdue University.

Portrait of Naif Tarafdar and Paul Chow
(Invited) Friday, October 29, 2021 | 2pm~3pm ET

AIgean: An Open Framework for Deploying Machine Learning on Heterogeneous Clusters
Naif Tarafdar and Paul Chow, University of Toronto

Abstract: AIgean, pronounced like the sea, is an open framework to build and deploy machine learning (ML) algorithms on a heterogeneous cluster of devices (CPUs and FPGAs). We present AIgean as a use case for our multi-FPGA deployment infrastructure: Galapagos. AIgean provides a full end-to-end multi-FPGA/CPU implementation of a neural network. The user supplies a high-level neural network description and our tool flow is responsible for synthesizing the individual layers, partitioning layers across different nodes, and providing the bridging and routing required for these layers to communicate. If the user is an expert in a particular domain and would like to tinker with the implementation details of the neural network, we define a flexible implementation stack for ML that includes the layers of Applications & Algorithms, Cluster Deployment & Communication, and Hardware. The Cluster Deployment & Communication and Hardware layers leverage the Galapagos layer abstractions where the communication protocol is abstracted from the application and the hardware implementations are abstracted from the physical hardware being used. This allows the user to modify specific layers of abstraction without having to worry about components outside of their area of expertise. We demonstrate the effectiveness of AIgean with three use cases: a small network running on a single network-connected FPGA, an autoencoder running on three FPGAs, and ResNet-50 running across twelve FPGAs.
Bio: Naif Tarafdar is a fifth-year PhD candidate at the University of Toronto. He has previously interned at Xilinx Research and Microsoft Research. His main research interest is in democratizing heterogeneous compute to give access to as many new users as possible. This can be through abstraction layers, APIs and programming models. He is the chief architect of Galapagos, a heterogeneous multi-FPGA development stack at the University of Toronto.

Paul Chow is a professor in the faculty of The Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto. He is a Fellow of the IEEE and Fellow of the Engineering Institute of Canada. His main research is about making FPGAs into computing devices so that applications can be easily deployed. In particular, he wants to do this at scale in a heterogeneous environment where FPGAs seamlessly interact with CPUs and other devices, all as peers, and transparently to the application.

Portrait of David Z. Pan
Friday, October 22, 2021 | 2pm~3pm ET

FPGA Placement: Recent Progress and Road Ahead
David Z. Pan, The University of Texas at Austin

Abstract: In the FPGA implementation flow, placement plays a crucial role in determining the overall quality of results and runtime. After synthesis and logic mapping, placement determines the physical locations of heterogeneous instances to optimize wirelength, timing, power, routability, etc., while meeting various constraints in modern FPGAs. This talk will give an overview of recent progress of FPGA placement targeting large-scale heterogeneous FPGAs, including UTPlaceF which won ISPD FPGA Placement Contests before, and the current academic state-of-the-art elfPlace. Since placement may need to be called many times to achieve design closure, it is very important to ensure high scalability with increasing design complexity, e.g., on future 3D FPGAs. We will discuss how to scale-up and accelerate FPGA placement algorithms. We will also discuss some future directions, e.g., open-source to enable cross-team collaborations, and leveraging machine learning hardware/software for FPGA placement.
Bio: David Z. Pan is a Professor and Silicon Laboratories Endowed Chair at the Department of Electrical and Computer Engineering, The University of Texas at Austin. His research interests include bidirectional AI and IC interactions, electronic design automation, design for manufacturing, hardware security, and CAD for analog/mixed-signal ICs and emerging technologies. He has published over 400 refereed journal/conference papers and 8 US patents. He has served on many journal editorial boards and conference committees, including various leadership roles such as ICCAD 2019 General Chair, ASP-DAC 2017 TPC Chair, and ISPD 2008 General Chair. He has received many awards, including SRC Technical Excellence Award, 19 Best Paper Awards (at DAC, ICCAD, DATE, ASP-DAC, ISPD, HOST, etc.), DAC Top 10 Author Award in Fifth Decade, ASP-DAC Frequently Cited Author Award, Communications of ACM Research Highlights, ACM/SIGDA Outstanding New Faculty Award, NSF CAREER Award, IBM Faculty Award (4 times), and many international CAD contest awards. He has graduated 40 PhD students and postdocs who have won many awards, including the First Place of ACM Student Research Competition Grand Finals (twice, in 2018 and 2021), ACM/SIGDA Student Research Competition Gold Medal (three times), ACM Outstanding PhD Dissertation in EDA Award (twice), EDAA Outstanding Dissertation Award (twice), etc. He is a Fellow of IEEE and SPIE.

Portrait of Justine Sherry
Friday, September 24, 2021 | 2pm~3pm ET

Crossroads RV1: Exploring Data on the Move Applications
Justine Sherry, Carnegie Mellon University

Abstract: The Crossroads FPGA is uniquely positioned to support applications which operate with high throughput over data "on the move" between endpoints such as CPUs, GPUs, Storage, Network, and other platforms. In this talk, we will highlight two applications under development in the Crossroads center. First, the Pigasus IDS is a 100 Gbps hybrid FPGA + CPU platform for network security. Next, Norman is a new network dataplane for Linux that offloads the OS networking stack onto a Crossroads FPGA. We will discuss the high-level goals of Pigasus and Norman, some of their design details, and finally contrast their two different approaches to using Crossroads: Pigasus, as an "FPGA-centric" design with the CPU working in support of the FPGA, and Norman as a "CPU-centric" design, with the FPGA working in support of the CPU.
Bio: Justine Sherry is an assistant professor at Carnegie Mellon University. Her interests are in computer networking; her work includes middleboxes, networked systems, measurement, cloud computing, and congestion control. Her recent research focuses on new opportunities and challenges arising from the deployment of middleboxes -- such as firewalls and proxies -- as services offered by clouds and ISPs. Dr. Sherry received her PhD (2016) and MS (2012) from UC Berkeley, and her BS and BA (2010) from the University of Washington. She is a recipient of the SIGCOMM doctoral dissertation award, the David J. Sakrison prize, paper awards at USENIX NSDI and ACM SIGCOMM, and an NSF Graduate Research Fellowship. Most importantly, she is always on the lookout for a great cappuccino.

Portrait of Vaughn Betz
Friday, July 23, 2021 | 2pm~3pm ET

Verilog to Routing (VTR): A Flexible Open-Source CAD Flow to Explore and Target Diverse FPGA Architectures
Vaughn Betz, University of Toronto

Abstract: With the need for improvements in compute performance and efficiency beyond what process scaling can provide, FPGAs and FPGA-like programmable accelerators that can target a range of compute tasks efficiently are of interest in many application areas. However, creating a new CAD flow that can evaluate and map circuits to a new programmable architecture remains a daunting task, making flexible CAD flows that can be quickly retargeted to new architectures highly desirable.

This talk will give an overview of the Verilog-to-Routing (VTR) open source tool flow that addresses this need. We'll discuss recent enhancements to VTR that have broadened the range of architectures it can target, and allow it to not only evaluate new FPGA architectures, but also program the chosen architectures that are committed to silicon.

Architecture flexibility can have a cost, however, and a common conception in the FPGA Computer Aided Design (CAD) community is that architecture-specific algorithms and tools will significantly out-perform more general approaches which target a variety of FPGA architectures. In this talk we'll show how, through careful algorithm design and code architecture, VTR has improved result quality without architecture-specific code, challenging the idea that result quality and architecture flexibility are mutually exclusive. We will detail the key packing and routing enhancements that led to large improvements in wirelength and timing, while simultaneously reducing run time by over 6x.

Finally, we'll present efforts to use Reinforcement Learning to create more adaptable and efficient CAD algorithms. Taking placement as an example, we'll show how an RL-enhanced move generator can improve the quality/run-time trade-off of VTR's placement algorithm.
Bio: Vaughn Betz is a Professor and the NSERC/Intel Industrial Research Chair in Programmable Silicon at the University of Toronto. He is the original developer of the widely used VPR FPGA placement, routing and architecture evaluation CAD flow, and a lead developer in the VTR project that has built upon VPR. He co-founded Right Track CAD to commercialize VPR, and joined Altera upon its acquisition of Right Track CAD. Dr. Betz spent 11 years at Altera, ultimately as Senior Director of software engineering, and is one of the architects of the Quartus CAD system and the first five generations of the Stratix and Cyclone FPGA families. He holds 101 US patents and has published over 100 technical articles in the FPGA area, thirteen of which have won best or most significant paper awards. Dr. Betz is a Fellow of the IEEE and the National Academy of Inventors, and a Faculty Affiliate of the Vector Institute for Artificial Intelligence.

Portrait of James C. Hoe
Friday, July 16, 2021 | 2pm~3pm ET

From “Field Programmable” to “Programmable”
James C. Hoe, Carnegie Mellon University

Abstract: This talk is an overview of RV5.

To elevate FPGAs from logic to computing roles, we need to address the greater requirement for programmability beyond being a “field programmable” ASIC. A computing FPGA will be asked to do more tasks than could fit on the fabric at once and to do new tasks that are unknown before deployment. Moreover, dynamically managing the logic resource utilization is a presently under-tapped source of performance optimization, either by devoting available resources to only active tasks or by supporting tasks with differently-optimized design variants to suit changing conditions.

To maximally exploit the benefits of FPGAs’ programmability, the Intel/VMware Crossroads 3D FPGA Academic Research Center aims to make runtime reprogramming a regular mode of operation for Crossroads 3D-FPGA in future datacenter servers. This talk will motivate the need for a new, expanded design mindset by FPGA users and designers to fully pursue FPGAs’ programmability and dynamism. The talk next presents the Crossroads Center’s research toward realizing this new usage and programming on the Crossroads 3D-FPGA for datacenter applications. The talk will present a re-design of the Pigasus network intrusion detection/prevention system (IDS/IPS) following a design methodology to leverage the flexible and dynamic capabilities of FPGA targets.
Bio: James C. Hoe is a Professor of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in EECS from Massachusetts Institute of Technology in 2000 (S.M., 1994). He received his B.S. in EECS from UC Berkeley in 1992. He is interested in many aspects of computer architecture and digital hardware design, including the specific areas of FPGA architecture for computing; digital signal processing hardware; and high-level hardware design and synthesis. He is a Fellow of IEEE. For more information, please visit

Portrait of Zhipeng Zhao
Friday, July 9, 2021 | 2pm~3pm ET

Pigasus: Efficient Handling of Input-Dependent Streaming on FPGAs
Zhipeng Zhao, Carnegie Mellon University

Abstract: FPGAs have well-demonstrated success in many networking applications but have failed in accelerating Intrusion Detection and Prevention Systems (IDS/IPS). The root cause is the mismatch between the traditional static, fixed-performance FPGA design and the input-dependent behaviors of IDS/IPS. As a result, the design is provisioned to handle the worst case, losing the opportunity to use the resources allocated for the worst case to improve common-case performance.

In this talk, I will present an FPGA-based IDS/IPS called Pigasus, which is tailored to the common case, thus using minimal resources to extract maximum performance. Pigasus can achieve 100 Gbps using 1 FPGA and on average 5 CPU cores, 100x faster than a CPU-only baseline and 50x faster than existing FPGA designs. A natural objection to this design is that it will suffer from shifting workloads. In the second part, I will show how to use a disaggregated architecture and spillover mechanism to scale subcomponents of the system on demand to address changes in the traffic profile at both compile time and runtime.
Bio: Zhipeng Zhao is a Ph.D. candidate in Electrical and Computer Engineering at Carnegie Mellon University, advised by Prof. James C. Hoe. His research interests broadly lie at the intersection of FPGA and networking. Prior to CMU, he received a BS and an MS in Electrical Engineering, both from Beihang University, China.

Portrait of Derek Chiou
Friday, June 25, 2021 | 2pm~3pm ET

Soft Processor Overlays to Improve Time-to-Solution
Derek Chiou, The University of Texas at Austin

Abstract: Soft Processor Overlays are application-specific processors implemented in FPGA logic. Overlays can be more efficient than standard processors because they can be highly specialized and can get to a working implementation faster than dedicated circuits in FPGAs because they have software compile times and are more debuggable. In this talk, I will discuss prior work in overlays, how we plan to experiment with, develop, and use overlays in Research Vector 2 (RV2) of the Intel/VMware Crossroads 3D-FPGA Academic Research Center, and how those overlays will influence and interact with other research vectors in the center, such as investigations on 3D FPGA base-die architecture (RV3) and partial reconfiguration (RV5).
Bio: Derek Chiou is a Research Scientist in the Electrical and Computer Engineering Department at The University of Texas at Austin and a Partner Architect at Microsoft responsible for future infrastructure hardware architecture. He is a co-founder of the Microsoft Azure SmartNIC effort and led the Bing FPGA team to first deployment of Bing ranking on FPGAs. Until 2016, he was an associate professor at UT. Before UT, Dr. Chiou was a system architect and led the performance modeling team at Avici Systems, a manufacturer of terabit core routers. Dr. Chiou received his Ph.D., S.M. and S.B. degrees in Electrical Engineering and Computer Science from MIT.

Portrait of Sanil Rao
Friday, June 11, 2021 | 2pm~3pm ET

High-Performance Code Generation for Graph Applications
Sanil Rao, Carnegie Mellon University

Abstract: Software libraries have been a staple in computing, providing users with a maintained interface of functions for their applications. One such library, GraphBLAS, is used in the graph processing community because of its foundation in linear algebra, and its clear description of the overarching computation through its library calls. One issue that arises, however, is that these library calls have the potential to leave performance behind when looking for optimization, especially when one considers multiple library calls. Simply writing additional merged library calls is impractical given the importance of library clarity, and writing a general-purpose compiler that understands library call semantics would be infeasible. Therefore, we propose an approach from a higher level of abstraction, treating the GraphBLAS library as a specification, and generating code that understands the library's semantics. We transform library calls to their linear algebraic descriptions, and use pattern matching techniques to look for optimizations. Preliminary results show that our code generation system, SPIRAL, achieves performance matching that of hand-optimized codes, while keeping the clarity of both the original library and user application.
Bio: Sanil Rao is a second-year PhD student in Electrical and Computer Engineering at CMU advised by Prof. Franz Franchetti and part of the SPIRAL group. His research focus is in the area of programming languages and compilers, specifically code generation. Prior to CMU, he received a BS in Computer Science from the University of Virginia.

Portrait of Hugo Sadok
Friday, May 28, 2021 | 2pm~3pm ET

We Need Kernel Interposition over the Network Dataplane
Hugo Sadok, Carnegie Mellon University

Abstract: Kernel-bypass networking, which allows applications to circumvent the kernel and interface directly with NIC hardware, is one of the main tools for improving application network performance. However, allowing applications to circumvent the kernel makes it impossible to use tools (e.g., tcpdump) or impose policies (e.g., QoS and filters) that need to interpose on traffic sent by different applications running on a host. This makes maintainability and manageability a challenge for kernel-bypass applications. In response, we propose Kernel On-Path Interposition (KOPI), in which traditional kernel dataplane functionality is retained but implemented in a fully programmable SmartNIC. We hypothesize that KOPI can support the same tools and policies as the kernel stack while retaining the performance benefits of kernel bypass.
Bio: Hugo Sadok is a second-year PhD student in Computer Science at CMU advised by Prof. Justine Sherry and part of the SNAP Lab. His research interests are broadly in computer networks and computer systems. Prior to CMU, he received a BS in Electronic and Computer Engineering and an MS in Electrical Engineering, both from UFRJ.

Portrait of James C. Hoe
Friday, May 14, 2021 | 2pm~3pm ET

The Role for Programmable Logic in Future Datacenter Servers
(An Overview of the Crossroads Center)

James C. Hoe, Carnegie Mellon University

Abstract: This talk is a rerun for those affiliated with the center and is intended to introduce the center to an outside audience.

Field Programmable Gate Arrays (FPGAs) have been undergoing rapid and dramatic changes fueled by their expanding use in datacenter computing. Rather than serving as a compromise or alternative to ASICs, FPGA 'programmable logic' is emerging as a third paradigm of compute that stands apart from traditional hardware vs. software archetypes. The Crossroads 3D-FPGA Research Center has been formed with the goal of defining a new role for programmable logic in future datacenter servers. Guided by both the demands of modern network-driven, data-centric computing and the new capabilities from 3D integration, this center is developing the Crossroads 3D-FPGA as a new central fixture component on future server motherboards, serving to connect all server endpoints (network, storage, memory, CPU) intelligently. As a literal crossroads of data, a Crossroads 3D-FPGA can apply application-specific functions over data-on-the-move between any pair of server endpoints, intelligently steer data to the right core or accelerator, and reduce and compress the volume of data that needs to be moved between servers. This talk will overview the Crossroads 3D-FPGA concepts, as well as the associated set of research thrusts to pursue a full-stack solution spanning application, programming support, dynamic runtime, design automation, and architecture.
Bio: James C. Hoe is a Professor of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in EECS from Massachusetts Institute of Technology in 2000 (S.M., 1994). He received his B.S. in EECS from UC Berkeley in 1992. He is interested in many aspects of computer architecture and digital hardware design, including the specific areas of FPGA architecture for computing; digital signal processing hardware; and high-level hardware design and synthesis. He is a Fellow of IEEE. For more information, please visit

Portrait of Joseph Melber
Friday, April 23, 2021 | 2pm~3pm ET

Raising the Level of Abstraction for FPGA System Design
Joseph Melber, Carnegie Mellon University

Abstract: Current Field Programmable Gate Array (FPGA) programming abstractions place far more emphasis on reducing the design effort for processing kernels than on the memory access side of the design task. Designers are asked to build all the datapaths for on-chip buffering and data movements, as well as the state machines to coordinate these datapath activities. These datapaths are often ad-hoc efforts that are not generally reusable.

Software programmers leverage abstraction to simplify their design efforts; hardware designers should be supported by similar abstractions in order to increase FPGA programmability in modern computing systems. In this talk, I will focus on (1) re-imagining what memory should look like for FPGA hardware designers, and (2) virtualizing functionalities, devices and platforms for FPGA computing. I have been investigating a service-oriented abstraction and framework to simplify hardware design efforts for FPGA accelerators' memory systems. The goal is to enable FPGA accelerator designers to configure a specialized memory system that presents abstract semantic-rich memory operations, across diverse memory devices, without performance overhead. Current efforts also extend this abstraction to virtualize these functionalities across devices and architectures. I will conclude by discussing the future potential of my research and vision for FPGA computing.
Bio: Joseph Melber is a Ph.D. candidate in Electrical and Computer Engineering at Carnegie Mellon University. He is advised by Dr. James C. Hoe. His research interests are reconfigurable computing and computer architecture. His research focuses on memory systems and programming abstractions for heterogeneous FPGA computing systems. He received his M.S. in Electrical and Computer Engineering from Carnegie Mellon University in 2016, and B.S. in EE from the University at Buffalo in 2014.