Vincenzo Petrolo
Sessions
The growing computational demands of modern deep learning models, particularly Vision Transformers for computer vision, strain energy-limited smart devices and edge computing platforms. To address this challenge, we demonstrate a RISC-V SoC that incorporates ARCANE, a 512 KiB compute-capable Last-Level Cache that enables In-Cache Computing (ICC). ICC reduces the energy and latency overheads of moving data between the central processing unit (CPU) and main memory, a primary architectural bottleneck. To show the system's operational maturity, we deploy the 22-million-parameter DINOv2-S and the lightweight MobileNetV2 using the TVM framework, demonstrating that the platform can execute both state-of-the-art, computationally intensive computer vision workloads and standard image classification tasks within a single environment. The system is instantiated on a ZCU104 FPGA with 1 GiB of DDR4 memory, runs at 80 MHz, and provides a Linux operating environment with a dedicated suite of user applications. These applications quantify the performance advantage of ARCANE's near-memory computing paradigm over CPU-only execution. With a custom tensor ISA that remains transparent and lock-less to the application programmer, ARCANE contributes to the RISC-V ecosystem as one of the first In-Cache Computing IP cores integrated into a Linux operating environment.
Modern data-centric workloads increasingly expose the limitations of traditional von Neumann architectures, where excessive data movement limits throughput and energy efficiency.
While hardware accelerators improve performance, they often lack flexibility and still require costly memory transfers.
Existing in-memory and near-memory computing solutions reduce the memory bottleneck but introduce usability challenges because they constrain data placement.
ARCANE is a cache architecture that doubles as a tightly-coupled near-memory coprocessor.
The embedded RISC-V cache controller executes custom instructions offloaded by the host CPU, relying on near-memory vector processing units within the cache memory subsystem. This architecture hides memory synchronization and data mapping from application software while offering software-based Instruction Set Architecture (ISA) extensibility.
Evaluations demonstrate up to an 84x speedup on 8-bit convolution layers over a traditional system-on-chip, incurring only a 41.3% area overhead.
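For readers unfamiliar with the benchmarked workload, the following is a minimal reference sketch of what an 8-bit convolution layer computes when run in a plain CPU-only loop. It is not ARCANE's tensor ISA or the authors' benchmark code: the function names `conv2d_int8` and `mac_count`, the channel-first data layout, and the choice of stride 1 with no padding are all assumptions made for illustration.

```python
def conv2d_int8(x, w):
    """Reference 8-bit convolution (stride 1, no padding); illustrative only.

    x: input as nested lists, shape [C_in][H][W], values in [-128, 127]
    w: weights as nested lists, shape [C_out][C_in][K][K], values in [-128, 127]
    Returns nested lists of shape [C_out][H-K+1][W-K+1].
    """
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, k = len(w), len(w[0][0])
    out_h, out_w = h - k + 1, wd - k + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(c_out)]
    for o in range(c_out):
        for i in range(out_h):
            for j in range(out_w):
                # Products of 8-bit values overflow 8 bits, so hardware
                # typically accumulates in a wider (e.g. 32-bit) register;
                # Python integers are unbounded, so no overflow occurs here.
                acc = 0
                for c in range(c_in):
                    for di in range(k):
                        for dj in range(k):
                            acc += x[c][i + di][j + dj] * w[o][c][di][dj]
                out[o][i][j] = acc
    return out


def mac_count(c_in, c_out, k, out_h, out_w):
    """Multiply-accumulate operations performed by the loop nest above."""
    return c_in * c_out * k * k * out_h * out_w


# Small usage example: one 3x3 input channel, one 2x2 filter.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [[[[1, 0], [0, 1]]]]
print(conv2d_int8(x, w))          # [[[6, 8], [12, 14]]]
print(mac_count(1, 1, 2, 2, 2))   # 16
```

In this naive form, every multiply-accumulate reads one input byte and one weight byte, so operand traffic grows with the MAC count. That data movement is the kind of overhead the abstract says near-memory execution is meant to reduce.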