Emanuele Venieri
Sessions
The Monte Cimone project provides a RISC-V testbed for High-Performance Computing (HPC) clusters. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044 processor, an evolution of the SG2042 used in MCv2. We characterize MCv3 using the HPL and STREAM benchmarks coupled with power measurements, and compare it against two reference platforms: the Intel Xeon Platinum 8480+ (Sapphire Rapids) and the NVIDIA Grace CPU Superchip. Our results show that the SG2044 more than doubles single-core performance and improves scalability compared to the SG2042. MCv3 achieves an energy efficiency of 3.08 GFLOPS/W, a 10x improvement over MCv1 and within the range of x86-64 and Arm servers. In raw performance normalized to the SIMD/vector length, MCv3 at its peak-efficiency point (16 cores) achieves 46% of the performance of the Intel Sapphire Rapids server and 91% of that of the NVIDIA Grace CPU Superchip.
Modern processors delegate power and thermal management to dedicated Power Control Systems (PCS), communicating through kernel-mediated interfaces such as SCMI or the emerging RPMI.
Prior work has shown that end-to-end control quality is dominated by the power-management policy rather than by interface latency, leaving room to choose communication paradigms based on flexibility rather than raw latency.
We integrate Micro XRCE-DDS on ControlPULP, a RISC-V–based PCS, connecting it to a user-space Agent on an ARM host via a custom shared-memory transport.
This design removes protocol logic from kernel drivers and naturally supports multi-controller coordination through a shared middleware layer. Experiments on a ZCU102 FPGA at 20 MHz show 490 μs of active processing per publication, 0.8 MB/s throughput, and a memory footprint under 11.2 KB for 32 topics. The resulting latency is comparable to SCMI [1] while enabling a more flexible communication model.
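To put the reported timing in hardware terms, the 490 μs of active processing can be converted into clock cycles at the stated 20 MHz FPGA frequency. The short sketch below only does this arithmetic; the function name `active_cycles` is illustrative, and the publication-rate bound assumes (hypothetically) that publications are processed back to back with no other overhead, which the abstract does not state.

```python
def active_cycles(active_time_s: float, clock_hz: float) -> int:
    """Convert an active-processing time into clock cycles at a given frequency."""
    return round(active_time_s * clock_hz)


# Figures taken from the abstract: 490 us per publication at 20 MHz.
ACTIVE_TIME_S = 490e-6
CLOCK_HZ = 20e6

cycles_per_publication = active_cycles(ACTIVE_TIME_S, CLOCK_HZ)

# Assumption (not from the abstract): back-to-back publications with no idle
# time, so this is only an upper bound on the publication rate.
max_publications_per_s = 1.0 / ACTIVE_TIME_S

print(f"cycles per publication: {cycles_per_publication}")
print(f"upper bound on publications/s: {max_publications_per_s:.0f}")
```

At a higher controller clock the same cycle count would translate into a proportionally shorter processing time, provided the cycle count itself does not change.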
Matrix workloads, essential in generative AI, increasingly rely on ISA-level matrix extensions (e.g., AMX, SME). The attached matrix extension (AME) is one of the three ISA extensions (IME, AME, VME) under standardization in RISC-V. All of these matrix ISAs assume that the processor datapath is extended with dedicated matrix acceleration hardware. However, executing matrix kernels requires moving large tiles between memory and processor registers, which makes performance limited by memory bandwidth.
We investigate whether High Bandwidth Memory with Processing-in-Memory (HBM-PIM) can serve as an alternative implementation of AME instructions. We propose a PIM Execution Primitive (PEP) computational model that maps the AME ISA onto Samsung Aquabolt-XL HBM-PIM microkernels, using an outer-product dataflow to enable in-memory accumulation and remapping AME tile registers into memory regions, which makes it possible to chain AME instructions without leaving the memory.
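The outer-product dataflow mentioned above can be illustrated with a generic sketch: the product C = A·B is built as a sum of rank-1 updates, so the accumulator tile stays in one place while columns of A and rows of B stream past it. This is not the PEP model or an Aquabolt-XL microkernel; the function name and the plain Python lists standing in for memory-resident tiles are assumptions made purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def outer_product_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B must match")

    # Accumulator tile: updated in place, never moved during the computation.
    c = [[0.0] * n for _ in range(m)]

    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of A
        row = b[p]                         # p-th row of B
        # Rank-1 update: C += col (outer) row
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c


result = outer_product_matmul([[1.0, 2.0], [3.0, 4.0]],
                              [[5.0, 6.0], [7.0, 8.0]])
print(result)
```

Because every step only adds into the same accumulator, a dataflow of this shape keeps the partial results stationary, which is the property that makes accumulation next to the data attractive.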
Our experiments show AME tile multiplication reaching 14.9 GFLOP/s (59.4 FLOP/cycle) on an HBM-PIM pseudo-channel, demonstrating that HBM-PIM can serve as an implementation of RISC-V matrix extensions.
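The two reported figures can be cross-checked against each other: dividing the throughput by the per-cycle rate gives the clock frequency they jointly imply. The abstract does not state this frequency, so the value below is only a consistency check derived from rounded numbers, and the variable names are illustrative.

```python
# Figures taken from the abstract.
THROUGHPUT_GFLOP_S = 14.9
FLOP_PER_CYCLE = 59.4

# Implied clock = (FLOP/s) / (FLOP/cycle) = cycles/s.
implied_clock_hz = THROUGHPUT_GFLOP_S * 1e9 / FLOP_PER_CYCLE
implied_clock_mhz = implied_clock_hz / 1e6

print(f"implied clock: {implied_clock_mhz:.1f} MHz")
```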