Lessons Learned from Designing Decoupled-Access Hardware Accelerators in a RISC-V Framework
Sparse tensor operations are critical for scientific computing, but their irregular memory access patterns challenge traditional architectures. While domain-specific architectures offer efficiency, integrating them into mature SoCs often requires ISA modifications or complex driver development. This work addresses these challenges with a decoupled access unit for sparse matrix-vector multiplication (SpMV), integrated through Cohort, a coherent shared-memory queue interface that communicates with a CVA6 RISC-V core. To mitigate inter-tile communication overhead, we introduce a hybrid tiling approach that co-locates the access unit and the core in the same tile, enabling direct data delivery into the core's private cache. This hybrid architecture achieves geometric mean speedups of 1.33× and 1.50× for the COO and CSR formats, respectively, over traditional multi-tile configurations. These results demonstrate that offloading memory traversal to a programmable data-flow engine, combined with optimized placement in the memory hierarchy, efficiently accelerates irregular workloads with minimal intrusion.
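To make the access/execute split concrete, the following is a minimal software sketch, not the hardware design described here. It shows reference SpMV kernels for the CSR and COO formats, then separates the CSR kernel into a traversal side that performs the indirect loads and a compute side that only consumes operands from a queue. All function names are hypothetical, and the Python `deque` is merely a stand-in for a shared-memory queue; it does not model Cohort's interface, coherence, or timing.

```python
from collections import deque


def spmv_csr(row_ptr, col_idx, vals, x):
    """Reference CSR SpMV: y = A @ x."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            # x[col_idx[j]] is the irregular, data-dependent load.
            y[i] += vals[j] * x[col_idx[j]]
    return y


def spmv_coo(rows, cols, vals, x, n_rows):
    """Reference COO SpMV: one (row, col, value) triple per nonzero."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def access_unit_csr(row_ptr, col_idx, vals, x, queue):
    """Hypothetical access side: walks the CSR structure, gathers operands,
    and enqueues (row, matrix value, vector value) tuples."""
    for i in range(len(row_ptr) - 1):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            queue.append((i, vals[j], x[col_idx[j]]))


def execute_core(queue, n_rows):
    """Hypothetical execute side: sees only a dense operand stream and
    performs multiply-accumulate, with no index arithmetic of its own."""
    y = [0.0] * n_rows
    while queue:
        i, a, xv = queue.popleft()
        y[i] += a * xv
    return y


# Example matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in both formats.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
coo_rows = [0, 0, 1, 2, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

q = deque()
access_unit_csr(row_ptr, col_idx, vals, x, q)
y_decoupled = execute_core(q, n_rows=3)
```

In this sketch the access side runs to completion before the execute side starts; in a decoupled design the two sides would run concurrently, so the cost of delivering each operand to the consumer sits on the critical path, which is the overhead that placement within the memory hierarchy aims to reduce.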