2026-06-09, Poster Island C
The fragmented RISC-V ecosystem demands portable, high-performance code generation for the Vector Extension (RVV 1.0). Upstream MLIR (LLVM 22.0) lacks two critical lowering stages needed for this: it cannot flatten dynamic memref matrix references into C pointers, nor emit Vector-Length-Agnostic (VLA) RVV intrinsics. This paper closes that gap with a six-stage hybrid MLIR–xDSL compilation workflow that automatically generates parameterized, hardware-aware C micro-kernels for GEMM entirely in Python, without modifying the MLIR C++ codebase. On a COTS BananaPi F3 board (SpaceMiT K1, 256-bit RVV 1.0), we show: (i) isolated micro-kernels match or exceed hand-written reference code (0.98×–1.05×), peaking at 16.2 GFLOPS at the optimal 16×15 tile; (ii) on BERT-Large transformer layers (B1–B5), generated micro-kernels consistently surpass OpenBLAS, reaching up to 12.2 GFLOPS against the baseline’s 5.1 GFLOPS (a 2.4× speedup) and maintaining an average 15–27% performance advantage across all layer dimensions.
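The two lowering steps named above can be illustrated with a small pure-Python emulation. This is a sketch under stated assumptions, not the workflow's actual output: `MemRef2D`, `row_major`, `setvl`, and `gemm_vla` are hypothetical names introduced here, the `setvl` rule is a simplified model of the RVV vector-length request, and the default of 8 lanes assumes FP32 elements at LMUL=1 on a 256-bit implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemRef2D:
    """Hypothetical stand-in for a dynamic 2-D memref descriptor."""
    data: List[float]   # flat buffer, playing the role of the C pointer
    offset: int         # element offset of the first entry
    rows: int
    cols: int
    stride_row: int     # elements to skip per row step
    stride_col: int     # elements to skip per column step

    def addr(self, i: int, j: int) -> int:
        # Flattening: a 2-D reference becomes one linear pointer offset.
        return self.offset + i * self.stride_row + j * self.stride_col


def row_major(rows: int, cols: int, values: List[float]) -> MemRef2D:
    """Builds a contiguous row-major descriptor (illustrative helper)."""
    return MemRef2D(list(values), 0, rows, cols, cols, 1)


def setvl(remaining: int, vlmax: int) -> int:
    """Simplified vector-length request: never more than vlmax lanes."""
    return min(remaining, vlmax)


def gemm_vla(a: MemRef2D, b: MemRef2D, c: MemRef2D, vlmax: int = 8) -> None:
    """Computes C += A @ B, strip-mined along N in chunks of vl elements.

    The loop never hard-codes the vector width, so the same code handles
    any vlmax and any tail length without a separate remainder loop.
    """
    for i in range(a.rows):
        j = 0
        while j < c.cols:
            vl = setvl(c.cols - j, vlmax)
            # Emulated vector load of one C row segment.
            acc = [c.data[c.addr(i, j + lane)] for lane in range(vl)]
            for k in range(a.cols):
                a_ik = a.data[a.addr(i, k)]  # scalar operand, broadcast
                # Emulated vector-scalar fused multiply-accumulate.
                for lane in range(vl):
                    acc[lane] += a_ik * b.data[b.addr(k, j + lane)]
            # Emulated vector store back to C.
            for lane in range(vl):
                c.data[c.addr(i, j + lane)] = acc[lane]
            j += vl
```

Calling `gemm_vla` with different `vlmax` values on the same inputs should give identical results, which is the property a vector-length-agnostic kernel relies on when moving between RVV implementations of different widths.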