Optimizing Llama.cpp and GGML for RISC-V Vector (RVV)
2026-06-11, Plenary

Llama.cpp is a widely used open-source platform for running Large Language Models (LLMs) on CPUs, but its support for RISC-V remains limited compared to x86 and ARM. Many floating-point and quantized kernels lack RISC-V Vector (RVV) implementations, which restricts the performance achievable on existing hardware. This work improves upstream RISC-V performance by vectorizing core floating-point kernels and extending support across multiple quantization types, enabling first-class support for RVV in Llama.cpp. VLEN-aware data repacking is introduced to accelerate GEMM and GEMV kernels for both floating-point and quantized types. The optimized kernels are validated for VLENs up to 1024 bits, and benchmarking on the Banana Pi BPI-F3 (256-bit VLEN) demonstrates considerable performance gains over upstream Llama.cpp. This work is supported by the RISC-V Software Ecosystem (RISE), and the vectorized kernels are being upstreamed to Llama.cpp along with the test infrastructure.
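To make the idea of VLEN-aware repacking concrete, the following is a minimal, portable C sketch under stated assumptions, not the kernels being upstreamed. It assumes 32-bit float weights, a row count divisible by the lane count, and a simple row-interleaved tile layout; the names `lanes_for_vlen`, `repack_tiles`, and `gemv_packed` are hypothetical and chosen only for illustration. The scalar inner loops stand in for what would be a single vector load and a vector-scalar multiply-accumulate on RVV hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of 32-bit float lanes in one vector register
 * of the given VLEN (e.g. 256-bit VLEN gives 8 lanes at LMUL=1). */
static size_t lanes_for_vlen(size_t vlen_bits) {
    return vlen_bits / 32;
}

/* Repack a row-major rows x cols matrix into tiles of `lanes` rows.
 * Inside a tile, the `lanes` values belonging to column k are stored
 * contiguously, so one unit-stride vector load would fetch one element
 * from each row of the tile. Assumes rows is a multiple of lanes. */
static void repack_tiles(const float *src, float *dst,
                         size_t rows, size_t cols, size_t lanes) {
    assert(lanes > 0 && rows % lanes == 0);
    for (size_t t = 0; t < rows / lanes; ++t)
        for (size_t k = 0; k < cols; ++k)
            for (size_t l = 0; l < lanes; ++l)
                dst[(t * cols + k) * lanes + l] = src[(t * lanes + l) * cols + k];
}

/* GEMV (y = A * x) over the packed layout. Each innermost loop over `l`
 * models one vector operation: load `lanes` weights, multiply by the
 * scalar x[k], and accumulate into `lanes` partial outputs. */
static void gemv_packed(const float *packed, const float *x, float *y,
                        size_t rows, size_t cols, size_t lanes) {
    assert(lanes > 0 && rows % lanes == 0);
    for (size_t t = 0; t < rows / lanes; ++t) {
        float *acc = y + t * lanes;
        for (size_t l = 0; l < lanes; ++l)
            acc[l] = 0.0f;
        for (size_t k = 0; k < cols; ++k) {
            const float *v = packed + (t * cols + k) * lanes;
            for (size_t l = 0; l < lanes; ++l)
                acc[l] += v[l] * x[k];
        }
    }
}
```

Because the tile width is derived from VLEN rather than hard-coded, the same packing routine would produce 4-, 8-, or 32-lane tiles for 128-, 256-, or 1024-bit implementations. Quantized formats would additionally need their block scales interleaved, which this sketch does not attempt.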


This work is performed under RISE RP-014 - Optimizing Llama.cpp and GGML for RVV. All artifacts are open source and either upstreamed to Llama.cpp or in the process of being upstreamed. This contribution improves the baseline performance of AI workloads on RISC-V vector hardware, which supports adoption among AI developers. It also provides first-class software infrastructure that RISC-V hardware makers can use to test, benchmark, and optimize standardized hardware solutions for AI workloads.

I am a compiler engineer at 10xEngineers, working on enabling the compilation of LLMs and vision models for custom hardware and accelerators using IREE, an MLIR-based AI compiler. I have experience writing optimized kernels for RISC-V Vector (RVV) and custom hardware, as well as in LLVM middle-end and backend development.
