Optimizing Llama.cpp and GGML for RISC-V Vector (RVV)
Taimur Ahmad, Adeel Ahmad
Llama.cpp is a widely used open-source platform for running Large Language Models (LLMs) on CPUs, but its support for RISC-V remains limited compared to x86 and ARM. Many floating-point and quantized kernels lack RISC-V Vector (RVV) implementations, restricting the performance of existing hardware. This work improves the upstream RISC-V performance by vectorizing core floating-point kernels and extending support across multiple quantization types, enabling first-class support for RVV in Llama.cpp. VLEN-aware data repacking is introduced to accelerate GEMM and GEMV kernels for both floating point and quantization types. The optimized kernels are validated across VLENs up to 1024-bit, with benchmarking on Banana Pi BPI-F3 (256-bit VLEN) demonstrating considerable performance gains over upstream Llama.cpp. This work is supported by the RISC-V Software Ecosystem (RISE), with the vectorized kernels being upstreamed to Llama.cpp along with the test infrastructure.
Non-Blind submission
Plenary