Hardware support in RISC-V for ternary LLMs
2026-06-11, Poster Island C

Language models are becoming increasingly common, and their parameter counts keep growing, which demands very large memory capacities. One of the most common techniques for reducing their memory footprint is weight quantization, and ternary models, whose weights take only the values -1, 0, and +1, are one of its most extreme cases. So far, most hardware proposals have focused on FPGA-based accelerators to optimize inference in quantized models, while current general-purpose processors offer limited support (integers no narrower than 8 bits). In this work we present a preliminary analysis of the potential benefits of moving hardware support for quantization directly into the processor. To do so, we use a state-of-the-art inference framework for CPUs and Small Language Models, and evaluate the competitive advantages of having dedicated SIMD hardware for quantized operations. The results show a 2x speedup (in tokens/s) on a 350 MB Small Language Model, with the speedup tending to grow with model size, at a minimal hardware cost (a 1.25% increase in LUTs).
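
As a rough illustration of why ternary weights are attractive for dedicated hardware, the sketch below packs ternary weights at 2 bits each and computes a dot product using only additions and subtractions. The 2-bit encoding, the packing order, and the names `pack_ternary` and `ternary_dot` are assumptions made for this example; they are not the format or the instructions evaluated in this work.

```python
from typing import List

# Hypothetical 2-bit encoding (an assumption, not the poster's format):
# 0b00 -> 0, 0b01 -> +1, 0b10 -> -1; 0b11 is unused.
ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}


def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary weights four per byte, lowest bit pair first."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= ENCODE[w] << (2 * (i % 4))
    return bytes(packed)


def ternary_dot(packed: bytes, activations: List[int]) -> int:
    """Dot product with packed ternary weights: no multiplications needed."""
    acc = 0
    for i, a in enumerate(activations):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        if code == 0b01:    # weight +1: add the activation
            acc += a
        elif code == 0b10:  # weight -1: subtract the activation
            acc -= a
        # weight 0: the activation is skipped
    return acc


weights = [1, -1, 0, 1, 0, -1, 1, 1]
activations = [3, 5, 7, -2, 4, 1, 6, -8]
packed = pack_ternary(weights)

# Eight weights fit in 2 bytes, versus 8 bytes as 8-bit integers.
print(len(packed), ternary_dot(packed, activations))
```

A scalar loop like this spends most of its time unpacking bit fields, which is the kind of overhead that SIMD support for sub-byte operands would be expected to remove.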