From Profiling to Performance: Optimizing Small Language Models on RISC-V Architectures
Small Language Models (SLMs) are increasingly important for edge AI, yet running them efficiently on RISC-V requires rigorous profiling to identify architectural bottlenecks. This work evaluates SLMs including Gemma3, Llama-3.2, Qwen-2.5, DeepSeek, and Phi-3.5 on the Tenstorrent Ascalon RISC-V core. We develop a profiling methodology to analyze workload distribution, which reveals that matrix multiplication (MatMul) accounts for approximately 90% of total compute across all evaluated models. Because emulating full models is computationally expensive, we extract these critical kernels for targeted benchmarking. Our implementation on the HAPS platform achieves significant performance gains over standard baselines. FP32 execution, used where maximum precision is required, is optimized by replacing a traditional SGEMM with a new high-performance implementation. INT8 execution, targeted at efficient inference, is accelerated by migrating from a standard RVV implementation to a specialized IGEMM implementation that uses VQDOT.
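To make the INT8 path concrete, the following is a minimal scalar sketch of the arithmetic an IGEMM kernel performs. It is not the implementation evaluated in this work. It assumes that VQDOT denotes a 4-way signed INT8 dot product accumulated into a 32-bit lane. The names `dot4_i8`, `igemm_ref`, and `matmul_naive` are hypothetical and used only for illustration.

```python
# Scalar reference model of an INT8 GEMM built from 4-way dot products.
# Assumption: "VQDOT" is a 4-way signed INT8 dot product that accumulates
# into a 32-bit lane. This models the arithmetic only, not the vector code.

def dot4_i8(acc, a4, b4):
    """Hypothetical helper: return acc plus the sum of four int8*int8 products."""
    assert len(a4) == 4 and len(b4) == 4
    for x, y in zip(a4, b4):
        assert -128 <= x <= 127 and -128 <= y <= 127  # inputs must fit in int8
        acc += x * y
    # Python integers do not wrap; real hardware would use a 32-bit accumulator.
    return acc

def igemm_ref(A, B, M, N, K):
    """Compute C[M][N] = A[M][K] @ B[K][N] in groups of four along K."""
    assert K % 4 == 0, "this sketch assumes K is a multiple of 4"
    # Transpose B so that each group of four K-values is contiguous,
    # as a packed layout for a 4-way dot product would require.
    Bt = [[B[k][n] for k in range(K)] for n in range(N)]
    C = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(0, K, 4):
                acc = dot4_i8(acc, A[m][k:k + 4], Bt[n][k:k + 4])
            C[m][n] = acc
    return C

def matmul_naive(A, B, M, N, K):
    """Plain triple-loop MatMul used as a correctness baseline."""
    return [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]
```

Comparing `igemm_ref` against `matmul_naive` on small INT8 matrices is one simple way to check that regrouping the reduction along K leaves the result unchanged. The same style of check applies when a kernel is extracted from a full model for standalone benchmarking.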