Dongjie Xie
Sessions
Small Language Models (SLMs) are increasingly critical for edge AI, yet their performance on RISC-V requires rigorous profiling to identify architectural bottlenecks. This work evaluates SLMs including Gemma3, Llama-3.2, Qwen-2.5, DeepSeek, and Phi-3.5 on the Tenstorrent Ascalon RISC-V core. We developed a profiling methodology to analyze workload distribution, which revealed that matrix multiplication (MatMul) accounts for roughly 90% of total compute across all evaluated models. Because emulating full models is computationally expensive, we extract these critical kernels for targeted benchmarking. Our implementation on the HAPS platform achieves significant performance gains over standard baselines. FP32 execution, used for maximum precision, was optimized by moving from a traditional SGEMM to a new high-performance implementation. INT8 execution, targeted at efficient inference, was accelerated by migrating from a standard RVV implementation to a specialized IGEMM implementation that uses VQDOT.
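To make the INT8 kernel idea concrete, here is a minimal scalar reference sketch in Python. It is not the authors' IGEMM implementation. It assumes that VQDOT denotes a four-element signed 8-bit dot product accumulated into a 32-bit value, and the names `dot4_accumulate` and `igemm_reference` are hypothetical, chosen only for illustration.

```python
def dot4_accumulate(acc, a4, b4):
    """Add the dot product of four signed 8-bit values to an accumulator.

    Assumption: this mirrors what a 4-way int8 dot-product instruction
    computes for a single 32-bit lane.
    """
    assert len(a4) == len(b4) == 4
    for a, b in zip(a4, b4):
        # Inputs must fit in a signed 8-bit integer.
        assert -128 <= a <= 127 and -128 <= b <= 127
        acc += a * b
    return acc


def igemm_reference(A, B, M, N, K):
    """Compute C = A @ B for int8 inputs with wide integer accumulation.

    A is M x K and B is K x N, both given as lists of rows.
    K must be a multiple of 4 so the inner loop consumes groups of four.
    """
    assert K % 4 == 0
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0
            for k in range(0, K, 4):
                a4 = A[i][k:k + 4]                      # four values from row i of A
                b4 = [B[k + t][j] for t in range(4)]    # four values from column j of B
                acc = dot4_accumulate(acc, a4, b4)
            C[i][j] = acc
    return C


A = [[1, 2, 3, 4], [-1, -2, -3, -4]]
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
C = igemm_reference(A, B, M=2, N=2, K=4)
```

A sketch like this is mainly useful as a correctness oracle for an extracted kernel: an optimized version would typically pack B so that the four values along K are contiguous in memory, then compare its output against the reference.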
This paper recounts the incremental journey of taking Tenstorrent’s Ascalon RISC‑V CPU IP from RTL and emulation to a playable DOOM demo on a Synopsys prototyping platform. Along the way, we describe the problems we overcame and how we optimized our flows and the design. We close with a set of lessons and recommendations for teams who want to use emulation, prototyping, and realistic workloads like DOOM to de‑risk RISC‑V IP adoption and accelerate hardware/software co‑design.