Rama Malladi


Sessions

06-09
13:00
10min
Why Edges Matter: A Case Study on Performance Improvements for OpenBLAS GEMM on RISC-V
Rama Malladi, Chip Kerchner

Matrix multiplication (GEMM) sits at the heart of scientific computing, data analytics, and modern AI workloads. While much attention is given to peak throughput and ideal matrix sizes, real-world performance often hinges on the “edges”, i.e., the non-ideal dimensions, cache boundaries, and vector tail cases that quietly dominate execution time. In this paper, we present a practical case study of optimizing GEMM in OpenBLAS for RISC-V vector architectures. We show how careful handling of edge conditions, cache reuse, and vectorization strategy can deliver measurable performance gains. Techniques include maximizing cache and register reuse with single-pass data traversal, swapping operands and deferring transposition for easier storage, combining full- and half-vector operations with scalar instructions to efficiently handle irregular dimensions, and leveraging strided segmented load/store vector intrinsics to sustain throughput even in non-ideal layouts. These optimizations are not just academic: small inefficiencies in GEMM propagate directly into AI inference latency and energy consumption. By focusing on edge cases and architectural nuance, we can unlock meaningful improvements for real-world workloads. The gains are substantial; for example, the efficiency of a 6 × 3072 × 3072 SGEMM MatMul improves from 23.5% to 68.7% of peak.
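
The idea of combining full-vector, half-vector, and scalar work on an irregular dimension can be illustrated with a minimal sketch in plain Python. This is not the OpenBLAS kernel: the function names `split_edge` and `gemm_edge_aware`, the choice of the N dimension as the vectorized one, and the lane count of 8 are all assumptions made for illustration, and Python lists stand in for vector registers.

```python
def split_edge(n, full_lanes):
    """Split n elements into (full-vector, half-vector, scalar) piece counts.

    full_lanes is a hypothetical number of elements per full vector; it must
    be even so that a half vector holds full_lanes // 2 elements.
    """
    if n < 0 or full_lanes < 2 or full_lanes % 2:
        raise ValueError("n must be >= 0 and full_lanes an even number >= 2")
    half_lanes = full_lanes // 2
    full, rest = divmod(n, full_lanes)       # as many full vectors as fit
    half, scalar = divmod(rest, half_lanes)  # at most one half vector, then scalars
    return full, half, scalar


def gemm_edge_aware(a, b, full_lanes=8):
    """Compute C = A @ B on lists of lists, walking each row of C in
    full-vector, half-vector, and scalar chunks (illustrative only)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    full, half, scalar = split_edge(n, full_lanes)
    # Chunk widths along N: full vectors first, then a half vector, then scalars.
    widths = [full_lanes] * full + [full_lanes // 2] * half + [1] * scalar
    for i in range(m):
        j = 0
        for w in widths:
            acc = [0.0] * w  # stands in for one accumulator register of w lanes
            for p in range(k):
                a_ip = a[i][p]  # scalar broadcast across the lanes
                for lane in range(w):
                    acc[lane] += a_ip * b[p][j + lane]
            c[i][j:j + w] = acc
            j += w
    return c
```

Under these assumptions, `split_edge(6, 8)` yields no full vectors, one half vector, and two scalar elements, so a dimension of 6 is covered without processing any padding lanes.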

Non-Blind submission
Poster Island C
06-09
14:10
10min
From Profiling to Performance: Optimizing Small Language Models on RISC-V Architectures
Dongjie Xie, Rama Malladi, Jose Arnau, Chip Kerchner

Small Language Models (SLMs) are increasingly critical for edge AI, yet achieving good performance on RISC-V requires rigorous profiling to identify architectural bottlenecks. This work evaluates the performance of SLMs, including Gemma3, Llama-3.2, Qwen-2.5, DeepSeek, and Phi-3.5, on the Tenstorrent Ascalon RISC-V core. We developed a profiling methodology to analyze workload distribution, which revealed that Matrix Multiplication (MatMul) contributes approximately 90% of total compute across all evaluated models. Because emulating full models is computationally expensive, we extract these critical kernels for targeted benchmarking. Our implementation on the HAPS platform achieves significant performance improvements over standard baselines. FP32 execution, used for maximum precision, was optimized by moving from traditional SGEMM to a new high-performance implementation. INT8 execution, targeted at efficient inference, was accelerated by migrating from standard RVV to a specialized IGEMM implementation that uses VQDOT.
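
Why a roughly 90% MatMul share justifies kernel-level optimization can be seen with Amdahl's law. The sketch below assumes that the reported share of compute can be read as a share of run time; the function names and the kernel speedup factors are hypothetical and are not results from this work.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: end-to-end speedup when only one kernel gets faster.

    kernel_fraction: share of baseline run time spent in the kernel (0..1).
    kernel_speedup:  factor by which the kernel alone is accelerated (> 0).
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def speedup_ceiling(kernel_fraction):
    """Upper bound on end-to-end speedup if the kernel became infinitely fast."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie in [0, 1]")
    if kernel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - kernel_fraction)


if __name__ == "__main__":
    # Hypothetical kernel speedups, chosen only to show the trend.
    for s in (2.0, 4.0, 8.0):
        print(f"kernel x{s:g} -> overall x{overall_speedup(0.9, s):.2f}")
    print(f"ceiling at 90% share: x{speedup_ceiling(0.9):.1f}")
```

With a 90% share, the end-to-end gain is bounded at about 10 times no matter how fast the MatMul kernel becomes, which also indicates when the remaining 10% of the workload starts to matter.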

Non-Blind submission
Poster Island C