A Holistic Approach to Attached Matrix Extension on RISC-V: From ISA to Software Stack
2026-06-11, Poster Island B

The increasing computational demands of modern AI workloads necessitate a holistic architectural approach to AI acceleration on RISC-V processors. This talk presents the XuanTie Tensor Processing Engine (TPE), a RISC-V-based Attached Matrix Extension (AME) engine designed to address AI acceleration across three dimensions: ISA, microarchitecture, and software ecosystem.
At the ISA level, the TPE adopts the in-progress RISC-V AME specification, featuring dedicated tensor registers and a comprehensive instruction set encompassing matrix multiply-accumulate, element-wise, special function, reduction, and load/store operations with broad data type support including INT4, FP8, FP16, and micro-scaling formats.
At the microarchitecture level, the design incorporates a matrix engine achieving 2 TOPS/GHz at INT8/FP8, a concurrent vector engine with hardware-accelerated non-linear functions, and a layered memory subsystem featuring a coherent tensor cache and data prefetch engine.
A full-stack software ecosystem, spanning from the LLVM toolchain to the graph execution runtime, completes the solution.
Experimental results on the XuanTie C930 cluster demonstrate 99% FP16 GEMM utilization. We discuss key design trade-offs and implications for the evolving RISC-V AME standard.
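
To make the throughput and utilization figures above concrete, the following is a minimal back-of-the-envelope sketch, not the authors' measurement methodology. It assumes the common convention that one multiply-accumulate counts as two operations. The abstract does not state an FP16 peak rate, so the half-rate FP16 figure and the 1024 x 1024 x 1024 problem size are purely illustrative assumptions, and all function names are hypothetical.

```python
OPS_PER_MAC = 2  # ASSUMPTION: one multiply-accumulate is counted as two operations


def macs_per_cycle(tops_per_ghz: float) -> float:
    """Convert a TOPS/GHz rating into multiply-accumulates per clock cycle."""
    # 1 TOPS/GHz = 1e12 ops/s divided by 1e9 cycles/s = 1000 ops per cycle.
    ops_per_cycle = tops_per_ghz * 1e12 / 1e9
    return ops_per_cycle / OPS_PER_MAC


def gemm_utilization(m: int, n: int, k: int,
                     cycles: float, peak_macs_per_cycle: float) -> float:
    """Fraction of peak MAC throughput achieved by an M x N x K GEMM."""
    ideal_cycles = (m * n * k) / peak_macs_per_cycle
    return ideal_cycles / cycles


def cycles_at_utilization(m: int, n: int, k: int,
                          utilization: float, peak_macs_per_cycle: float) -> float:
    """Cycle count implied by a given utilization for an M x N x K GEMM."""
    ideal_cycles = (m * n * k) / peak_macs_per_cycle
    return ideal_cycles / utilization


if __name__ == "__main__":
    # 2 TOPS/GHz at INT8/FP8 is the figure quoted in the abstract.
    int8_peak = macs_per_cycle(2.0)
    # ASSUMPTION: FP16 at half the INT8/FP8 rate; the abstract gives no FP16 peak.
    fp16_peak = int8_peak / 2
    # Hypothetical problem size, chosen only for illustration.
    budget = cycles_at_utilization(1024, 1024, 1024, 0.99, fp16_peak)
    print(f"INT8 peak: {int8_peak:.0f} MACs/cycle")
    print(f"Cycles for a 1024^3 FP16 GEMM at 99% utilization: {budget:.0f}")
```

Under these assumptions, 2 TOPS/GHz corresponds to 1000 INT8 MACs per cycle, and utilization is simply the ratio of the ideal cycle count to the measured one.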


The XuanTie TPE demonstrates a production-grade, RISC-V-native Attached Matrix Extension approach to AI acceleration that is both performant and practical. Through this work, we aim to contribute to the RISC-V community in several important ways:
• Advancing Community-Driven Standardization of AME: Our implementation is closely aligned with the RISC-V AME Task Group's ongoing specification efforts. By sharing our design choices, the trade-offs we encountered, and the real-world workload requirements that shaped our ISA decisions, we hope to provide valuable feedback to the standardization process. We believe that practical, silicon-validated implementations are essential for grounding specification discussions in engineering reality.
• Highlighting the Unique Advantages of RISC-V for AI Extension: The openness and modularity of RISC-V make it uniquely suited for domain-specific acceleration. Unlike proprietary ISAs, where AI extensions must fit within rigid architectural constraints, RISC-V allows the community to co-evolve the ISA, microarchitecture, and software stack together. The TPE is a concrete example of this co-design philosophy: the AME ISA, the hardware engines, and the software ecosystem were developed in concert, each informing and refining the others.

Qiu Jing is a CPU design engineer at Alibaba DAMO Academy, where he has been involved in the design of multiple XuanTie RISC-V processors. He served as the Acting Chair during the inception phase of the RISC-V AME (Attached Matrix Extension) Task Group, and has been actively contributing to the TG's discussions since its establishment. His work focuses on bridging the gap between AI workload requirements and RISC-V ISA design, with particular emphasis on vector/matrix extension architecture and its hardware implementation.