Adeel Ahmad

I am a compiler engineer at 10xEngineers, working on enabling the compilation of LLMs and vision models for custom hardware and accelerators using IREE, an MLIR-based AI compiler. I have experience writing optimized kernels for RISC-V Vector (RVV) and custom hardware, as well as in LLVM middle-end and backend development.


Sessions

06-11
13:30
10min
Optimizing IREE Compilation and End-to-End Object Detection Pipeline for RISC-V
Adeel Ahmad

This work enables optimized, end-to-end inference of object detection models on a RISC-V vector CPU. It includes optimized pre- and post-processing pipelines and enables efficient execution of the models at FP32, FP16, and INT8 precisions. IREE, an MLIR-based compiler, is used to compile and optimize the models. Model inference on the Banana Pi BPI-F3 is profiled to identify the top hotspot ops, and their compilation is then optimized in the IREE compilation pipeline, either by improving vectorization or by implementing ukernels. For accuracy validation, the mean Average Precision (mAP) is computed on the COCO validation dataset. This project is supported by the RISC-V Software Ecosystem (RISE), and all developed artifacts are open source.

Non-Blind submission
Poster Island C
06-11
17:00
15min
Optimizing Llama.cpp and GGML for RISC-V Vector (RVV)
Taimur Ahmad, Adeel Ahmad

Llama.cpp is a widely used open-source platform for running Large Language Models (LLMs) on CPUs, but its support for RISC-V remains limited compared to x86 and ARM. Many floating-point and quantized kernels lack RISC-V Vector (RVV) implementations, which limits performance on existing hardware. This work improves upstream RISC-V performance by vectorizing core floating-point kernels and extending support across multiple quantization types, enabling first-class support for RVV in Llama.cpp. VLEN-aware data repacking is introduced to accelerate GEMM and GEMV kernels for both floating-point and quantized types. The optimized kernels are validated across VLENs up to 1024 bits, with benchmarking on the Banana Pi BPI-F3 (256-bit VLEN) demonstrating considerable performance gains over upstream Llama.cpp. This work is supported by the RISC-V Software Ecosystem (RISE), and the vectorized kernels are being upstreamed to Llama.cpp along with the test infrastructure.

Non-Blind submission
Plenary