Vishwa: A Scalable RISC-V Based GPGPU
2026-06-10, Poster Island C

The growing demand for artificial intelligence, scientific computing, and large-scale data analytics has significantly increased the need for massively parallel computing architectures. Modern GPUs provide high computational throughput by executing thousands of concurrent threads, but most existing GPU architectures remain proprietary, limiting open architectural innovation and research. This paper presents Vishwa, a scalable RISC-V based General Purpose GPU (GPGPU) architecture designed to enable open and extensible parallel computing platforms. The architecture adopts a hierarchical compute model composed of Vishwa Compute Clusters (VCLs) containing multiple Vishwa Compute Cores (VCCs) that execute threads using a Single Instruction Multiple Thread (SIMT) execution model. Each compute core integrates specialised Vishwa Matrix Cores (VMCs) designed to accelerate matrix-intensive operations commonly used in machine learning workloads. Work distribution across the architecture is managed by a global Vishwa Work Distributor (VWD) that schedules workloads across available compute clusters. The architecture is supported by a complete software ecosystem through the CHAKRA compiler stack, which integrates with LLVM to provide kernel compilation and runtime execution support. The compute core architecture has been implemented and validated on an FPGA platform, demonstrating functional correctness of the execution pipeline and SIMT execution model.


The Vishwa architecture follows a hierarchical GPU design composed of a host interface, a global Vishwa Work Distributor (VWD), multiple Vishwa Compute Clusters (VCLs), and a hierarchical memory subsystem connected to high-bandwidth memory. Kernels launched by the host processor are distributed across compute clusters through the VWD, which dynamically assigns workloads to available clusters to maximise parallel utilisation. Each Vishwa Compute Cluster integrates multiple Vishwa Compute Cores (VCCs) along with shared resources such as register files, shared memory, scheduling hardware, and instruction and data caches. This organisation enables the architecture to support a large number of concurrent threads while effectively hiding memory latency through hardware multithreading.
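The dynamic assignment performed by the VWD can be illustrated with a small software model. This is only a sketch under stated assumptions, not the VWD's actual hardware policy, which the abstract does not specify: the function name `distribute`, the per-item cost values, and the "earliest free cluster first" rule are all hypothetical choices made for illustration.

```python
import heapq


def distribute(work_costs, num_clusters):
    """Greedy model of dynamic work distribution (hypothetical policy).

    Each work item is handed to whichever cluster becomes free earliest.
    Returns the per-cluster item lists and the overall finish time.
    """
    # Heap of (time_when_free, cluster_id); every cluster starts idle at t=0.
    free_at = [(0, c) for c in range(num_clusters)]
    heapq.heapify(free_at)
    assignment = [[] for _ in range(num_clusters)]

    for item, cost in enumerate(work_costs):
        t, c = heapq.heappop(free_at)           # earliest-available cluster
        assignment[c].append(item)
        heapq.heappush(free_at, (t + cost, c))  # busy until t + cost

    makespan = max(t for t, _ in free_at)
    return assignment, makespan


if __name__ == "__main__":
    # One long work item and four short ones, spread over two clusters.
    print(distribute([4, 1, 1, 1, 1], num_clusters=2))
```

In this model the long item occupies one cluster while the short items all flow to the other, which is the behaviour a dynamic distributor is meant to provide: no cluster sits idle while work remains.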

The Vishwa Compute Clusters are interconnected through a shared cache hierarchy and backed by High Bandwidth Memory (HBM) for high-throughput data access. Each VCL integrates four Vishwa Compute Cores, forming a scalable compute unit capable of parallel execution. Every VCC can execute 32 threads in parallel, enabling fine-grained data parallelism across workloads. In addition, each VCC supports up to 16 pipelined thread groups, so that execution of different groups can overlap and keep the compute pipeline utilised.
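The figures above translate into thread capacity with simple arithmetic. The sketch below assumes that each of the 16 thread groups is 32 threads wide (the abstract gives both numbers but does not state this relationship explicitly) and treats the number of VCLs as a free parameter, since the abstract does not fix it. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VishwaConfig:
    num_vcls: int                # not stated in the abstract; chosen by the user
    vccs_per_vcl: int = 4        # from the abstract
    threads_per_group: int = 32  # threads a VCC executes in parallel
    groups_per_vcc: int = 16     # pipelined thread groups per VCC

    def parallel_threads_per_vcl(self) -> int:
        # Threads executing in the same cycle across one cluster.
        return self.vccs_per_vcl * self.threads_per_group

    def resident_threads_per_vcc(self) -> int:
        # Assumption: every thread group holds threads_per_group threads.
        return self.threads_per_group * self.groups_per_vcc

    def resident_threads_per_vcl(self) -> int:
        return self.vccs_per_vcl * self.resident_threads_per_vcc()

    def resident_threads_total(self) -> int:
        return self.num_vcls * self.resident_threads_per_vcl()


if __name__ == "__main__":
    cfg = VishwaConfig(num_vcls=2)  # two clusters, purely as an example
    print(cfg.parallel_threads_per_vcl())   # 4 * 32
    print(cfg.resident_threads_per_vcc())   # 32 * 16
    print(cfg.resident_threads_per_vcl())   # 4 * 32 * 16
    print(cfg.resident_threads_total())
```

Under these assumptions a single VCL executes 128 threads at once and keeps up to 2048 threads resident, which is the pool the hardware can switch among to hide memory latency.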

Ms. Prachi Pandey is a Senior Compiler Engineer at C-DAC, where she works on MLIR/LLVM-based compiler development for indigenous processors, GPUs, and AI accelerators. She has nearly two decades of experience in HPC, parallel programming, compilers, and runtime systems. Her research interests include compiler optimization techniques, automatic parallelizing compilers, performance portability for heterogeneous architectures, and parallelization strategies for HPC and AI workloads.

Scientist at Centre for Development of Advanced Computing (C-DAC), Bangalore, India