This work presents a high-performance Vector Processing Unit (VPU) architecture designed to exploit Data-Level Parallelism (DLP) within the strict power and area constraints of embedded environments. To accelerate data-intensive tasks, the proposed modular architecture implements a subset of the RISC-V Vector (RVV) Zve32x sub-extension, focusing on essential 32-bit integer operations. The VPU is integrated as a co-processor to a CV32E20 core within the eXtendable Heterogeneous Energy-efficient Platform (X-HEEP) ecosystem. It uses the Core-V eXtension Interface (CV-X-IF) 1.0 for low-latency instruction offloading and the Open Bus Interface (OBI) v1.0 protocol for high-throughput data memory access during load/store operations. The implementation, which has a Vector Register Length (VLEN) of 128 bits, was validated through Register Transfer Level (RTL) simulation and in hardware on a Xilinx PYNQ-Z2 FPGA board. Performance was evaluated using standard data-parallel kernels: SAXPY, Indexed Arithmetic, and Matrix Multiplication (Matmul). Additionally, this work evaluates the RISC-V GNU Compiler Toolchain, comparing compiler auto-vectorization of standard C code against manual vectorization using RISC-V Vector C Intrinsics.