Sparse Systolic Array ML Accelerator

This repository contains a 16 x 16 INT8 systolic-array matrix-multiply accelerator written in SystemVerilog. It also includes directed verification testbenches, benchmark testbenches, CPU comparison code, and a project-specific ASIC flow configuration that targets SKY130 through Hammer with Cadence Genus, Innovus, Tempus, and Voltus.

The project explores the performance, energy, and physical-design tradeoffs of a small matrix-multiply accelerator for machine-learning workloads.

Full technical report: Systolic Array ML Accelerator Report

Highlights

16 x 16 systolic array of processing elements
INT8 activation and weight inputs
INT32 partial-sum accumulation
Weight-stationary dataflow
Ping-pong scratchpad for activation and weight tile storage
Output buffer that accumulates partial sums across K tiles
Sparse activation detection with multiplier/adder switching reduction
Rectangular GEMM benchmark support for ML-style shapes
ASIC implementation in SKY130 with post-route timing, area, and power reports

Architecture

At a high level, the accelerator contains:

accel_top: top-level integration
dma_ctrl: tile-load and compute controller
scratchpad: ping-pong activation/weight storage
sparsity_unit: zero-activation detection and skip-signal generation
array_top: 16 x 16 systolic PE grid
pe: pipelined INT8 multiply and INT32 accumulate processing element
output_buffer: C-tile accumulation and host readback

The accelerator computes tiled GEMMs of the form:

C[M, N] += A[M, K] x B[K, N]

The hardware operates on 16 x 16 tiles. Larger rectangular GEMMs are handled by the testbench/software loop, which feeds the accelerator one tile product at a time and uses the output buffer to accumulate partial sums across K tiles.
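
The tiling scheme described above can be sketched in software. The following is a minimal pure-Python model, not the repository's golden-model code: the function names (`tile_product`, `tiled_gemm`, `reference_gemm`) and the loop order are assumptions made for illustration. It hands one 16 x 16 tile product at a time to a stand-in for the accelerator and accumulates the results across K tiles, as the output buffer does.

```python
import random

TILE = 16  # hardware tile edge, taken from the description above


def get_tile(mat, r0, c0):
    """Slice a TILE x TILE block whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + TILE] for row in mat[r0:r0 + TILE]]


def tile_product(a_tile, b_tile):
    """One 16 x 16 tile product: a stand-in for a single accelerator pass."""
    return [[sum(a_tile[i][x] * b_tile[x][j] for x in range(TILE))
             for j in range(TILE)] for i in range(TILE)]


def tiled_gemm(a, b, m, k, n):
    """Compute C[M, N] = A[M, K] x B[K, N] one tile product at a time.

    Assumed loop order (hypothetical): output tile first, then K tiles.
    """
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, k, TILE):
                # Partial sums for this C tile accumulate across K tiles,
                # mirroring the role of the output buffer.
                p = tile_product(get_tile(a, i0, k0), get_tile(b, k0, j0))
                for i in range(TILE):
                    for j in range(TILE):
                        c[i0 + i][j0 + j] += p[i][j]
    return c


def reference_gemm(a, b, m, k, n):
    """Untiled triple-loop GEMM used as the comparison model."""
    return [[sum(a[i][x] * b[x][j] for x in range(k))
             for j in range(n)] for i in range(m)]


if __name__ == "__main__":
    random.seed(0)
    m, k, n = 32, 48, 16
    a = [[random.randint(-128, 127) for _ in range(k)] for _ in range(m)]
    b = [[random.randint(-128, 127) for _ in range(n)] for _ in range(k)]
    print("match:", tiled_gemm(a, b, m, k, n) == reference_gemm(a, b, m, k, n))
```

Python integers do not overflow, so this sketch does not model INT32 wraparound. For INT8 inputs and the K values used in this README, the partial sums stay well inside the INT32 range.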

Sparse Mode

Sparse mode detects zero activations and suppresses the corresponding MAC datapath activity. In the current design, sparsity reduces switching power but does not reduce cycle count because the controller still follows the dense tile schedule.
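
As a rough behavioral illustration of that trade-off, the sketch below models one accumulation chain in Python. It is not derived from the RTL: the function name `mac_chain` and the idea of counting "active" MACs as a proxy for switching activity are assumptions. It shows that skipping zero activations leaves the result and the cycle count unchanged while reducing the number of MAC operations that actually toggle the datapath.

```python
def mac_chain(activations, weights, sparse_mode):
    """Accumulate sum(a * w) over one chain of MACs.

    Returns (psum, active_macs, cycles). In sparse mode a zero activation
    suppresses the multiply-add, but the slot is still issued, so the
    cycle count follows the dense schedule.
    """
    psum = 0
    active_macs = 0
    cycles = 0
    for a, w in zip(activations, weights):
        cycles += 1  # dense tile schedule: one slot per operand pair
        skip = sparse_mode and a == 0
        if not skip:
            psum += a * w  # only non-skipped slots exercise the datapath
            active_macs += 1
    return psum, active_macs, cycles


if __name__ == "__main__":
    acts = [0, 3, 0, -2]
    wts = [5, 7, 9, 11]
    print("dense :", mac_chain(acts, wts, sparse_mode=False))
    print("sparse:", mac_chain(acts, wts, sparse_mode=True))
```

Skipping is safe because a zero activation contributes nothing to the partial sum, so both modes produce the same value.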

Post-PAR, activity-based Voltus power analysis shows switching power falling as activation sparsity increases. From dense to 90% sparse, switching power drops by about 8.0% and total power by about 4.2%:

Mode	Total Power	Switching Power
Dense	538.295 mW	160.672 mW
Sparse 50%	533.070 mW	157.904 mW
Sparse 75%	524.987 mW	153.197 mW
Sparse 90%	515.493 mW	147.764 mW

Verification

Verification uses directed SystemVerilog testbenches simulated with Synopsys VCS. Each major block has a dedicated testbench:

Testbench	Purpose
`tb/pe_tb.sv`	PE MAC, skip behavior, and pipeline timing
`tb/array_top_tb.sv`	Systolic-array data movement and psum output
`tb/scratchpad_tb.sv`	Banked scratchpad write/read and ping-pong swap
`tb/dma_ctrl_tb.sv`	Controller state sequencing and output controls
`tb/sparsity_unit_tb.sv`	Zero detection, skip alignment, and counters
`tb/accel_top_tb.sv`	Integrated 16 x 16 GEMM against a golden model
`tb/accel_benchmark_tb.sv`	Square tiled GEMM benchmarks
`tb/accel_rect_benchmark_tb.sv`	Rectangular ML-style GEMM benchmarks

The rectangular benchmark supports:

+M=<rows> +K=<inner dimension> +N=<columns>
+MODE=<dense|sparse50|sparse75|sparse90>
+CHECK=<full|sample|checksum|none>

Selected validation cases passed with zero mismatches, including dense and sparse rectangular GEMMs.

Performance Summary

Accelerator runtime is reported from simulated cycle count at a 5.0 ns clock period:

hardware_time = cycles x 5.0 ns

Representative accelerator results:

Shape	Time / Matmul	Throughput	Avg Power	Energy Efficiency
128 x 128 x 128	0.353 ms	5.936 GMAC/s	0.5381 W	11.03 GMAC/J
256 x 256 x 256	2.662 ms	6.302 GMAC/s	0.5381 W	11.71 GMAC/J
1024 x 768 x 768	91.914 ms	6.571 GMAC/s	0.5381 W	12.21 GMAC/J
1024 x 768 x 3072	367.657 ms	6.571 GMAC/s	0.5381 W	12.21 GMAC/J
1024 x 3072 x 768	361.759 ms	6.678 GMAC/s	0.5381 W	12.41 GMAC/J
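
The columns in the table are related by simple arithmetic, sketched below in Python. The helper names are hypothetical, and the sketch assumes the throughput column is total MACs divided by runtime and the efficiency column is throughput divided by average power. Because the table's times are rounded, values recomputed from them agree with the table only to about three significant figures.

```python
CLOCK_PERIOD_NS = 5.0   # clock period used for the reported runtimes
AVG_POWER_W = 0.5381    # average power column from the table above


def hardware_time_s(cycles, period_ns=CLOCK_PERIOD_NS):
    """hardware_time = cycles x clock period, returned in seconds."""
    return cycles * period_ns * 1e-9


def throughput_gmacs(m, k, n, time_s):
    """GMAC/s for one matmul; the MAC count is the product of the three dimensions."""
    return (m * k * n) / time_s / 1e9


def efficiency_gmac_per_j(gmacs, power_w=AVG_POWER_W):
    """GMAC/J: throughput (GMAC/s) divided by average power (W)."""
    return gmacs / power_w


if __name__ == "__main__":
    # 128 x 128 x 128 row, using the rounded 0.353 ms runtime from the table.
    t = 0.353e-3
    g = throughput_gmacs(128, 128, 128, t)
    print(f"{g:.2f} GMAC/s, {efficiency_gmac_per_j(g):.2f} GMAC/J")
```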

The table below compares the accelerator with measured results on two representative CPUs. The CPU baseline is a simple C implementation, not an optimized BLAS/GEMM library, so it does not represent peak CPU performance.

Shape	Accelerator	Intel i5-1335U	Ryzen 9 7900X
128 x 128 x 128	5.936 GMAC/s, 11.03 GMAC/J	6.97 GMAC/s, 0.460 GMAC/J	14.47 GMAC/s, 0.182 GMAC/J
256 x 256 x 256	6.302 GMAC/s, 11.71 GMAC/J	8.94 GMAC/s, 0.518 GMAC/J	14.42 GMAC/s, 0.168 GMAC/J
1024 x 768 x 768	6.571 GMAC/s, 12.21 GMAC/J	7.84 GMAC/s, 0.456 GMAC/J	13.45 GMAC/s, 0.174 GMAC/J
1024 x 768 x 3072	6.571 GMAC/s, 12.21 GMAC/J	8.09 GMAC/s, 0.430 GMAC/J	14.30 GMAC/s, 0.184 GMAC/J
1024 x 3072 x 768	6.678 GMAC/s, 12.41 GMAC/J	5.38 GMAC/s, 0.309 GMAC/J	9.66 GMAC/s, 0.119 GMAC/J

ASIC Results

The final reported ASIC implementation was generated with the 4.9 ns timing configuration in asic/cfg/timing_4p9.yml. Generated build directories are not committed by default; key results are summarized below.

Timing signoff was performed with Tempus against the 5.0 ns (200 MHz) clock target, slightly looser than the 4.9 ns configuration used for the build:

Check	Corner	Worst Slack	Status
Setup	ss 100 C / 1.60 V	+0.057 ns	PASS
Hold	ff -40 C / 1.95 V	+0.175 ns	PASS

Physical implementation summary:

Metric	Value
Clock target	5.0 ns / 200 MHz
Die size	5 mm x 5 mm
Cell area	2.262 mm²
Density	8.6%
Instances	231 K leaf cells
Activity-based dense power	about 538 mW

Physical verification status:

DRC: no rule violations listed in the Magic DRC output
LVS: internal subcircuits match, but a top-level VDD/VSS power-pin mismatch remains in the Netgen LVS report

Repository Layout

source/                                  SystemVerilog RTL
tb/                                      SystemVerilog testbenches
benchmarks/                              CPU benchmark source and notes
scripts/                                 helper scripts and golden-model utilities
asic/                                    Hammer/Cadence ASIC flow configuration
Systolic_Array_Architecture_Design_Document.md
                                        combined design, verification, and result report

Running RTL Simulations

RTL simulations run in the VCS run directories that Hammer generates. Run simulation commands from the relevant generated simulation directory, for example:

./simv +M=1024 +K=768 +N=768 +MODE=dense +CHECK=checksum

For the rectangular benchmark:

./simv +M=<M> +K=<K> +N=<N> +MODE=<dense|sparse50|sparse75|sparse90> +CHECK=<full|sample|checksum|none>
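
When sweeping many shapes, it can help to build and validate the plusargs in a script before launching the simulator. The sketch below is a hypothetical helper, not part of the repository: `simv_command` is an invented name, and the multiple-of-16 check reflects the benchmark-flow restriction noted later in this README. It only builds the argument list and does not run `simv`.

```python
import shlex

TILE = 16
MODES = ("dense", "sparse50", "sparse75", "sparse90")
CHECKS = ("full", "sample", "checksum", "none")


def simv_command(m, k, n, mode="dense", check="checksum", simv="./simv"):
    """Return the argument list for one rectangular-benchmark run."""
    for name, dim in (("M", m), ("K", k), ("N", n)):
        if dim <= 0 or dim % TILE != 0:
            raise ValueError(f"{name}={dim} must be a positive multiple of {TILE}")
    if mode not in MODES:
        raise ValueError(f"unknown MODE {mode!r}")
    if check not in CHECKS:
        raise ValueError(f"unknown CHECK {check!r}")
    return [simv, f"+M={m}", f"+K={k}", f"+N={n}",
            f"+MODE={mode}", f"+CHECK={check}"]


if __name__ == "__main__":
    # Prints a shell-ready command line for the dense 1024 x 768 x 768 case.
    print(shlex.join(simv_command(1024, 768, 768)))
```

The returned list can be passed to `subprocess.run` with `cwd` set to the generated simulation directory.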

Running the ASIC Flow

Run commands from asic/ after connecting the project configuration to a compatible Hammer/SKY130/Cadence environment. The repository includes the project-specific RTL, testbenches, and Hammer YAML/TCL configuration files, but does not include the external CAD framework or licensed tool setup.

cd asic
make syn
make par

The final project build used the 4.9 ns timing configuration:

make OBJ_DIR=build_runs/sparse_mul_gate_4p9ns \
  INPUT_CFGS="cfg/systolic_array_cfg.yml cfg/systolic_array_src.yml cfg/timing_4p9.yml" \
  syn

make OBJ_DIR=build_runs/sparse_mul_gate_4p9ns \
  INPUT_CFGS="cfg/systolic_array_cfg.yml cfg/systolic_array_src.yml cfg/timing_4p9.yml" \
  par

Notes and Limitations

Sparse mode currently reduces switching activity, not runtime.
Matrix dimensions used by the benchmark flow are multiples of 16.
Larger GEMMs are tiled by the testbench/software loop rather than by an autonomous hardware DMA engine.
CPU benchmark comparisons are against simple C code, not optimized BLAS.
Power comparison is not perfectly apples-to-apples: CPU power is package/PPT measured power, while accelerator power is post-PAR EDA-estimated chip power.
The current floorplan is conservative and not area-optimized.

Future Work

Add true sparse scheduling to skip zero tiles/cycles
Add boundary masking or zero-padding for non-multiple-of-16 shapes
Integrate a real system memory/DMA interface
Optimize the floorplan for smaller die area and shorter routes
Compare against optimized CPU BLAS and GPU GEMM libraries

License

This project is licensed under the MIT License. See LICENSE for details.
