pytorch_block_sparse: BlockSparseLinear and on-the-fly patching for smaller, faster LMs via CUTLASS
AI Impact Summary
The release introduces the pytorch_block_sparse extension, which provides BlockSparseLinear as a drop-in replacement for torch.nn.Linear and a BlockSparseModelPatcher for sparsifying existing models on the fly, enabling smaller and faster language models, especially when combined with distillation and quantization. Performance notes indicate the block-sparse matmul kernels are currently about 2x slower than dense cuBLAS; however, at 75% sparsity only a quarter of the weight blocks remain, so weight memory drops roughly 4x and the net speedup over the equivalent dense layer can still reach ~2x, a meaningful gain for highly sparse configurations. Caveats: sparsity patterns are fixed at initialization, block-sparse layers have no official PyTorch support yet, and the kernels depend on CUTLASS/CUDA (with Tensor Core support anticipated to improve performance further); production adoption should be driven by hardware-specific benchmarks.
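A minimal sketch of the drop-in replacement, following the usage pattern documented in the library's README; the layer dimensions, density value, and tensor shapes here are illustrative, not prescribed by the release:

```python
import torch
from pytorch_block_sparse import BlockSparseLinear

# Dense baseline this replaces:
# fc = torch.nn.Linear(1024, 256)

# density=0.25 keeps 25% of the weight blocks (75% sparsity), cutting
# weight memory roughly 4x; the sparsity pattern is fixed at construction.
fc = BlockSparseLinear(1024, 256, density=0.25)

# The CUTLASS kernels are CUDA-only, so the layer and inputs live on GPU.
fc = fc.cuda()
x = torch.randn(64, 1024, device="cuda")
y = fc(x)  # shape: (64, 256)
```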
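The patcher rewrites selected linear layers of an already-built model instead of requiring code changes. A sketch based on the README's RoBERTa example; the model choice, regexp patterns, and density values are illustrative and must match the parameter names of the model actually being patched:

```python
from transformers import RobertaForMaskedLM
from pytorch_block_sparse import BlockSparseModelPatcher

model = RobertaForMaskedLM.from_pretrained("roberta-base").cuda()
print(f"parameters before patching: {model.num_parameters()}")

mp = BlockSparseModelPatcher()
# Each pattern is a regexp over module names; matched nn.Linear layers are
# replaced in place by BlockSparseLinear at the requested density.
mp.add_pattern(r"roberta\.encoder\.layer\.[0-9]+\.intermediate\.dense", {"density": 0.5})
mp.add_pattern(r"roberta\.encoder\.layer\.[0-9]+\.output\.dense", {"density": 0.5})
mp.patch_model(model)

print(f"parameters after patching: {model.num_parameters()}")
```

Which layers to sparsify, and how densely, is a modeling decision; sparsifying feed-forward blocks while leaving attention projections dense is a common starting point.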
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info