pytorch_block_sparse: BlockSparseLinear enables block-sparse linear layers for smaller and faster models
AI Impact Summary
The pytorch_block_sparse extension introduces BlockSparseLinear as a drop-in replacement for torch.nn.Linear, enabling block-sparse matrices in PyTorch models to reduce model size and improve speed, particularly when combined with distillation and quantization. It leverages CUDA/C++ templates via CUTLASS to approach cuBLAS-level kernels for block-sparse operations; current performance is reported as roughly 2x slower than dense cuBLAS, with up to 4x memory savings at 75% sparsity. Sparsity patterns are fixed at initialization, and future work aims to optimize those patterns and exploit Ampere Tensor Cores. A model patcher tool (BlockSparseModelPatcher) allows on-the-fly modification of existing models without source changes, enabling production-ready deployment where the performance trade-off is acceptable; note that official PyTorch support for block sparsity is still lacking, making this a community-driven option.
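A minimal sketch of the two entry points described above, following the usage shown in the project README; the layer sizes, density values, and the RoBERTa example are illustrative assumptions, and the kernels require a CUDA GPU:

```python
import torch
from pytorch_block_sparse import BlockSparseLinear, BlockSparseModelPatcher

# Drop-in replacement for torch.nn.Linear: a 1024 -> 256 layer keeping
# only 25% of its weight blocks (75% sparsity).
layer = BlockSparseLinear(1024, 256, density=0.25)
out = layer(torch.randn(8, 1024, device="cuda"))  # kernels are CUDA-only

# On-the-fly patching of an existing model, without source changes:
# matching nn.Linear submodules are replaced by block-sparse layers.
from transformers import RobertaForMaskedLM  # assumes transformers is installed

model = RobertaForMaskedLM.from_pretrained("roberta-base").cuda()
patcher = BlockSparseModelPatcher()
patcher.add_pattern(
    r"roberta\.encoder\.layer\.[0-9]+\.intermediate\.dense",
    {"density": 0.5},  # keep half of the weight blocks in each matched layer
)
patcher.patch_model(model)
```

The patcher selects submodules by regex over their qualified names, so sparsity can be applied selectively (e.g. only to feed-forward layers) while the rest of the model stays dense.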
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info