Show HN: Run 30B model in 4GB Active Memory

github.com

4 points by vkkhare a day ago

We have built fused operator kernels for structured contextual sparsity: they skip loading and computing the feed-forward weights whose outputs the activation function would zero out anyway.
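
Roughly, the per-token idea looks like this (a simplified PyTorch sketch, not our fused kernel; the sizes, the linear predictor, and the ReLU are placeholders for illustration):

    import torch
    import torch.nn.functional as F

    hidden, ffn = 2048, 8192                     # illustrative layer sizes
    W_up = torch.randn(ffn, hidden) * 0.02       # up projection (one row per FFN neuron)
    W_down = torch.randn(hidden, ffn) * 0.02     # down projection (one column per FFN neuron)
    router = torch.randn(ffn, hidden) * 0.02     # hypothetical predictor of which neurons fire

    def sparse_mlp(x, keep_ratio=0.5):
        # Predict which neurons survive the activation for this token.
        scores = router @ x                      # (ffn,)
        idx = scores.topk(int(keep_ratio * ffn)).indices
        # Gather and compute only the active rows/columns; the rest are never touched.
        up = F.relu(W_up[idx] @ x)               # (k,)
        return W_down[:, idx] @ up               # (hidden,)

    y = sparse_mlp(torch.randn(hidden))
    print(y.shape)                               # torch.Size([2048])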

The result? We are seeing up to 5x faster MLP layers in transformers and roughly 50% lower memory consumption, by skipping the dormant neurons at every token prediction. For Llama 3.2, the feed-forward layers account for about 30% of total weights and forward-pass computation, which translates into a 1.6-1.8x increase in throughput:

Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (HuggingFace implementation):

- Time to First Token (TTFT): 1.51× faster (1.209 s → 0.803 s)
- Output generation speed: 1.79× faster (0.7 → 1.2 tokens/sec)
- Total throughput: 1.78× faster (0.7 → 1.3 tokens/sec)
- Memory usage: 26.4% reduction (6.125 GB → 4.15 GB)

Find the operator kernels with differential weight caching open-sourced at github.com/NimbleEdge/sparse_transformers. Let's get LLMs sprinting!
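
To give a rough picture of what the differential weight caching is doing (a simplified sketch under my own assumptions about buffer layout and slot reuse, not the actual kernel): consecutive tokens tend to activate largely overlapping neuron sets, so instead of re-gathering all active rows every step, a dense buffer holds the currently active rows and only the neurons whose membership changed get copied in.

    import torch

    ffn, hidden, k = 8192, 2048, 4096
    W_up = torch.randn(ffn, hidden)          # full weights (e.g. memory-mapped, slow to touch)
    cache = torch.empty(k, hidden)           # dense buffer of active rows in fast memory
    cached_idx = torch.full((k,), -1)        # which neuron each cache slot currently holds

    def update_cache(active_idx):
        """Make the cache hold exactly `active_idx`, copying only the rows that changed."""
        new = active_idx[~torch.isin(active_idx, cached_idx)]                   # newly activated
        stale = torch.nonzero(~torch.isin(cached_idx, active_idx)).squeeze(-1)  # reusable slots
        slots = stale[: new.numel()]
        cache[slots] = W_up[new]                                                # only the delta moves
        cached_idx[slots] = new
        return new.numel()                                                      # rows actually copied

    prev = torch.randperm(ffn)[:k]
    update_cache(prev)                                            # first token: full fill
    nxt = torch.cat([prev[:k - 256], torch.randperm(ffn)[:256]])  # mostly-overlapping active set
    print("rows copied:", update_cache(nxt))                      # ~256 instead of 4096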

nrjpoddar a day ago

The github/sparse_transformers link seems to be broken

  • vkkhare 15 hours ago

    Updated the link.