slacka 5 hours ago

Very interesting model. Some key points from the blog:

* NVIDIA is also releasing most of the data they used to create it, including the pretraining corpus

* The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL.
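To make the mix concrete, here's a toy sketch of what such a stack looks like. The depth and the attention positions below are placeholders I made up for illustration, not the layout from the Nemotron-H report:

    # Hypothetical hybrid decoder stack: mostly Mamba-2 mixer blocks and MLP
    # blocks, with attention swapped in at a handful of positions.
    # All numbers here are illustrative placeholders, not the published config.
    NUM_BLOCKS = 56
    ATTENTION_AT = {13, 27, 41, 55}      # 4 attention blocks (assumed positions)

    stack = []
    for i in range(NUM_BLOCKS):
        if i in ATTENTION_AT:
            stack.append("attention")    # the only blocks that grow a KV cache
        elif i % 2 == 0:
            stack.append("mamba2")       # fixed-size recurrent state per block
        else:
            stack.append("mlp")          # stateless feed-forward

    print({t: stack.count(t) for t in ("mamba2", "mlp", "attention")})
    # -> {'mamba2': 28, 'mlp': 24, 'attention': 4}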

At this size, and with only 4 attention layers, it should run very fast locally on cheap 12GB GPUs: the Mamba-2 layers keep a fixed-size recurrent state instead of a KV cache, so even at long context the cache only grows for those 4 layers.
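Rough arithmetic on why the KV cache stays small. The head count, head dimension, and context length below are assumptions for illustration, not the model's published config; the only point is the 4-vs-32 layer ratio:

    # Back-of-envelope KV-cache size: 2 tensors (K and V) per attention layer,
    # each num_kv_heads * head_dim values per token, stored in fp16 (2 bytes).
    def kv_cache_gib(attn_layers, seq_len, num_kv_heads=8, head_dim=128, bytes_per=2):
        per_token = 2 * attn_layers * num_kv_heads * head_dim * bytes_per
        return per_token * seq_len / 2**30

    # Assumed dims (8 KV heads, head_dim 128) at an assumed 128k context.
    for layers in (4, 32):
        print(f"{layers} attention layers, 128k ctx: "
              f"{kv_cache_gib(layers, 128_000):.1f} GiB")
    # -> 4 attention layers, 128k ctx: 2.0 GiB
    # -> 32 attention layers, 128k ctx: 15.6 GiB

With only 4 attention layers the cache is ~8x smaller than an equivalent all-attention stack, which is what leaves headroom for the weights on a 12GB card.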