heywoodlh 12 hours ago

Ah, this is awesome! I currently run k3s on a decently spec'd NixOS rig. I tried getting k3s to recognize my Nvidia GPU but was unsuccessful. I even followed the short guide in nixpkgs for getting GPU support working in k3s[0], but without success.

For now I’m just using Docker’s Nvidia container runtime for containers that need GPU acceleration.

Will likely spend more time digging into your findings — hoping it results in me finding a solution to my setup!

[0] https://github.com/NixOS/nixpkgs/blob/master/pkgs/applicatio...

  • fangpenlin 12 hours ago

    There's a bug in k8s-device-plugin that stops the plugin from even launching, as I mentioned in the article:

    https://github.com/NVIDIA/k8s-device-plugin/issues/1182

    And I opened a PR for fixing that here:

    https://github.com/NVIDIA/k8s-device-plugin/pull/1183

    I am unsure if this bug is specific to the NixOS environment, since its library paths and other quirks differ from those of major Linux distros.

    Another major problem was that the "default_runtime_name" in the containerd config didn't work as expected. I had to create a RuntimeClass and assign it to the pod to make it pick up the Nvidia runtime.
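    For anyone trying the same workaround, a minimal sketch of what that looks like (the handler name "nvidia" and the image are assumptions here; the handler must match whatever runtime name is registered in your containerd config):

    ```yaml
    # Hypothetical example: a RuntimeClass whose handler matches the
    # Nvidia runtime registered in containerd, plus a pod that opts in
    # via runtimeClassName instead of relying on default_runtime_name.
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: nvidia
    handler: nvidia
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      runtimeClassName: nvidia
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
    ```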

    Other than that, I haven't tried k3s; what I'm running is a full-blown K8s cluster. I'd guess they are similar.

    While there's no guarantee, if you find any hints showing why your Nvidia plugin won't work, I might be able to help, since I skipped over some minor issues I encountered in the article. If yours happen to be ones I faced, I can share how I solved them.

    • fangpenlin 12 hours ago

      By the way, one of the problems I encountered but didn't mention in the article was that libnvidia-container has problems reading the Nvidia drivers and libraries under NixOS because of its non-FHS paths. I had to create a patch to modify those paths. I just created a Gist here with the patch content:

      https://gist.github.com/fangpenlin/1cc6e80b4a03f07b79412366b...

      But later on, since I am taking the CDI route, it appears that libnvidia-container (nvidia-container-cli) is not actually used. If you go with the plain container runtime approach instead of CDI, you may need a patch like this for the libnvidia-container package.

      • heywoodlh 12 hours ago

        Oooo, thanks for the pointers! Will be revisiting this tomorrow!

colordrops 18 hours ago

This looks fun. The author mentions machine learning workloads. What are typical machine learning use cases for a cluster of lower end GPUs?

While on that topic, why must large model inferencing be done on a single large GPU and/or bank of memory rather than a cluster of them? Is there promise of being able to eventually run large models on clusters of weaker GPUs?

  • thangngoc89 15 hours ago

    The bottleneck for distributed GPU training/inference is the inter-GPU connection speed. Within a single node it's doable because the GPUs communicate over PCIe 4.0. Across a cluster, you need at least a 50Gbps connection between nodes, which is expensive relative to cheap GPUs.

    • fangpenlin 13 hours ago

      For training, yes, you need to share the parameters (i.e., weights and biases), and that number is huge. But for inference, you don't need nearly as much bandwidth to run it in a distributed manner.

      According to the author of Exo https://blog.exolabs.net/day-1/:

      > When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).

      I think that makes sense, because the activations are just the numbers coming out of the whole neural network (or part of it). Compared to the number of parameters, they are not on the same order of magnitude.
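      A back-of-the-envelope sketch of that magnitude gap, with assumed numbers (a 3072 hidden dimension for Llama 3.2 3B and 16-bit values; the exact activation size depends on dtype and any quantization, so this is illustrative, not Exo's actual implementation):

      ```python
      # Rough comparison: what crosses the network per generated token is
      # one hidden-state vector, while the parameters stay on each shard.
      HIDDEN_SIZE = 3072            # assumed hidden dim for Llama 3.2 3B
      NUM_PARAMS = 3_000_000_000    # ~3B parameters
      BYTES_PER_VALUE = 2           # 16-bit values (assumption)

      # Activation handed from shard A to shard B for one token:
      activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE   # a few KB

      # Total parameter storage, the thing you'd have to synchronize
      # between nodes if you were training instead of inferencing:
      param_bytes = NUM_PARAMS * BYTES_PER_VALUE         # several GB

      print(f"activation per token: {activation_bytes / 1024:.1f} KB")
      print(f"parameters: {param_bytes / 1e9:.1f} GB")
      print(f"ratio: {param_bytes // activation_bytes:,}x")
      ```

      So the per-step traffic is roughly a million times smaller than the model itself, which is why latency, not bandwidth, dominates.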

  • fangpenlin 17 hours ago

    You can check Exo out:

    https://github.com/exo-explore/exo

    It's a project designed to run large models in a distributed manner. My need for GPUs is to run my own machine learning research pet project (mostly evolutionary neural network models for now), which is a bit different from inference needs. Training is yet another story.

    But yeah, I agree. I think machine learning should become more distributed in the future.

    • colordrops 15 hours ago

      Exo looks awesome, exactly what I had in mind, thank you.

imcritic 18 hours ago

All the links are styled in an unreadable form on that page.

  • fangpenlin 17 hours ago

    Hi, I'm the author. I usually link the new terms mentioned in the article so that interested people can click through and learn more. I have changed the link color from blue to dark gray. Does that help? Or are there just too many links in general?

    • RestartKernel 12 hours ago

      If possible, you shouldn't rely on colour alone for important cues. You could try underlining them, since that's often the expected behaviour for URLs anyhow.

  • ajdude 17 hours ago

    Several of them seem to be 404ing for me

    • fangpenlin 17 hours ago

      The broken links should be fixed now. Sorry about that.