heywoodlh 12 hours ago

Ah, this is awesome! I currently run k3s on a decently spec'd NixOS rig. I tried getting k3s to recognize my Nvidia GPU but was unsuccessful. I even followed the short guide in nixpkgs for getting GPU support working in k3s[0], but without success.

For now I’m just using Docker’s Nvidia container runtime for containers that need GPU acceleration.

Will likely spend more time digging into your findings — hoping it results in me finding a solution to my setup!

[0] https://github.com/NixOS/nixpkgs/blob/master/pkgs/applicatio...

  • fangpenlin 12 hours ago

    There's a bug in k8s-device-plugin that stops the plugin from even launching, as I mentioned in the article:

    https://github.com/NVIDIA/k8s-device-plugin/issues/1182

    And I opened a PR for fixing that here:

    https://github.com/NVIDIA/k8s-device-plugin/pull/1183

    I am unsure if this bug is specific to the NixOS environment, since its library paths and other quirks differ from those of major Linux distros.

    Another major problem was that the "default_runtime_name" in the containerd config didn't work as expected. I had to create a RuntimeClass and assign it to the pod to make it pick up the Nvidia runtime.
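    For anyone trying the same workaround, a minimal sketch of what that looks like (the handler name "nvidia" and the image are assumptions here; the handler must match whatever runtime name is registered in your containerd config):

    ```yaml
    # Hypothetical example: a RuntimeClass whose handler matches the
    # Nvidia runtime registered in containerd, plus a pod that opts in
    # via runtimeClassName instead of relying on default_runtime_name.
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: nvidia
    handler: nvidia
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      runtimeClassName: nvidia
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
    ```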

    Other than that, I haven't tried k3s; what I'm running is a full-blown K8s cluster. I'd guess they are similar.

    While there's no guarantee, if you find any hints showing why your Nvidia plugin won't work, I might be able to help, since I skipped over some minor issues I encountered in the article. If yours happen to be ones I faced, I can share how I solved them.

    • fangpenlin 12 hours ago

      By the way, one of the problems I encountered but didn't mention in the article was that libnvidia-container has problems reading the Nvidia drivers and libraries under NixOS because of its non-FHS paths. I had to create a patch to modify those paths. I just created a Gist here with the patch content:

      https://gist.github.com/fangpenlin/1cc6e80b4a03f07b79412366b...

      But later on, since I am taking the CDI route, it appears that libnvidia-container (nvidia-container-cli) is not actually used. If you go with the plain container runtime approach instead of CDI, you may need a patch like this for the libnvidia-container package.

      • heywoodlh 12 hours ago

        Oooo, thanks for the pointers! Will be revisiting this tomorrow!

colordrops 18 hours ago

This looks fun. The author mentions machine learning workloads. What are typical machine learning use cases for a cluster of lower end GPUs?

While on that topic, why must large model inferencing be done on a single large GPU and/or bank of memory rather than a cluster of them? Is there promise of being able to eventually run large models on clusters of weaker GPUs?

  • thangngoc89 15 hours ago

    The bottleneck for distributed GPU training/inference is the inter-GPU connection speed. Within a single node it's doable because the GPUs communicate over PCIe 4.0. Across a cluster, you need at least a 50Gbps connection between nodes, which is expensive relative to cheap GPUs.

    • fangpenlin 13 hours ago

      For training, yes, you need to share the parameters (i.e., weights and biases), and that number is huge. But for inference, you don't need nearly as much bandwidth to run it in a distributed manner.

      According to the author of Exo https://blog.exolabs.net/day-1/:

      > When Shard A finishes processing its layers, it produces an activation that gets passed to Shard B over whatever network connection is available. In general these activations are actually quite small - for Llama 3.2 3B they are less than 4KB. They scale approximately linearly with the size of the layers. Therefore the bottleneck here is generally the latency between devices, not the bandwidth (a common misconception).

      I think that makes sense, because the activations are just the numbers coming out of the whole neural network (or part of it). Compared to the number of parameters, they are not on the same order of magnitude.
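      A back-of-the-envelope sketch of that magnitude gap, with assumed numbers (a 3072 hidden dimension for Llama 3.2 3B and 16-bit values; the exact activation size depends on dtype and any quantization, so this is illustrative, not Exo's actual implementation):

      ```python
      # Rough comparison: what crosses the network per generated token is
      # one hidden-state vector, while the parameters stay on each shard.
      HIDDEN_SIZE = 3072            # assumed hidden dim for Llama 3.2 3B
      NUM_PARAMS = 3_000_000_000    # ~3B parameters
      BYTES_PER_VALUE = 2           # 16-bit values (assumption)

      # Activation handed from shard A to shard B for one token:
      activation_bytes = HIDDEN_SIZE * BYTES_PER_VALUE   # a few KB

      # Total parameter storage, the thing you'd have to synchronize
      # between nodes if you were training instead of inferencing:
      param_bytes = NUM_PARAMS * BYTES_PER_VALUE         # several GB

      print(f"activation per token: {activation_bytes / 1024:.1f} KB")
      print(f"parameters: {param_bytes / 1e9:.1f} GB")
      print(f"ratio: {param_bytes // activation_bytes:,}x")
      ```

      So the per-step traffic is roughly a million times smaller than the model itself, which is why latency, not bandwidth, dominates.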

  • fangpenlin 17 hours ago

    You can check Exo out:

    https://github.com/exo-explore/exo

    It's a project designed to run large models in a distributed manner. My need for GPUs is to run my own machine learning research pet project (mostly evolutionary neural network models for now), which is a bit different from inference needs. Training is yet another story.

    But yeah, I agree. I think machine learning should become more distributed in the future.

    • colordrops 15 hours ago

      Exo looks awesome, exactly what I had in mind, thank you.

imcritic 18 hours ago

All the links are styled in an unreadable form on that page.

  • fangpenlin 17 hours ago

    Hi, I'm the author. I usually link the new terms mentioned in the article so that interested people can click through and learn more. I have changed the link color from blue to dark gray. Does that help? Or are there just too many links in general?

    • RestartKernel 12 hours ago

      If possible, you shouldn't rely on colour alone for important cues. You could try underlining them, since that's often the expected behaviour for URLs anyhow.

  • ajdude 17 hours ago

    Several of them seem to be 404ing for me

    • fangpenlin 17 hours ago

      The broken links should be fixed now. Sorry about that.