janalsncm 6 days ago

Article is from 2016. It only mentions AdamW at the very end in passing. These days I rarely see much besides AdamW in production.

Messing with optimizers is one of the ways to enter hyperparameter hell: it’s like legacy code but on steroids because changing it only breaks your training code stochastically. Much better to stop worrying and love AdamW.
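
For what it's worth, a minimal sketch of that default, assuming a PyTorch setup (the model, learning rate, and dummy loss below are placeholders rather than recommendations):

    import torch

    model = torch.nn.Linear(128, 10)   # stand-in for a real model
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,               # placeholder learning rate
        betas=(0.9, 0.999),
        weight_decay=0.01,     # decoupled weight decay, the "W" in AdamW
    )

    for step in range(100):
        x = torch.randn(32, 128)
        loss = model(x).pow(2).mean()   # dummy loss just to drive updates
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()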

  • pizza 6 days ago

    Luckily we have Shampoo, SOAP, Modula, Schedule-free variants, and many more being researched these days! I am very, very excited by the heavyball library in particular.

    • 3abiton 5 days ago

      Been out of the loop for a while, anything exciting?

      • pizza 4 days ago

        Read the Modular Norms in Deep Learning paper, and follow the author of the heavyball library on Twitter with notifications enabled.

  • ImageXav 6 days ago

    Something that stuck out to me in the updated blog [0] is that Demon Adam performed much better than even AdamW, with very interesting learning curves. I'm wondering now why it didn't become the standard. Anyone here have insights into this?

    [0] https://johnchenresearch.github.io/demon/

    • gzer0 6 days ago

      Demon Adam didn’t become standard largely for the same reason many “better” optimizers never see wide adoption: it’s a newer tweak, not clearly superior on every problem, is less familiar to most engineers, and isn’t always bundled in major frameworks. By contrast, AdamW is now the “safe default” that nearly everyone supports and knows how to tune, so teams stick with it unless they have a strong reason not to.

      Edit: Demon involves decaying the momentum parameter over time, which introduces a new schedule or formula for how momentum should be reduced during training. That can feel like additional complexity or a potential hyperparameter rabbit hole. Teams trying to ship products quickly often avoid adding new hyperparameters unless the gains are decisive.
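
      As a rough illustration (my own sketch, not code from the paper): Demon recomputes the momentum coefficient from the fraction of training completed and hands it to the optimizer at every step. The formula is my reading of the Demon decay rule; total_steps and the update loop are hypothetical.

          def demon_beta(step, total_steps, beta_init=0.9):
              # Demon: decay beta so that beta / (1 - beta) falls linearly to zero
              frac = 1.0 - step / total_steps
              return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)

          # e.g. with a torch.optim.Adam instance, refresh beta1 before each step:
          # for group in optimizer.param_groups:
          #     group["betas"] = (demon_beta(step, total_steps), group["betas"][1])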

sega_sai 6 days ago

Interesting, but it does not seem to be an overview of gradient optimisers in general, but rather of gradient optimisers in ML, as I see no mention of BFGS and the like.

  • VHRanger 6 days ago

    I'm also curious about gradient-less algorithms

    For non-deep-learning applications, Nelder-Mead saved my butt a few times

    • analog31 6 days ago

      It's with the utmost humility that I confess to falling back on "just use Nelder-Mead" in *scipy.optimize* when something is ill-behaved. I consider it to be a sign that I'm doing something wrong, but I certainly respect its use.

    • imurray 6 days ago

      Nelder–Mead has often not worked well for me in moderate to high dimensions. I'd recommend trying Powell's method if you want to quickly converge to a local optimum. If you're using scipy's wrappers, it's easy to swap between the two:

      https://docs.scipy.org/doc/scipy/reference/optimize.html#loc...
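
      For instance, a minimal sketch using scipy's built-in Rosenbrock test function (the starting point is arbitrary):

          from scipy.optimize import minimize, rosen

          x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
          res_nm = minimize(rosen, x0, method="Nelder-Mead")
          res_pw = minimize(rosen, x0, method="Powell")
          print(res_nm.nfev, res_pw.nfev)   # function evaluations each method needed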

      For nastier optimization problems there are lots of other options, including evolutionary algorithms and Bayesian optimization:

      https://facebookresearch.github.io/nevergrad/

      https://github.com/facebook/Ax

    • amelius 6 days ago

      ChatGPT also advised me to use NM a couple of times, which was neat.

    • woadwarrior01 6 days ago

      Look into zeroth-order optimizers and CMA-ES.

  • mike-the-mikado 6 days ago

    I think the big difference is dimensionality. If the dimensionality is low, then taking account of the 2nd derivatives becomes practical and worthwhile.

    • juliangoldsmith 6 days ago

      What is it that makes higher order derivatives less useful at high dimensionality? Is it related to the Curse of Dimensionality, or maybe something like exploding gradients at higher orders?

      • mike-the-mikado 5 days ago

        In n dimensions, the first derivative is an n-element vector. The second derivative is an n x n (symmetric) matrix. As n grows, the computation required to estimate that matrix grows at least as n^2, and the computation needed to use it grows faster still (a dense solve is O(n^3)).

        In practice, clever optimisation algorithms that use the 2nd derivative won't actually form this matrix.
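
        A toy illustration of the scaling (my own sketch; the sizes are made up):

            import numpy as np

            n = 2_000                            # toy dimensionality
            g = np.random.randn(n)               # first derivative: n numbers
            H = np.random.randn(n, n)
            H = H @ H.T + n * np.eye(n)          # stand-in symmetric positive-definite "Hessian"
            newton_step = np.linalg.solve(H, g)  # O(n^2) memory, O(n^3) work

            # At n ~ 1e8 parameters, H alone would have ~1e16 entries, which is why
            # methods like L-BFGS keep a low-rank, implicit approximation rather than
            # ever forming the matrix.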

ipunchghosts 6 days ago

Example of the bitter lesson. None of these nuances matter 8 years later, when everyone uses SGD or AdamW.

sk11001 6 days ago

It's a great summary for ML interview prep.

  • janalsncm 6 days ago

    I disagree, it is old and most of those algorithms aren’t used anymore.

    • sk11001 6 days ago

      That’s how interviews go, though. It’s not like I’ve ever had to use Bayes’ rule at work, but for a few years everyone loved asking about it in screening rounds.

      • mike-the-mikado 6 days ago

        In my experience a lot of people "know" maths, but fail to recognise the opportunities to use it. Some of my colleagues were pleased when I showed them that their ad hoc algorithm was equivalent to an application of Bayes' rule. It gave them insights into the meaning of constants that had formerly been chosen by trial and error.

      • janalsncm 6 days ago

        Everyone’s experience is different but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post 2020, though.

        Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.

        • yobbo 6 days ago

          For example, if it is meaningful to use large batch sizes, the gradient variance will be lower and Adam can end up roughly equivalent to plain momentum.

          As a model is trained, the gradient variance typically falls.

          Those optimizers all work to reduce the variance of the updates in various ways.
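
          A rough way to see the Adam-vs-momentum point (my own sketch, not from the article): Adam divides the momentum term elementwise by a running estimate of the gradient's second moment, and if that estimate ends up roughly uniform across coordinates the division is just a global rescaling, i.e. the update direction matches plain momentum.

              import numpy as np

              def adam_like_step(m, v, lr=1e-3, eps=1e-8):
                  # Adam direction: first moment scaled per-coordinate by sqrt(second moment)
                  return lr * m / (np.sqrt(v) + eps)

              def momentum_step(m, lr=1e-3):
                  return lr * m

              m = np.random.randn(5)                  # running mean of gradients
              v_uniform = np.full(5, 0.04)            # second moments all at the same scale
              v_skewed = np.array([1e-6, 1e-2, 1.0, 1e-2, 1e-6])

              print(adam_like_step(m, v_uniform) / momentum_step(m))  # constant ratio: same direction
              print(adam_like_step(m, v_skewed) / momentum_step(m))   # ratio varies: direction changes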

      • esafak 6 days ago

        I'd still expect an MLE to know it though.

        • janalsncm 6 days ago

          Why would you? Implementing optimizers isn’t something that MLEs do. Even the DeepSeek team just uses AdamW.

          An MLE should be able to look up and understand the differences between optimizers, but memorizing that information is extremely low priority compared with other things they might be asked about.