janalsncm 6 days ago

Article is from 2016. It only mentions AdamW at the very end in passing. These days I rarely see much besides AdamW in production.

Messing with optimizers is one of the ways to enter hyperparameter hell: it’s like legacy code but on steroids because changing it only breaks your training code stochastically. Much better to stop worrying and love AdamW.
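
For what it's worth, a minimal sketch of that default, assuming a PyTorch setup (the model, learning rate, and dummy loss below are placeholders rather than recommendations):

    import torch

    model = torch.nn.Linear(128, 10)   # stand-in for a real model
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,               # placeholder learning rate
        betas=(0.9, 0.999),
        weight_decay=0.01,     # decoupled weight decay, the "W" in AdamW
    )

    for step in range(100):
        x = torch.randn(32, 128)
        loss = model(x).pow(2).mean()   # dummy loss just to drive updates
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()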

  • pizza 6 days ago

    Luckily we have Shampoo, SOAP, Modula, Schedule-free variants, and many more being researched these days! I am very, very excited by the heavyball library in particular.

    • 3abiton 5 days ago

      Been out of the loop for a while, anything exciting?

      • pizza 4 days ago

        Read the Modular Norms in Deep Learning paper, and follow the author of the heavyball library on Twitter with notifications enabled.

  • ImageXav 6 days ago

    Something that stuck out to me in the updated blog [0] is that Demon Adam performed much better than even AdamW, with very interesting learning curves. I'm wondering now why it didn't become the standard. Anyone here have insights into this?

    [0] https://johnchenresearch.github.io/demon/

    • gzer0 6 days ago

      Demon Adam didn’t become standard largely for the same reason many “better” optimizers never see wide adoption: it’s a newer tweak, not clearly superior on every problem, is less familiar to most engineers, and isn’t always bundled in major frameworks. By contrast, AdamW is now the “safe default” that nearly everyone supports and knows how to tune, so teams stick with it unless they have a strong reason not to.

      Edit: Demon involves decaying the momentum parameter over time, which introduces a new schedule or formula for how momentum should be reduced during training. That can feel like additional complexity or a potential hyperparameter rabbit hole. Teams trying to ship products quickly often avoid adding new hyperparameters unless the gains are decisive.
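
      As a rough illustration (my own sketch, not code from the paper): Demon recomputes the momentum coefficient from the fraction of training completed and hands it to the optimizer at every step. The formula is my reading of the Demon decay rule; total_steps and the update loop are hypothetical.

          def demon_beta(step, total_steps, beta_init=0.9):
              # Demon: decay beta so that beta / (1 - beta) falls linearly to zero
              frac = 1.0 - step / total_steps
              return beta_init * frac / ((1.0 - beta_init) + beta_init * frac)

          # e.g. with a torch.optim.Adam instance, refresh beta1 before each step:
          # for group in optimizer.param_groups:
          #     group["betas"] = (demon_beta(step, total_steps), group["betas"][1])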

sega_sai 6 days ago

Interesting, but it does not seem to be an overview of gradient optimisers in general, but rather of gradient optimisers in ML, as I see no mention of BFGS and the like.

  • VHRanger 6 days ago

    I'm also curious about gradient-less algorithms

    For non-deep-learning applications, Nelder-Mead saved my butt a few times

    • analog31 6 days ago

      It's with the utmost humility that I confess to falling back on "just use Nelder-Mead" in *scipy.optimize* when something is ill-behaved. I consider it to be a sign that I'm doing something wrong, but I certainly respect its use.

    • imurray 6 days ago

      Nelder–Mead has often not worked well for me in moderate to high dimensions. I'd recommend trying Powell's method if you want to quickly converge to a local optimum. If you're using scipy's wrappers, it's easy to swap between the two:

      https://docs.scipy.org/doc/scipy/reference/optimize.html#loc...
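
      For instance, a minimal sketch using scipy's built-in Rosenbrock test function (the starting point is arbitrary):

          from scipy.optimize import minimize, rosen

          x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
          res_nm = minimize(rosen, x0, method="Nelder-Mead")
          res_pw = minimize(rosen, x0, method="Powell")
          print(res_nm.nfev, res_pw.nfev)   # function evaluations each method needed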

      For nastier optimization problems there are lots of other options, including evolutionary algorithms and Bayesian optimization:

      https://facebookresearch.github.io/nevergrad/

      https://github.com/facebook/Ax

    • amelius 6 days ago

      ChatGPT also advised me to use NM a couple of times, which was neat.

    • woadwarrior01 6 days ago

      Look into zeroth-order optimizers and CMA-ES.

  • mike-the-mikado 6 days ago

    I think the big difference is dimensionality. If the dimensionality is low, then taking account of the 2nd derivatives becomes practical and worthwhile.

    • juliangoldsmith 6 days ago

      What is it that makes higher order derivatives less useful at high dimensionality? Is it related to the Curse of Dimensionality, or maybe something like exploding gradients at higher orders?

      • mike-the-mikado 5 days ago

        In n dimensions, the first derivative is an n-element vector. The second derivative is an n x n (symmetric) matrix. As n grows, the computation required to estimate that matrix grows at least as n^2, and the computation needed to use it grows faster still (a dense solve is O(n^3)).

        In practice, clever optimisation algorithms that use the 2nd derivative won't actually form this matrix.
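
        A toy illustration of the scaling (my own sketch; the sizes are made up):

            import numpy as np

            n = 2_000                            # toy dimensionality
            g = np.random.randn(n)               # first derivative: n numbers
            H = np.random.randn(n, n)
            H = H @ H.T + n * np.eye(n)          # stand-in symmetric positive-definite "Hessian"
            newton_step = np.linalg.solve(H, g)  # O(n^2) memory, O(n^3) work

            # At n ~ 1e8 parameters, H alone would have ~1e16 entries, which is why
            # methods like L-BFGS keep a low-rank, implicit approximation rather than
            # ever forming the matrix.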

ipunchghosts 6 days ago

Example of the bitter lesson. None of these nuances matter 8 years later, when everyone uses SGD or AdamW.

sk11001 6 days ago

It's a great summary for ML interview prep.

  • janalsncm 6 days ago

    I disagree, it is old and most of those algorithms aren’t used anymore.

    • sk11001 6 days ago

      That’s how interviews go, though. It’s not like I’ve ever had to use Bayes’ rule at work, but for a few years everyone loved asking about it in screening rounds.

      • mike-the-mikado 6 days ago

        In my experience a lot of people "know" maths, but fail to recognise the opportunities to use it. Some of my colleagues were pleased when I showed them that their ad hoc algorithm was equivalent to an application of Bayes' rule. It gave them insights into the meaning of constants that had formerly been chosen by trial and error.

      • janalsncm 6 days ago

        Everyone’s experience is different but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post 2020, though.

        Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.

        • yobbo 6 days ago

          For example, if it is meaningful to use large batch sizes, the gradient variance will be lower and Adam can end up roughly equivalent to plain momentum.

          As a model is trained, the gradient variance typically falls.

          Those optimizers all work to reduce the variance of the updates in various ways.
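
          A rough way to see the Adam-vs-momentum point (my own sketch, not from the article): Adam divides the momentum term elementwise by a running estimate of the gradient's second moment, and if that estimate ends up roughly uniform across coordinates the division is just a global rescaling, i.e. the update direction matches plain momentum.

              import numpy as np

              def adam_like_step(m, v, lr=1e-3, eps=1e-8):
                  # Adam direction: first moment scaled per-coordinate by sqrt(second moment)
                  return lr * m / (np.sqrt(v) + eps)

              def momentum_step(m, lr=1e-3):
                  return lr * m

              m = np.random.randn(5)                  # running mean of gradients
              v_uniform = np.full(5, 0.04)            # second moments all at the same scale
              v_skewed = np.array([1e-6, 1e-2, 1.0, 1e-2, 1e-6])

              print(adam_like_step(m, v_uniform) / momentum_step(m))  # constant ratio: same direction
              print(adam_like_step(m, v_skewed) / momentum_step(m))   # ratio varies: direction changes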

      • esafak 6 days ago

        I'd still expect an MLE to know it though.

        • janalsncm 6 days ago

          Why would you? Implementing optimizers isn’t something that MLEs do. Even the DeepSeek team just uses AdamW.

          An MLE should be able to look up and understand the differences between optimizers, but memorizing that information is extremely low priority compared with other things they might be asked about.