Papers Read on AI

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

2022-09-20

Adaptive gradient algorithms [1–4] borrow the moving average idea of heavy ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy ball acceleration in theory [5] and also in many empirical cases [6], is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-3.5})$ stochastic gradient complexity on nonconvex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on both CNNs and transformers, and sets new SoTAs for many popular networks and frameworks, e.g., ResNet [7], ConvNext [8], ViT [9], Swin [10], MAE [11], LSTM [12], TransformerXL [13] and BERT [14]. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable...

2022: Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, Shuicheng Yan
https://arxiv.org/pdf/2208.06677v2.pdf
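The abstract describes the update only in words, so here is a minimal sketch of what one Adan-style step might look like. It follows one reading of the update rule in the linked paper and is not the authors' reference implementation. The function names (adan_step, quad_loss, quad_grad, run_demo), the toy quadratic objective, and all hyperparameter values are assumptions made for illustration. The key point it tries to show is that the Nesterov-style correction is built from the difference between the current and previous gradients, so no extra gradient evaluation at an extrapolation point is needed.

```python
import math


def adan_step(theta, grad, prev_grad, m, v, n,
              lr=0.1, beta1=0.02, beta2=0.08, beta3=0.01,
              eps=1e-8, weight_decay=0.0):
    """One Adan-style update on plain Python lists (illustrative sketch).

    m: moving average of gradients (first moment)
    v: moving average of gradient differences (Nesterov-style correction)
    n: moving average of the squared corrected gradient (second moment)
    All default hyperparameters are assumptions, not tuned values.
    """
    new_theta, new_m, new_v, new_n = [], [], [], []
    for t, g, pg, mi, vi, ni in zip(theta, grad, prev_grad, m, v, n):
        diff = g - pg                                # g_k - g_{k-1}
        mi = (1 - beta1) * mi + beta1 * g            # first moment
        vi = (1 - beta2) * vi + beta2 * diff         # difference moment
        u = g + (1 - beta2) * diff                   # corrected gradient
        ni = (1 - beta3) * ni + beta3 * u * u        # second moment
        step = lr / (math.sqrt(ni) + eps)            # per-coordinate step size
        t = (t - step * (mi + (1 - beta2) * vi)) / (1 + weight_decay * lr)
        new_theta.append(t)
        new_m.append(mi)
        new_v.append(vi)
        new_n.append(ni)
    return new_theta, new_m, new_v, new_n


def quad_loss(theta):
    """Toy objective f(x) = sum(x_i^2), used only for this demo."""
    return sum(x * x for x in theta)


def quad_grad(theta):
    """Gradient of the toy objective."""
    return [2.0 * x for x in theta]


def run_demo(steps=300):
    """Run the sketch on the toy objective; return (initial, final) loss."""
    theta = [3.0, -2.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    n = [0.0, 0.0]
    prev_grad = quad_grad(theta)  # makes the first gradient difference zero
    initial = quad_loss(theta)
    for _ in range(steps):
        grad = quad_grad(theta)
        theta, m, v, n = adan_step(theta, grad, prev_grad, m, v, n)
        prev_grad = grad
    return initial, quad_loss(theta)


if __name__ == "__main__":
    start, end = run_demo()
    print(f"loss before: {start:.4f}, loss after: {end:.6f}")
```

The only extra state compared with an Adam-style optimizer is the previous gradient and the difference average v, which is consistent with the abstract's claim that NME avoids the overhead of evaluating the gradient at an extrapolation point. For real training, the authors' released code should be preferred over this sketch.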
