
AutoSGM: A Unifying Framework for Accelerated Learning

The Automatic (Stochastic) Gradient Method (SGM) is a framework for stochastic gradient learning that unifies the accelerated methods used in deep learning: Polyak's Heavy Ball (PHB), Nesterov's Accelerated Gradient (NAG), and Adaptive Moment Estimation (Adam).

Learning is seen as an interconnection between a gradient-generating system, such as an artificial neural network (a well-defined differentiable function), and the SGM learning system or control function.

This view suggests that there is only one (stochastic) gradient method (SGM), with different approaches or metrics for setting up the learning rate $\alpha_t$, smoothing the gradient $\mathrm{g}_t$, and smoothing the gradient-generating system's parameters $\mathrm{w}_t$ via various lowpass filter implementations $\mathbb{E}_{t,\beta}\{\cdot\}$, where $0 \le \beta < 1$. The result is the different momentum-based SGD variants in the literature.

This repository contains implementation(s) of AutoSGM: ${\rm w}_t = \mathcal{C}\bigl( {{\rm g}_t} \bigr)$

The expected input $\mathrm{g}_t$ is a first-order gradient, and the output $\mathrm{w}_t$ is an estimate of each parameter in an (artificial) neural network.

$$\begin{align} \mathrm{g}_t \leftarrow \mathbb{E}_{t,\beta_i}\{ \mathrm{g}_t \},\quad{\rm w}_t \leftarrow \mathbb{I}_{t, \alpha_t}\{ {\rm g_t} \},\quad {\rm w}_t \leftarrow \mathbb{E}_{t,\beta_o}\{{\rm w}_t\} \end{align}$$
  • a time-integration component $\mathbb{I}_{t, \alpha_t}$, controlled by a proportional learning-rate parameter $\alpha_t$.
  • a lowpass smoothing component $\mathbb{E}_{t, \beta}$ regularizing the gradient-generating system, with a lowpass parameter $\beta$: $\beta := \beta_i$ at the input and $\beta := \beta_o$ at the output.
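As a rough illustration only (not the repository's implementation; the function names, variable names, and constants below are hypothetical), one AutoSGM iteration on a single parameter tensor could chain these three stages as follows:

import torch

def lowpass(state: torch.Tensor, x: torch.Tensor, beta: float) -> torch.Tensor:
    # First-order lowpass filter E_{t,beta}: state <- beta*state + (1 - beta)*x
    return beta * state + (1.0 - beta) * x

def autosgm_step(w, g, g_state, lr=1e-4, beta_i=0.9, beta_o=0.1):
    g_state = lowpass(g_state, g, beta_i)  # smooth the gradient at the input
    w_raw = w - lr * g_state               # time-integration, scaled by alpha_t
    w = lowpass(w, w_raw, beta_o)          # smooth the parameter at the output
    return w, g_state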

It explains the acceleration observed in the SGM as a consequence of lowpass smoothing. The framework admits many implementations, as seen in the deep learning literature, and makes sense of the many variants in use today.

It also makes it possible to derive an optimal choice of learning rate. Adam can be seen as one approximation of this optimal choice (via normalized gradients).
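For instance, under an Adam-style normalization (a sketch under that assumption, ignoring bias correction and momentum for brevity), the effective per-parameter learning rate becomes the base rate divided by the square root of a lowpass estimate of the squared gradient:

import torch

def normalized_step(w, g, v_state, lr=3e-4, beta_v=0.999, eps=1e-8):
    v_state = beta_v * v_state + (1.0 - beta_v) * g * g  # lowpass second-moment estimate
    alpha_t = lr / (v_state.sqrt() + eps)                # adaptive (normalized) step size
    return w - alpha_t * g, v_state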

Dependencies

The code is written entirely in Python, using PyTorch.

Getting Started (installing)

Download or clone locally with git.

>> git clone https://github.com/somefunagba/autosgm.git

PyTorch API:

Calling the implementation

Assume this repository was cloned with git directly into the root path of your project.

from opts.autosgml import AutoSGM

This loads an AutoSGM implementation.


Examples

Some examples from the PyTorch Examples repository have been added as demos. See the cases folder.


Possible options are documented in opts/autosgml. Most of the defaults likely do not need to be changed.

Assume a neural network model called mdl has been constructed with PyTorch. The following examples illustrate how the model's parameters, mdl.parameters(), may be optimized or learnt with this AutoSGM implementation.

By default, this implementation auto-tunes an initial learning rate iteratively; in the code snippet below, the initial value has been set as lr_init=1e-4.

optimizer = AutoSGM(mdl.parameters(), lr_init=1e-4)
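Assuming the optimizer follows the standard torch.optim interface (zero_grad, step), it then slots into the usual PyTorch training loop; the data loader and loss below are placeholders:

import torch.nn.functional as F

# mdl and train_loader are assumed to already exist.
optimizer = AutoSGM(mdl.parameters(), lr_init=1e-4)

for inputs, targets in train_loader:
    optimizer.zero_grad()                         # clear accumulated gradients
    loss = F.cross_entropy(mdl(inputs), targets)  # forward pass and loss
    loss.backward()                               # compute the first-order gradients g_t
    optimizer.step()                              # one AutoSGM update of the parameters w_t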

To use only moment estimation (a normalized gradient) in tuning the learning rate across all iterations, the code snippet below sets autolr=False and keeps a single constant value of lr_init=3e-4.

optimizer = AutoSGM(mdl.parameters(), autolr=False, lr_init=3e-4)

The code snippet below disables any optimal learning-rate estimation (autolr=None) and uses a single constant learning rate, lr_init=5e-4.

optimizer = AutoSGM(mdl.parameters(), lr_init=5e-4, autolr=None)

Apart from the initial learning rate, the other important parameters to configure are the 3 main lowpass (often called momentum) parameters in beta_cfg. The first two are, respectively, for iteratively smoothing the gradient input and smoothing the weight output. The third is for estimating the gradient's variance/moment, which also adapts the learning rate.

By smoothing, we mean that the lowpass filter is used to carefully filter high-frequency noise components from its input signal. By averaging, we mean that the lowpass filter is used to estimate a statistical expectation. Note that when using the first-order lowpass filter, the lowpass parameter is often less than or equal to 0.9 for smoothing, but greater than 0.9 for averaging (see the illustration after the configuration snippet below).

By default, the values in beta_cfg are sensible theoretical values; they may be changed depending on what works in practice and a feel for the linear nature of the learning system (neural network). Annotations on the available options are documented in opts/autosgml.

optimizer = AutoSGM(mdl.parameters(), lr_init=1e-4, beta_cfg=(0.9,0.1,0.999,0))
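To make the smoothing/averaging distinction concrete, the toy snippet below (an illustration, not code from this repository) runs the same first-order lowpass filter with a smoothing-regime beta and an averaging-regime beta on a noisy signal:

import torch

def ema(x: torch.Tensor, beta: float) -> torch.Tensor:
    # First-order lowpass filter: y_t = beta*y_{t-1} + (1 - beta)*x_t
    y, out = torch.zeros(()), []
    for x_t in x:
        y = beta * y + (1.0 - beta) * x_t
        out.append(y)
    return torch.stack(out)

noisy = torch.randn(1000) + 1.0      # noisy signal with expectation 1
smoothed = ema(noisy, beta=0.9)      # smoothing regime: beta <= 0.9, tracks the signal
averaged = ema(noisy, beta=0.999)    # averaging regime: beta > 0.9, approaches the mean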

Disclaimer

The code and style in this repository are still undergoing active development as part of my PhD work. Feel free to raise an issue if you detect any bug or have any questions.