Accelerated-Gradient-Descent

Exploring Optimization methods with parallel computation techniques

Problem Statement

Given a system of linear equations, solve it using iterative optimization techniques, e.g. gradient descent and its faster variants. A system of linear equations is given by

Ax = b,

where A is the coefficient matrix, x is the vector of variables, and b is the right-hand-side vector.

This problem can be converted to an optimization problem: for a symmetric positive definite A, solving Ax = b is equivalent to minimizing

f(x) = (1/2) x^T A x - b^T x.

The gradient of the above function is given by

∇f(x) = Ax - b.

The gradient at consecutive steps can grow large. To avoid this, both sides of the original equation are divided by the largest element of A (or by the determinant of A).
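As a minimal sketch, the rescaling just described (dividing both sides by the largest entry of A) can be written as follows; the helper name and the example system are illustrative, not taken from the repository:

```python
# Hypothetical helper: rescale the system A x = b by the largest
# entry of A so that the gradient A x - b stays moderate in early
# iterations. The solution of the system is unchanged.
def normalize_system(A, b):
    m = max(abs(a) for row in A for a in row)
    A_scaled = [[a / m for a in row] for row in A]
    b_scaled = [bi / m for bi in b]
    return A_scaled, b_scaled

A, b = normalize_system([[4.0, 2.0], [2.0, 10.0]], [10.0, 20.0])
```

Dividing by the determinant instead would only change the line computing `m`.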

Gradient Descent

Introduction

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

Algorithm

Gradient descent is based on the observation that if the multi-variable function F(x) is defined and differentiable in a neighborhood of a point a, then F decreases fastest if one goes from a in the direction of the negative gradient of F at a, i.e. -∇F(a).
It follows that, if

a_{n+1} = a_n - γ ∇F(a_n)

for γ small enough, then F(a_{n+1}) ≤ F(a_n). In other words, the term γ ∇F(a_n) is subtracted from a_n because we want to move against the gradient, towards the minimum.

With this observation in mind, one starts with a guess x_0 for a local minimum of F, and considers the sequence x_0, x_1, x_2, ..., such that

x_{n+1} = x_n - γ_n ∇F(x_n), n ≥ 0.

We have a monotonic sequence

F(x_0) ≥ F(x_1) ≥ F(x_2) ≥ ...,

so hopefully the sequence (x_n) converges to the desired local minimum. Note that the value of the step size γ is allowed to change at every iteration. With certain assumptions on the function F (for example, F convex and ∇F Lipschitz) and particular choices of γ, convergence to a local minimum can be guaranteed. When the function F is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
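The iteration above, applied to the linear-system objective f(x) = (1/2) x^T A x - b^T x with gradient Ax - b, can be sketched as follows; the matrix, step size, and tolerance are illustrative choices, not values from the repository:

```python
# Solving A x = b by gradient descent on f(x) = 0.5 x^T A x - b^T x,
# whose gradient is A x - b. A must be symmetric positive definite.

def matvec(A, x):
    """Matrix-vector product (one of the utility routines the text
    later parallelizes)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def gradient_descent(A, b, step, iters=10000, tol=1e-10):
    x = [0.0] * len(b)
    for _ in range(iters):
        grad = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # A x - b
        if sum(g * g for g in grad) ** 0.5 < tol:            # small gradient: stop
            break
        x = [xi - step * gi for xi, gi in zip(x, grad)]      # x <- x - step * grad
    return x

A = [[4.0, 1.0], [1.0, 3.0]]              # symmetric positive definite
x = gradient_descent(A, [1.0, 2.0], step=0.2)  # step must be < 2/L (L = largest eigenvalue)
```

The exact solution here is x = (1/11, 7/11), which the iteration approaches geometrically.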

Complexity and Convergence Analysis

Complexity

  • For strongly convex and smooth functions, gradient descent needs O(κ log(1/ε)) iterations, where ε is the error tolerance and κ = L/l is the condition number.
  • For convex and smooth functions, gradient descent needs O(1/ε) iterations, i.e. the error decays as O(1/k).

Convergence Analysis

  • The gradient descent algorithm focuses on improving the cost per iteration, which can be too "short-sighted" a view. We can instead exploit information from past iterations to get a better idea of the gradient.
  • The gradient descent trajectory can zigzag or change direction abruptly. Adding a buffer (momentum) yields a smoother trajectory and avoids these zigzags and abrupt changes.

Heavy Ball Method

Introduction

We consider the objective function f to be l-strongly convex and L-smooth. Thus, l·I ⪯ ∇²f(x) ⪯ L·I for all x.

The heavy-ball method is a multi-step iterative method that exploits iterates prior to the most recent one. It does so by maintaining the momentum of the previous two iterates. If you imagine each iterate as a point in the trajectory of a falling ball, then the points carry with them a momentum or a velocity. The heavy ball method is one way of guarding against zigzagging trajectories in iterative descent.

Algorithm

Our task is to minimize the function f(x) = (1/2) x^T A x - b^T x, where l·I ⪯ A ⪯ L·I. The heavy ball method is also called the Chebyshev iterative method. Its update rule is

x_{t+1} = x_t - η ∇f(x_t) + β (x_t - x_{t-1}),

where η is the step size and β is the momentum coefficient.

Complexity and Convergence Analysis

The heavy ball method needs O(√κ log(1/ε)) iterations, where κ = L/l; the error contracts by a factor of roughly (√κ - 1)/(√κ + 1) per step.
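The heavy-ball update x_{t+1} = x_t - η ∇f(x_t) + β (x_t - x_{t-1}) can be sketched as follows on the quadratic objective. The step and momentum below use the classic choices η = 4/(√L + √l)² and β = ((√L - √l)/(√L + √l))², with l, L the extreme eigenvalues of the illustrative matrix (these values are not from the repository):

```python
# Heavy-ball method for f(x) = 0.5 x^T A x - b^T x (gradient A x - b).
def heavy_ball(A, b, eta, beta, iters=10000, tol=1e-10):
    n = len(b)
    x_prev = [0.0] * n
    x = [0.0] * n
    for _ in range(iters):
        grad = [sum(a * xi for a, xi in zip(row, x)) - bi
                for row, bi in zip(A, b)]                     # A x - b
        if sum(g * g for g in grad) ** 0.5 < tol:
            break
        x_next = [xi - eta * gi + beta * (xi - xpi)           # gradient step + momentum
                  for xi, gi, xpi in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x

A = [[4.0, 1.0], [1.0, 3.0]]        # eigenvalues l ≈ 2.382, L ≈ 4.618
l, L = 2.382, 4.618
eta = 4 / ((L ** 0.5 + l ** 0.5) ** 2)
beta = ((L ** 0.5 - l ** 0.5) / (L ** 0.5 + l ** 0.5)) ** 2
x = heavy_ball(A, [1.0, 2.0], eta, beta)
```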

Nesterov's accelerated gradient methods

Introduction

Nesterov's accelerated gradient descent performs a simple gradient descent step to go from y_t to x_{t+1}, and then it 'slides' a little bit further than x_{t+1} in the direction given by the previous point x_t.

It alternates between gradient updates and proper extrapolation. Each iteration costs nearly the same as a GD step. It is not a descent method (i.e. we may not have f(x_{t+1}) ≤ f(x_t)).

Algorithm

In order to minimize the L-smooth convex function f, the update rule is:

x_{t+1} = y_t - (1/L) ∇f(y_t)
y_{t+1} = x_{t+1} + ((t - 1)/(t + 2)) (x_{t+1} - x_t),

with y_1 = x_0.
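A sketch of Nesterov's two-step update, x_{t+1} = y_t - (1/L) ∇f(y_t) followed by y_{t+1} = x_{t+1} + ((t - 1)/(t + 2)) (x_{t+1} - x_t), applied to the quadratic objective f(x) = (1/2) x^T A x - b^T x; the matrix and step size are illustrative:

```python
# Nesterov's accelerated gradient for f(x) = 0.5 x^T A x - b^T x.
def nesterov(A, b, step, iters=10000, tol=1e-10):
    n = len(b)
    x = [0.0] * n
    y = x[:]                                              # y_1 = x_0
    for t in range(1, iters + 1):
        grad = [sum(a * yi for a, yi in zip(row, y)) - bi
                for row, bi in zip(A, b)]                 # grad f(y_t) = A y - b
        if sum(g * g for g in grad) ** 0.5 < tol:
            break
        x_new = [yi - step * gi for yi, gi in zip(y, grad)]       # gradient step
        y = [xn + ((t - 1) / (t + 2)) * (xn - xi)                 # extrapolation
             for xn, xi in zip(x_new, x)]
        x = x_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
x = nesterov(A, [1.0, 2.0], step=0.2)   # step ≈ 1/L, with L ≈ 4.62
```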

Complexity and Convergence Analysis

The iteration complexity of the algorithm is O(1/√ε), i.e. the error decays as O(1/t²). This is much faster than plain gradient descent. Also, the seemingly mysterious momentum coefficient (t - 1)/(t + 2) can be justified by analyzing a limiting differential equation that models damped spring motion.

Accelerated proximal gradient methods (FISTA)

Introduction

The fast iterative shrinkage-thresholding algorithm (FISTA) preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. It again is not a descent method, but works as well as Nesterov's method.

Algorithm

In order to minimize the function F(x) = f(x) + g(x), where f is L-smooth and convex and g is convex (possibly non-smooth), the update rule is:

x_t = prox_{g/L}(y_t - (1/L) ∇f(y_t))
y_{t+1} = x_t + ((λ_t - 1)/λ_{t+1}) (x_t - x_{t-1}),

where λ_1 = 1 and λ_{t+1} = (1 + √(1 + 4λ_t²))/2.

Here, the non-smooth part g is taken to be zero. In general, the proximal operator of any function h with parameter γ is given by

prox_{γh}(v) = argmin_x ( h(x) + (1/(2γ)) ||x - v||² ).

If the function h is a constant, which is the case here, the proximal operator reduces to the identity, prox(v) = v. Hence the update rule is essentially Nesterov's, with the λ-based momentum coefficient.
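With g ≡ 0 the prox step is the identity, so FISTA on the quadratic objective reduces to a gradient step at y_t followed by the λ-based extrapolation. A sketch (matrix and step size illustrative):

```python
# FISTA with g ≡ 0 (prox = identity) on f(x) = 0.5 x^T A x - b^T x.
def fista(A, b, step, iters=10000, tol=1e-10):
    n = len(b)
    x = [0.0] * n
    y = x[:]
    lam = 1.0                                             # λ_1 = 1
    for _ in range(iters):
        grad = [sum(a * yi for a, yi in zip(row, y)) - bi
                for row, bi in zip(A, b)]                 # grad f(y_t)
        if sum(g * g for g in grad) ** 0.5 < tol:
            break
        x_new = [yi - step * gi for yi, gi in zip(y, grad)]   # prox is the identity here
        lam_next = (1 + (1 + 4 * lam * lam) ** 0.5) / 2       # λ_{t+1}
        y = [xn + ((lam - 1) / lam_next) * (xn - xi)          # extrapolation
             for xn, xi in zip(x_new, x)]
        x, lam = x_new, lam_next
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
x = fista(A, [1.0, 2.0], step=0.2)
```

For a non-trivial g (e.g. an L1 penalty), only the line marked "prox is the identity" would change.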

Complexity and Convergence Analysis

FISTA has better iteration complexity (O(1/√ε)) than the proximal gradient method (O(1/ε)). It is also fast in practice whenever the proximal operator can be evaluated efficiently. Its convergence behaves like that of Nesterov's method, since the coefficient (λ_t - 1)/λ_{t+1} is asymptotically equivalent to (t - 1)/(t + 2).

Results from Serial Code

The number of steps required to converge for each algorithm is plotted against the size of the problem, n.

Shift to Parallel Programming Paradigm

The serial implementation works fine and takes a reasonable amount of time while the size of matrix A stays below a certain threshold. But as the size of A grows, the convergence time of all four techniques grows with it. That is why we need a parallel implementation of the same algorithms: to reduce the convergence time significantly even for large A.

Parallel Segment Identification

Gradient descent and its three variants all call a handful of basic utility functions to perform matrix and vector operations. These operations dominate the running time of the implementation: if they are slow, the whole implementation is slow, and vice versa. So, to handle a large matrix A, we parallelize these utility functions, which reduces the overall running time significantly. We use NVIDIA's parallel computing architecture, CUDA, to do so.
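The core idea, splitting a matrix-vector product row-wise across workers, can be sketched in Python. In the actual project this decomposition is done with CUDA threads on the GPU; Python threads will not speed up this CPU-bound loop (because of the GIL), but the parallel structure is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_parallel(A, x, workers=4):
    """Row-wise parallel matrix-vector product: each worker computes
    one row's dot product, mirroring how a CUDA kernel would assign
    rows to threads."""
    def row_dot(row):
        return sum(a * xi for a, xi in zip(row, x))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row_dot, A))

y = matvec_parallel([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])   # [3.0, 7.0]
```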

Parallel Implementation

This section includes the pseudo-code for the matrix and vector functions implemented by us. We can parallelize the whole code by parallelizing the matrix/vector operations, i.e. parallelizing at the lowest level. Since everything boils down to the matrix operations, this yields an overall performance gain.

However, this involves copying data for each of the sub-operations in the gradient computation. We can avoid this copying by implementing a dedicated function for parallel gradient computation.

Comparison of Parallel and Serial Code

The following table compares the time taken by the respective utility functions in the serial and the parallel (CUDA) implementations.

Observation

For performance analysis, the time taken versus n is plotted for n = 1 to n = 100000, where n denotes the size of the matrix A, for each of the four techniques. The number of steps taken to converge is also plotted against n for the serial implementation, and the same graph is obtained for the parallel implementation.

Following observations can be made for serial implementation:

  • The steps/time taken to converge for the heavy ball method is less than for GD.
  • The steps/time taken to converge for Nesterov and FISTA is almost the same, and less than for the heavy ball method.

Following observations can be made for parallel implementation:

  • The steps/time taken to converge for the heavy ball method is less than for GD.
  • The steps/time taken to converge for Nesterov and FISTA is almost the same, and less than for the heavy ball method.

Comparison of two implementations:

  • The time taken by each utility function is measured for both the serial and the parallel implementation.
  • The performance of these utility functions is shown as the ratio of CPU time to GPU time.
  • The number of steps taken to converge is unchanged.
  • The time taken to converge is reduced compared to the serial implementation.
