Exploring Optimization Methods with Parallel Computation Techniques
Given a system of linear equations, solve it using iterative optimization techniques, e.g. gradient descent and its faster variants. A system of linear equations is given by
$$Ax = b,$$
where $A$ is the coefficient matrix, $x$ is the vector of variables, and $b$ is the vector of constants.
This problem can be converted to an optimization problem: assuming $A$ is symmetric positive definite, solving $Ax = b$ is equivalent to minimizing
$$f(x) = \tfrac{1}{2} x^T A x - b^T x.$$
The gradient of the above function is given by
$$\nabla f(x) = Ax - b,$$
which vanishes exactly when $Ax = b$.
The gradient at consecutive steps can grow large. To avoid this, both sides of the original equation are scaled down, e.g. divided by the maximum element of the system or by the determinant of the matrix $A$.
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
Gradient descent is based on the observation that if the multi-variable function $F(\textbf{x})$ is defined and differentiable in a neighborhood of a point $\textbf{a}$, then $F(\textbf{x})$ decreases fastest if one goes from $\textbf{a}$ in the direction of the negative gradient of $F$ at $\textbf{a}$, i.e. $-\nabla F(\textbf{a})$.
It follows that, if
$$\textbf{a}_{n+1} = \textbf{a}_n - \gamma \nabla F(\textbf{a}_n)$$
for $\gamma > 0$ small enough, then $F(\textbf{a}_n) \geq F(\textbf{a}_{n+1})$. In other words, the term $\gamma \nabla F(\textbf{a})$ is subtracted from $\textbf{a}$ because we want to move against the gradient, towards the minimum.
With this observation in mind, one starts with a guess $\textbf{x}_0$ for a local minimum of $F$, and considers the sequence $\textbf{x}_0, \textbf{x}_1, \textbf{x}_2, \ldots$ such that
$$\textbf{x}_{n+1} = \textbf{x}_n - \gamma_n \nabla F(\textbf{x}_n), \quad n \geq 0.$$
We have a monotonic sequence
$$F(\textbf{x}_0) \geq F(\textbf{x}_1) \geq F(\textbf{x}_2) \geq \cdots,$$
so hopefully the sequence $(\textbf{x}_n)$ converges to the desired local minimum. Note that the value of the step size $\gamma$ is allowed to change at every iteration. With certain assumptions on the function $F$ (for example, $F$ convex and $\nabla F$ Lipschitz) and particular choices of $\gamma$, convergence to a local minimum can be guaranteed. When the function $F$ is convex, all local minima are also global minima, so in this case gradient descent can converge to the global solution.
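As a concrete illustration, here is a minimal serial sketch of gradient descent applied to $f(x) = \tfrac{1}{2} x^T A x - b^T x$; the helper `matvec`, the fixed step size `gamma`, and the stopping tolerance are illustrative choices, not the exact implementation described in this report.

```cpp
#include <vector>
#include <cmath>

// Illustrative helper: y = A*x for a dense n x n matrix stored row-major.
std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x, int n) {
    std::vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}

// Gradient descent on f(x) = 0.5*x'Ax - b'x, whose gradient is Ax - b.
// gamma is the step size; we stop once the gradient norm drops below tol.
std::vector<double> gradient_descent(const std::vector<double>& A,
                                     const std::vector<double>& b,
                                     int n, double gamma,
                                     double tol, int max_iter) {
    std::vector<double> x(n, 0.0);                     // initial guess x_0 = 0
    for (int it = 0; it < max_iter; ++it) {
        std::vector<double> g = matvec(A, x, n);       // g = A*x
        double norm2 = 0.0;
        for (int i = 0; i < n; ++i) {
            g[i] -= b[i];                              // g = A*x - b = grad f(x)
            norm2 += g[i] * g[i];
        }
        if (std::sqrt(norm2) < tol) break;             // converged
        for (int i = 0; i < n; ++i)
            x[i] -= gamma * g[i];                      // x_{n+1} = x_n - gamma*g
    }
    return x;
}
```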
- For strongly convex and smooth functions, the convergence rate of gradient descent is $O(\log(1/\epsilon))$ iterations, where $\epsilon$ is the error tolerance.
- For convex and smooth functions, the convergence rate of gradient descent is $O(1/\epsilon)$ iterations.
- The gradient descent algorithm focuses on improving the cost per iteration, which might be too "short-sighted" a view. So, we can exploit information from past iterations to get a better idea of the gradient.
- The gradient descent algorithm might sometimes zigzag or experience abrupt changes. So we can add a buffer (like momentum) to yield a smoother trajectory and hence avoid zigzags and abrupt changes.
We consider the objective function to be $l$-strongly convex and $L$-smooth. Thus, $l I \preceq \nabla^2 f(x) \preceq L I$ for all $x$; the ratio $\kappa = L/l$ is the condition number, and for our quadratic $l$ and $L$ are the smallest and largest eigenvalues of $A$.
The heavy-ball method is a multi-step iterative method that exploits iterates prior to the most recent one. It does so by maintaining the momentum of the previous two iterates. If you imagine each iterate as a point in the trajectory of a falling ball, then the points carry with them a momentum or a velocity. The heavy ball method is one way of guarding against zigzagging trajectories in iterative descent.
Our task is to minimize the function $f(x) = \tfrac{1}{2} x^T A x - b^T x$, where $A$ is symmetric positive definite. For such quadratics the heavy-ball method is closely related to the Chebyshev iterative method. Its update rule is
$$x_{t+1} = x_t - \eta_t \nabla f(x_t) + \theta_t (x_t - x_{t-1}).$$
The convergence rate of the heavy-ball method is $O(\sqrt{\kappa} \log(1/\epsilon))$ iterations.
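A minimal serial sketch of the heavy-ball update for the same quadratic follows; the step size `eta` and momentum `theta` are left as plain parameters (in practice they would be tuned from $l$ and $L$), and `matvec` is the illustrative helper from the gradient descent sketch.

```cpp
#include <vector>

std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x, int n); // as in the GD sketch

// Heavy-ball iteration x_{t+1} = x_t - eta*grad f(x_t) + theta*(x_t - x_{t-1})
// for f(x) = 0.5*x'Ax - b'x.
std::vector<double> heavy_ball(const std::vector<double>& A,
                               const std::vector<double>& b,
                               int n, double eta, double theta,
                               int max_iter) {
    std::vector<double> x(n, 0.0), x_prev(n, 0.0);
    for (int t = 0; t < max_iter; ++t) {
        std::vector<double> g = matvec(A, x, n);        // g = A*x
        std::vector<double> x_next(n);
        for (int i = 0; i < n; ++i) {
            g[i] -= b[i];                               // grad f = A*x - b
            x_next[i] = x[i] - eta * g[i]               // gradient step
                      + theta * (x[i] - x_prev[i]);     // momentum term
        }
        x_prev = x;
        x = x_next;
    }
    return x;
}
```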
Nesterov's accelerated gradient descent performs a simple step of gradient descent to go from $x_t$ to $y_{t+1}$, and then it 'slides' a little bit further than $y_{t+1}$ in the direction given by the previous point $y_t$.
It alternates between gradient updates and proper extrapolation; each iteration costs nearly the same as a GD step. It is not a descent method (i.e. we may not have $f(x_{t+1}) \leq f(x_t)$).
In order to minimize the function $f(x) = \tfrac{1}{2} x^T A x - b^T x$, the update rule is:
$$y_{t+1} = x_t - \tfrac{1}{L} \nabla f(x_t), \qquad x_{t+1} = y_{t+1} + \tfrac{t-1}{t+2} (y_{t+1} - y_t).$$
The iteration complexity of the algorithm is $O(1/\sqrt{\epsilon})$, which overall is much faster than gradient descent. Also, the seemingly mystical coefficient $\tfrac{t-1}{t+2}$ can be shown to work by studying the method's continuous-time limit, a differential equation describing damped spring motion.
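The following is a minimal sketch of Nesterov's scheme for the same quadratic, using the $\tfrac{t-1}{t+2}$ coefficient; `L_smooth` stands for the smoothness constant $L$, and `matvec` is again the illustrative helper from the gradient descent sketch.

```cpp
#include <vector>

std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x, int n); // as in the GD sketch

// Nesterov's accelerated gradient for f(x) = 0.5*x'Ax - b'x:
// a gradient step from x_t to y_{t+1}, then a slide a bit past y_{t+1}
// in the direction away from the previous point y_t.
std::vector<double> nesterov(const std::vector<double>& A,
                             const std::vector<double>& b,
                             int n, double L_smooth, int max_iter) {
    std::vector<double> x(n, 0.0), y(n, 0.0);            // x_0 = y_0 = 0
    for (int t = 1; t <= max_iter; ++t) {
        std::vector<double> g = matvec(A, x, n);         // g = A*x_t
        std::vector<double> y_next(n);
        for (int i = 0; i < n; ++i)
            y_next[i] = x[i] - (g[i] - b[i]) / L_smooth; // y_{t+1} = x_t - grad/L
        double coef = (t - 1.0) / (t + 2.0);             // momentum coefficient
        for (int i = 0; i < n; ++i)
            x[i] = y_next[i] + coef * (y_next[i] - y[i]); // extrapolation
        y = y_next;
    }
    return y;                                            // last gradient-step iterate
}
```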
The fast iterative shrinkage-thresholding algorithm (FISTA) preserves the computational simplicity of ISTA (the proximal gradient method), but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. It again is not a descent method, but it works as well as Nesterov's method.
In order to minimize the function $F(x) = f(x) + g(x)$, where $f(x) = \tfrac{1}{2} x^T A x - b^T x$ is smooth and $g$ is possibly non-smooth (here $g = 0$), the update rule is:
$$x_t = \mathrm{prox}_{g/L}\!\left( y_t - \tfrac{1}{L} \nabla f(y_t) \right), \qquad \lambda_{t+1} = \frac{1 + \sqrt{1 + 4\lambda_t^2}}{2}, \qquad y_{t+1} = x_t + \frac{\lambda_t - 1}{\lambda_{t+1}} (x_t - x_{t-1}).$$
Here, the function inside the proximal step is taken to be zero. In general, the proximal operator of a function $g$ with parameter $\lambda$ is given by
$$\mathrm{prox}_{\lambda g}(v) = \arg\min_x \left( g(x) + \frac{1}{2\lambda} \|x - v\|^2 \right).$$
If the function is taken to be a constant, which is the case with us, the minimizer is $x = v$, so the proximal operator is simply the identity. Hence, the update rule reduces to a plain gradient step followed by extrapolation, the same as Nesterov's.
FISTA has improved iteration complexity ($O(1/\sqrt{\epsilon})$) over the proximal gradient method ($O(1/\epsilon)$). It is also fast whenever the proximal operator can be implemented efficiently. Its convergence is similar to that of Nesterov's method, since the coefficient $\frac{\lambda_t - 1}{\lambda_{t+1}}$ is asymptotically equivalent to $\frac{t-1}{t+2}$.
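Below is a minimal sketch of FISTA for our setting, where $g = 0$ makes the proximal step the identity; the $\lambda_t$ recursion is implemented directly, and `matvec` and `L_smooth` are the same illustrative names as before.

```cpp
#include <vector>
#include <cmath>

std::vector<double> matvec(const std::vector<double>& A,
                           const std::vector<double>& x, int n); // as in the GD sketch

// FISTA for F(x) = f(x) + g(x) with f(x) = 0.5*x'Ax - b'x and g = 0,
// so prox_{g/L} is the identity and the prox step is a plain gradient step.
std::vector<double> fista(const std::vector<double>& A,
                          const std::vector<double>& b,
                          int n, double L_smooth, int max_iter) {
    std::vector<double> x(n, 0.0), x_prev(n, 0.0), y(n, 0.0);
    double lambda = 1.0;                                  // lambda_1 = 1
    for (int t = 0; t < max_iter; ++t) {
        std::vector<double> g = matvec(A, y, n);          // g = A*y_t
        for (int i = 0; i < n; ++i)
            x[i] = y[i] - (g[i] - b[i]) / L_smooth;       // prox_0(v) = v
        double lambda_next = 0.5 * (1.0 + std::sqrt(1.0 + 4.0 * lambda * lambda));
        double coef = (lambda - 1.0) / lambda_next;       // -> (t-1)/(t+2) asymptotically
        for (int i = 0; i < n; ++i)
            y[i] = x[i] + coef * (x[i] - x_prev[i]);      // extrapolation
        x_prev = x;
        lambda = lambda_next;
    }
    return x;
}
```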
The number of steps each algorithm requires to converge is shown as a function of the problem size $n$.
The serial implementation works fine and takes only a reasonable amount of time while the size of matrix $A$ is below a certain threshold. But as the size of $A$ increases, the convergence time for all four techniques grows, which is why we need a parallel version of the same serial implementation: to reduce the convergence time significantly even for large $A$.
Gradient descent and its three variants call some basic utility functions to perform matrix and vector operations, and these operations determine the time complexity of the implementation: if they are slow, the overall implementation is slow, and vice versa. So, to tackle this problem for large $A$, we parallelize these utility functions, which reduces the overall running time significantly. We use NVIDIA's parallel computing architecture, CUDA, to parallelize them.
This section includes the pseudo code for the matrix and vector functions implemented by us. We can parallelize the whole code by parallelizing the matrix/vector operations, i.e. parallelizing at the lowest level. Since everything boils down to these matrix operations, we get an overall performance gain; a sketch of one such kernel follows.
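As an example of such a utility function, here is a minimal CUDA sketch of a dense matrix-vector product with one thread per output row; the kernel name, block size, and `double` precision are illustrative assumptions, not the exact kernels used in our implementation.

```cpp
#include <cuda_runtime.h>

// One thread per output row: y[i] = sum_j A[i*n + j] * x[j].
__global__ void matvec_kernel(const double* A, const double* x,
                              double* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

// Host-side wrapper; d_A, d_x, d_y are device pointers from cudaMalloc.
void matvec_gpu(const double* d_A, const double* d_x, double* d_y, int n) {
    int threads = 256;                                  // illustrative block size
    int blocks = (n + threads - 1) / threads;
    matvec_kernel<<<blocks, threads>>>(d_A, d_x, d_y, n);
    cudaDeviceSynchronize();                            // wait for the kernel
}
```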
However, this approach involves copying data for each of the sub-operations in the gradient calculation. We can avoid this by implementing a dedicated function for parallel gradient computation, as sketched below.
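A minimal sketch of such a fused kernel follows: it computes the whole gradient $Ax - b$ in a single launch, so the intermediate product $Ax$ never has to be written out and passed between separate utility calls. Again, the names and launch configuration are illustrative.

```cpp
#include <cuda_runtime.h>

// Fused kernel: g[i] = (A*x)[i] - b[i] in a single launch, avoiding the
// separate matvec and vector-subtract calls (and their data movement).
__global__ void gradient_kernel(const double* A, const double* x,
                                const double* b, double* g, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        g[i] = sum - b[i];                              // grad f(x) = A*x - b
    }
}
```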
The following table compares the time taken by the respective utility functions in the serial implementation and the parallel CUDA implementation.
For performance analysis, the time taken versus $n$ is plotted for $n = 1$ to $n = 100000$, where $n$ denotes the size of matrix $A$, for each of the four techniques. The number of steps taken to converge is also plotted against $n$ for the serial implementation, and the same graph is obtained for the parallel implementation.
The following observations can be made for the serial implementation:
- The steps/time taken to converge for heavy ball is less than for GD.
- The steps/time taken to converge for Nesterov and FISTA are almost the same, and less than for heavy ball.
The following observations can be made for the parallel implementation:
- The steps/time taken to converge for heavy ball is less than for GD.
- The steps/time taken to converge for Nesterov and FISTA are almost the same, and less than for heavy ball.
Comparison of the two implementations:
- The time taken by each individual utility function is measured for the serial and parallel implementations.
- The performance of these utility functions is shown as the ratio of time taken by the CPU to time taken by the GPU.
- Parallelization has no effect on the number of steps taken to converge.
- The time taken to converge is reduced compared to the serial implementation.