Intuition Behind Accelerated First Order Methods

Suppose that we want to solve the following convex optimization problem:


where we assumed that g(x) is convex and differentiable, h(x) is convex (here I am trying to be as non-specific as possible). Then recall that the generalized gradient descent can be formulated as follows:

Step 0: choose initial x^{0} \in \mathbb{R}^n

Step k: k \ge 1: x^{(k)} = \prox_h (x^{(k-1)} – t_k \nabla g(x^{(k-1)}), t_k)

where \prox is a proximal operator defined as \prox_h(y,t) := \argmin \limits_{x \in \mathbb{R}^n} h(x) + \frac{1}{2t} \|y-x\|^2

It is known that if \nabla g(x) is Lipschitz continuous and proximal operator can be evaluated, the convergence rate of will be O(1/k). This result can be accelerated to achieve O(1/k^2).

First time proposed by Nesterov in 1983 for smooth functions, the idea of acceleration still remains an active topic of research (for non-smooth, composite functions, etc.). It is not easy to read Nesterov’s works (very mathematical), but in order to get an understanding of the concept it is sufficient to look at ISTA (Iterative Tthresholding Algorithms) and FISTA (Fast Iterative Thresholding Algorithms). In particular, my questions below will be based on FISTA’s example:

Roughly speaking, acceleration is achieved by introducing one more sequence of numbers y_k constructed as a specific linear combination of x_k and x_{k-1}; proximal function operates then on y_{k} instead of x_k. In case of FISTA we have:

t_{k+1} = \frac{1 + \sqrt{1 + 4t^2_k}}{2}

y_{k+1} = x_k + \frac{t_k – 1}{t_{k+1}}(x_k – x_{k-1})

Note, that the sequence t_k satisfies t^2_k = t^2_{k+1} – t_{k+1}; this is justified in the proof of the convergence for this algorithm.

Is there any intuitive way to explain, interpret such an approach? Why such a specific combination works and brings such a perceptible improvement to the convergence rate? Can we find an intuitive way to interpret t? Probably somebody is actually familiar with Nesterov’s works and have more knowledge that I do about some other reasons why t_k is given in this form at first place?


I asked this question myself while trying to understand accelerated methods, asked around, and was pointed to this paper by a professor to help gain some intuition:

To summarize: Su et. al. take Nesterov’s accelerated gradient method and take the step size to be infinitesimally small to derive the following ODE
\ddot{X} + \frac{3}{t} \dot{X} + \nabla f(X) = 0
with initial conditions X(0) = x_0 and \dot{X}(0) = 0. By analyzing this ODE, we can get a better idea about what Nesterov’s accelerated gradient method is doing. Hope this resource is helpful!

Source : Link , Question Author : trembik , Answer Author : Michael Shi

Leave a Comment