In this post, we will explain how Gradient Descent (GD) works and why it can converge very slowly.

The simplest first-order optimization algorithm is Gradient Descent. It is used to minimize a *convex differentiable* function $f$ over $\mathbb{R}^d$. The update rule is simply $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, where $\eta_t > 0$ is the stepsize, and the pseudo-code is in Algorithm 1.
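As a minimal sketch of the update rule in Python/NumPy (the quadratic objective below is just an illustrative choice, not from the post):

```python
import numpy as np

def gradient_descent(grad, x0, stepsize, iters):
    # GD update: x_{t+1} = x_t - eta * grad(x_t)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - stepsize * grad(x)
    return x

# Illustrative objective f(x) = 0.5 * ||x||^2, whose gradient is x
# and whose minimum is at the origin.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0], stepsize=0.1, iters=200)
```

Each iteration multiplies the distance to the minimizer by $(1 - \eta)$ here, so the iterates shrink geometrically toward zero.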

Last time we saw that Subgradient Descent (SD) is not a descent method. What about GD? GD with an appropriate choice of the stepsizes is a descent method, that is, $f(x_{t+1}) \le f(x_t)$. However, GD is far from being perfect, because *the negative gradient does not exactly point towards the minimum*. Let’s see why.

**Condition Number.** To better understand this concept we will use a nice property of gradients, often overlooked. First, let’s define the *level set* of a function, that is, the set of all points $x$ that have the same value of $f(x)$.

Definition 1. The $\alpha$-level set of a function $f: \mathbb{R}^d \to \mathbb{R}$ is defined as $\{x \in \mathbb{R}^d : f(x) = \alpha\}$.

To visualize the level set, consider the two-dimensional example of a topographic map in Figure 1. The black lines are exactly the level sets corresponding to different values of the 2-d function, that in this case are just the altitudes.

We can now state the following proposition.

Proposition 2. Let $f$ be a differentiable function from $\mathbb{R}^d$ to $\mathbb{R}$ and $x \in \mathbb{R}^d$. Then, $\nabla f(x)$ is orthogonal to the level set $\{y \in \mathbb{R}^d : f(y) = f(x)\}$ at $x$.

We will not prove it, because there is not much for us to learn from the proof.

What does it mean in words? It means that if I draw the level sets of a function, I know immediately in which direction the gradient goes. This is very powerful to develop a geometric intuition of how GD works. For example, if we take even simple convex two-dimensional functions, we can see that the negative gradient does not really point where we would like it to point.

Let’s see an example. Consider the two-dimensional function in Figure 2, whose minimum is marked in the figure.

The negative gradient at the highlighted point is drawn as a black arrow in Figure 2. As stated above, this gradient is orthogonal to the level set passing through that point. See how the negative gradient is pointing in a direction that is not towards the minimum. In general, for convex functions, the angle between the negative gradient at a point $x$ and the vector that connects the minimum $x^\star$ to $x$ can be arbitrarily close to 90 degrees.

Should we be worried about it? Definitely! The fact that the negative gradient does not point directly to the minimum is *exactly* what slows down the convergence of GD. In fact, even selecting the step size in the optimal way, i.e. choosing the stepsize that guarantees the maximum decrease of the function at each step, we obtain the behavior in Figure 2. Did you expect the sawtooth path of GD?

On the other hand, if we consider the function $f(x) = x_1^2 + x_2^2$, that has the minimum in the origin, the level sets are circles centered in the origin, see Figure 3. So, at any point the negative gradient will point exactly towards the minimum! In this case, GD with an appropriate stepsize will reach the solution in *one* step.
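The two cases can be contrasted numerically. A small sketch, assuming diagonal quadratics with hand-picked eigenvalues (the specific numbers are illustrative):

```python
import numpy as np

def gd_quadratic(hess_diag, x0, eta, iters):
    # GD on f(x) = 0.5 * sum_i h_i * x_i^2, whose gradient is h * x.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - eta * hess_diag * x
    return x

# Well-conditioned: f(x) = x1^2 + x2^2, Hessian 2*I (condition number 1).
# The stepsize 1/2 = 1/(max eigenvalue) reaches the minimum in ONE step.
one_step = gd_quadratic(np.array([2.0, 2.0]), [3.0, -1.0], eta=0.5, iters=1)

# Ill-conditioned: Hessian diag(2, 100) (condition number 50). The safe
# stepsize 1/100 kills the steep direction but crawls along the flat one.
slow = gd_quadratic(np.array([2.0, 100.0]), [3.0, -1.0], eta=0.01, iters=100)
```

After 100 iterations on the ill-conditioned problem, the first coordinate has barely moved, while the well-conditioned problem is solved in a single step.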

Can we capture in some way the difference between these two cases? Yes, with the **condition number**, exactly that obscure concept that Ali Rahimi mentioned in his very controversial talk.

We define the condition number as the maximum eigenvalue of the Hessian of a function divided by the minimum one. We can show that a high condition number corresponds to slow convergence, while a low condition number (the minimum is of course 1) corresponds to fast convergence. Note that for a two-dimensional quadratic function, the maximum and minimum eigenvalues are proportional to the lengths of the axes of the ellipses formed by the level sets. So, the condition number exactly captures the geometry of our two examples above.
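In code, the definition is one line; a sketch assuming a quadratic objective (for $f(x) = \frac{1}{2} x^\top A x$ the Hessian is the constant matrix $A$):

```python
import numpy as np

# Condition number = max eigenvalue / min eigenvalue of the Hessian.
A = np.array([[2.0, 0.0],
              [0.0, 100.0]])
eigs = np.linalg.eigvalsh(A)      # eigenvalues of the symmetric matrix A
kappa = eigs.max() / eigs.min()   # condition number
```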

**Convergence Guarantee.** It is now time for some math, to put all of this together and prove the convergence rate of gradient descent, which we expect to depend on the condition number. The proof works by showing that at each step we decrease the value of the objective function proportionally to the squared norm of the gradient, and that the norm of the gradient itself depends on the suboptimality gap.

Theorem 3. Let $f$ be a convex twice differentiable function from $\mathbb{R}^d$ to $\mathbb{R}$, let $M$ be the maximum eigenvalue of its Hessian and $m > 0$ the minimum one. Set $\eta_t = \frac{1}{M}$. Then we have

$$f(x_{T+1}) - \min_x f(x) \le \left(1 - \frac{m}{M}\right)^T \left(f(x_1) - \min_x f(x)\right).$$

*Proof:* Denoting by $H(z)$ the Hessian of $f$ in $z$, from the Taylor expansion of $f$ around $x_t$, we have

$$f(x_{t+1}) = f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{1}{2} (x_{t+1} - x_t)^\top H(z) (x_{t+1} - x_t),$$

for some $z$ on the line segment between $x_t$ and $x_{t+1}$. Hence, the assumption on the maximum eigenvalue of the Hessian implies

$$f(x_{t+1}) \le f(x_t) - \eta_t \|\nabla f(x_t)\|^2 + \frac{M \eta_t^2}{2} \|\nabla f(x_t)\|^2.$$

Note that our definition of the step size, $\eta_t = \frac{1}{M}$, maximizes the decrement of this upper bound. Indeed, we have

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2M} \|\nabla f(x_t)\|^2. \qquad (1)$$

Now, using again the Taylor expansion, for some $z'$ on the line segment between $x_t$ and $y$, we have

$$f(y) = f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{1}{2} (y - x_t)^\top H(z') (y - x_t) \ge f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{m}{2} \|y - x_t\|^2.$$

Hence, using the fact that the right-hand side is minimized by $y = x_t - \frac{1}{m} \nabla f(x_t)$, we have

$$\min_y f(y) \ge f(x_t) - \frac{1}{2m} \|\nabla f(x_t)\|^2. \qquad (2)$$

Putting (1) and (2) together and subtracting $\min_y f(y)$ from both sides, we have

$$f(x_{t+1}) - \min_y f(y) \le \left(1 - \frac{m}{M}\right) \left(f(x_t) - \min_y f(y)\right).$$

Iterating over $t = 1, \dots, T$ gives the stated bound.

So, we just proved that in GD the suboptimality gap shrinks exponentially fast! And the exponent depends on the condition number. Also, this rate is almost optimal: using acceleration, the dependency improves from the condition number to its square root.
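The per-step contraction of the suboptimality gap can be checked numerically. A small sketch, assuming a diagonal quadratic with hand-picked eigenvalues $m = 1$ and $M = 10$ and the stepsize $1/M$ from the theorem:

```python
import numpy as np

h = np.array([1.0, 10.0])                # Hessian eigenvalues: m = 1, M = 10
f = lambda x: 0.5 * np.sum(h * x ** 2)   # minimum value is f* = 0
x = np.array([5.0, -3.0])
eta = 1.0 / h.max()                      # eta = 1/M
rate = 1.0 - h.min() / h.max()           # theoretical contraction 1 - m/M
gaps = []
for _ in range(20):
    gaps.append(f(x))
    x = x - eta * h * x                  # one GD step
ratios = [gaps[t + 1] / gaps[t] for t in range(len(gaps) - 1)]
```

Every observed per-step ratio stays at or below the theoretical factor $1 - m/M$.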

A curiosity: This rate of convergence is called *linear*, because when you plot the logarithm of the suboptimality with respect to the number of iterations you get a straight line.

**How to Choose the Stepsizes for GD in Practice?** In the above theorem we assumed the stepsize to be constant and equal to $\frac{1}{M}$. I want to remind you that the constant step size can be used because GD will automatically slow down approaching the minimum due to the smoothness of the function. Remember that this does not happen in subgradient descent on non-differentiable functions, see previous blog post.

So, in order to set the stepsize, the theorem assumes that you know the maximum eigenvalue of the Hessian of the function that you are minimizing. Also, it implicitly assumes the maximum eigenvalue to be bounded. However, the boundedness assumption can be removed, observing that GD strictly decreases the function value, so we only have to worry about the maximum eigenvalue of the Hessian inside the sublevel set $\{x : f(x) \le f(x_1)\}$. But, even with this trick, we still don’t know $M$.

So, is this just another useless theorem? No, because the above theorem also holds if at each time step we select $\eta_t$ by **exact line search**, in order to guarantee the maximum decrease in the direction of the negative gradient. We leave the minor change to the proof as an exercise. Also, we don’t have to use exact line search: one can prove that a similar guarantee holds for an inexact line search that gives a *sufficient* decrease of the function value: google “Backtracking line search”.
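A minimal sketch of backtracking line search under the usual Armijo sufficient-decrease condition (the function, constants, and names below are illustrative assumptions, not from the post):

```python
import numpy as np

def backtracking_gd_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4):
    # Shrink eta until the sufficient-decrease (Armijo) condition
    # f(x - eta*g) <= f(x) - c*eta*||g||^2 holds, then take the step.
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - c * eta * np.dot(g, g):
        eta *= beta
    return x - eta * g

# Quadratic whose eigenvalues (1 and 25) are UNKNOWN to the algorithm:
# no 1/M stepsize needs to be supplied.
h = np.array([1.0, 25.0])
f = lambda x: 0.5 * np.sum(h * x ** 2)
grad_f = lambda x: h * x
x = np.array([2.0, 2.0])
for _ in range(200):
    x = backtracking_gd_step(f, grad_f, x)
```

Note that the loop converges without ever being told the maximum eigenvalue of the Hessian, which is exactly the point of the line search.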

So, we just discovered that GD with a good line search method is a **parameter-free** algorithm! This is not so surprising: “batch” optimization is a very mature field, where the culture equally weights theoretical and empirical performance. Do you think that optimization people would be happy with algorithms that were not fully automatic? Parameter-free optimization algorithms are actually the norm and nobody would call them “parameter-free”. Instead, they would just say that this is the right way to do it. And if you think about it, why would you want an optimization algorithm to have parameters? When was the last time you had to set a learning rate to invert a matrix in MATLAB/Octave/NumPy?

If you wonder why line search algorithms are not common in the machine learning literature, the reason is that we tend to prefer stochastic optimization methods, where the line search becomes non-trivial. Indeed, there are papers on this issue, but it is still an open problem. So, in the stochastic setting, we will have to use different techniques.

Now, we know what the condition number is and we know how it influences the convergence speed. Can we do something to converge faster in the case of functions with a high condition number? Yes, but we have to pay a cost. Intuitively, we could just do a change of coordinates in the examples above, to go from the difficult case to the easy one. What is the optimal transformation? It is the one that makes the new Hessian as close as possible to the identity matrix. And this is exactly what Newton’s algorithm does! So, under some smoothness conditions on the objective function, we can expect Newton’s algorithm to have a convergence rate that is independent of the condition number. Indeed, on convex quadratic functions, Newton’s algorithm will always converge to the minimum in one step. However, we have to pay the computational price of running it: calculating the Hessian, inverting it, etc. Of course, we could use quasi-Newton algorithms that approximate the Newton update to keep the computational complexity low, but that is another story.
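The one-step behavior on quadratics is easy to verify. A sketch, assuming an arbitrary (ill-conditioned) quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with made-up numbers:

```python
import numpy as np

# One Newton step x - H^{-1} grad f(x) lands exactly on the minimizer
# of a convex quadratic, whatever the condition number of A.
A = np.array([[2.0, 0.5],
              [0.5, 100.0]])
b = np.array([1.0, -1.0])
x = np.array([10.0, 10.0])
grad = A @ x - b
x_newton = x - np.linalg.solve(A, grad)  # Newton step: solve H d = grad
x_star = np.linalg.solve(A, b)           # true minimizer A^{-1} b
```

Solving the linear system (rather than forming the inverse) is the computational price the text mentions: it scales cubically with the dimension.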


Subgradient descent (SD) is a very simple algorithm for minimizing a non-differentiable convex function. It proceeds iteratively, updating the vector $x_t$ by moving in the direction of a negative subgradient $g_t$, scaled by a positive factor $\eta_t$ called the stepsize. Also, we project the result onto the feasible set $V$. The pseudocode is in Algorithm 1. It is similar to the gradient descent algorithm, but it has some important conceptual differences.

First, here we do not assume differentiability of the function, so we only have access to a subgradient $g$ of $f$ at any point $x$ in $V$, i.e. $g \in \partial f(x)$. Note that this situation is more common than one would think. For example, the hinge loss, $f(w) = \max(1 - y \langle z, w \rangle, 0)$, and the ReLU activation function used in neural networks, $f(x) = \max(x, 0)$, are not differentiable.

A second difference is that *SD is not a descent method*, that is, it can happen that $f(x_{t+1}) > f(x_t)$. The objective function can stay the same or even increase over the iterates, no matter what stepsize is used. In fact, a common misunderstanding is that in SD methods the subgradient tells us in which direction to go in order to decrease the value of the function. This wrong intuition often comes from thinking about one-dimensional functions. Also, this behavior makes the choice of the stepsizes critical to obtain convergence.

Let’s analyze these issues one at a time.

**Subgradient.** Let’s first define formally what a subgradient is. For a function $f: V \to \mathbb{R}$, we define a subgradient of $f$ in $x \in V$ as a vector $g$ that satisfies

$$f(y) \ge f(x) + \langle g, y - x \rangle, \quad \forall y \in V.$$

Basically, a subgradient of $f$ in $x$ is any vector $g$ that allows us to construct a linear lower bound to $f$. Note that the subgradient is not unique, so we denote the *set* of subgradients of $f$ in $x$ by $\partial f(x)$. If the function is convex, differentiable in $x$, and $f$ is finite in $x$, we have that the subdifferential is composed of a unique element equal to $\nabla f(x)$ (Rockafellar, R. T., 1970)[Theorem 25.1].
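The defining inequality is easy to check numerically. A sketch using the ReLU mentioned above, where at the kink $x = 0$ any $g \in [0, 1]$ is a valid subgradient (the choice $g = 0.5$ below is arbitrary):

```python
# Check the subgradient inequality f(y) >= f(x) + g*(y - x) for the
# ReLU f(x) = max(x, 0) at the non-differentiable point x = 0.
def relu(x):
    return max(x, 0.0)

def relu_subgradient(x):
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return 0.5  # one of the infinitely many valid choices at x = 0

g = relu_subgradient(0.0)
holds = all(relu(y) >= relu(0.0) + g * y for y in [-2.0, -0.5, 0.0, 0.5, 2.0])
```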

Now, we can take a look at the following examples that illustrate the fact that a subgradient does not always point in a direction where the function decreases.

Consider the function in Figure 1. The vector shown is a subgradient of the function at the marked point. No matter how we choose the stepsize, moving in the direction of the negative subgradient will not decrease the objective function. An even more extreme example is the function in Figure 2: there, any positive step in the direction of the negative subgradient from the marked point will *increase* the objective function.

Another effect of the fact that the objective function is non-differentiable is that we cannot use a constant stepsize. This is easy to visualize, even for one-dimensional functions. Take a look for example at the function in Figure 3(left). For any fixed stepsize and any number of iterations, we will never converge to the minimum of the function. Hence, if you want to converge to the minimum, you *must* use a decreasing stepsize. The same does not happen for smooth functions, for example in Figure 3(right), where the same constant stepsize gives rise to an automatic slowdown in the vicinity of the minimum, because the magnitude of the gradient decreases.
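The non-convergence with a constant stepsize can be reproduced in a few lines; a sketch on the absolute value, $f(x) = |x|$, as a stand-in for the kind of kinked function in Figure 3 (left):

```python
# Constant-stepsize subgradient descent on f(x) = |x| oscillates around
# the minimum forever instead of converging to it.
def abs_subgradient(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x, eta = 0.35, 0.2
iterates = []
for _ in range(50):
    iterates.append(x)
    x = x - eta * abs_subgradient(x)
```

After a few steps the iterates bounce between two points on either side of the minimum, and no amount of extra iterations fixes it: the magnitude of the subgradient does not shrink near the kink.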

**Reinterpreting the Subgradient Descent Algorithm.** Given that SD does not work by moving towards the minimum, let’s see how it actually works. First, we have to correctly interpret its behavior.

A way to understand how SD algorithms work is to think that they minimize a local approximation of the original objective function. This is not unusual for optimization algorithms; for example, Newton’s algorithm constructs an approximation with a Taylor expansion truncated at the second-order term. Thanks to the definition of subgradients, we can immediately build a linear lower bound to the function around $x_t$:

$$f(y) \ge f(x_t) + \langle g_t, y - x_t \rangle, \quad \forall y \in V.$$

Unfortunately, minimizing a linear function is unlikely to give us a good optimization algorithm. Indeed, over unbounded domains the minimum of a linear function can be $-\infty$. Hence, the intuitive idea is to constrain the minimization of this lower bound to a neighborhood of $x_t$, where we know that the approximation is more precise. Encoding the neighborhood constraint as an L2 squared distance from $x_t$ less than some positive number $h$, we might think to use the following update:

$$x_{t+1} = \operatorname{argmin}_{y \in V, \ \|y - x_t\|_2^2 \le h} \ f(x_t) + \langle g_t, y - x_t \rangle.$$

Equivalently, for some $\eta_t > 0$, we can consider the unconstrained formulation

$$x_{t+1} = \operatorname{argmin}_{y \in V} \ f(x_t) + \langle g_t, y - x_t \rangle + \frac{1}{2 \eta_t} \|y - x_t\|_2^2. \qquad (1)$$

This is a well-defined update scheme that hopefully moves $x_{t+1}$ closer to the optimum of $f$. See Figure 4 for a graphical representation in one dimension.

Solving the argmin and completing the square, we get

$$x_{t+1} = \Pi_V(x_t - \eta_t g_t),$$

where $\Pi_V$ is the Euclidean projection onto $V$, i.e. $\Pi_V(x) = \operatorname{argmin}_{y \in V} \|x - y\|_2$. This is exactly the update of SD in Algorithm 1.
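The projected update can be sketched in a few lines. The L1 objective, box constraint, and stepsizes below are illustrative assumptions, chosen so that the projection is a simple clipping:

```python
import numpy as np

def projected_subgradient_descent(subgrad, project, x0, stepsizes):
    # x_{t+1} = Pi_V(x_t - eta_t * g_t); return every iterate.
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for eta in stepsizes:
        x = project(x - eta * subgrad(x))
        iterates.append(x.copy())
    return iterates

# Minimize f(x) = ||x||_1 over the box V = [1, 2] x [-2, 2];
# the minimum value is 1, attained at (1, 0).
subgrad = lambda x: np.sign(x)                         # a subgradient of ||.||_1
project = lambda x: np.clip(x, [1.0, -2.0], [2.0, 2.0])
etas = [1.0 / np.sqrt(t + 1.0) for t in range(1000)]
iterates = projected_subgradient_descent(subgrad, project, [2.0, 2.0], etas)
best = min(float(np.abs(z).sum()) for z in iterates)
```

Consistent with the discussion above, it is the *best* iterate (not the last) that approaches the optimal value, and a decreasing stepsize is used.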

However, we can also obtain a different closed-form update. Ignoring the terms constant with respect to $y$, from (1) we obtain the update step of SD in Algorithm 2:

$$x_{t+1} = \operatorname{argmin}_{y \in V} \ \langle \eta_t g_t, y \rangle + \frac{1}{2} \|y - x_t\|_2^2.$$

This will be useful when we extend subgradient descent to norms other than L2.

**Convergence Analysis.** We have shown how to correctly interpret SD. Yet, this interpretation is not a proof of its convergence. Indeed, the interpretation above does not say anything about the stepsizes. So, here we will finally show its convergence guarantee.

We can now show the convergence guarantee for SD. In particular, we want to show that

$$\min_{1 \le t \le T} f(x_t) - \min_{x \in V} f(x) = O\!\left(\frac{1}{\sqrt{T}}\right),$$

which is optimal for this setting. Keeping in mind that SD is not a descent method, the proof of convergence *cannot* aim at proving that at each step we decrease the function by a minimum amount. Instead, we show that the suboptimality gap at each step is upper bounded by a term whose sum grows sublinearly. This means that the average, and hence the minimum, of the suboptimality gaps decreases with the number of iterations.

First, we show the following two lemmas. The first one proves that Euclidean projections always decrease the distance to points inside the set.

Proposition 1. Let $x' = \Pi_V(x)$ and $y \in V$, where $V$ is a closed convex set. Then, $\|x' - y\|_2 \le \|x - y\|_2$.

*Proof:* By the convexity of $V$, we have that

$$x' + \alpha (y - x') \in V$$

for all $\alpha \in [0, 1]$. Hence, from the optimality of $x'$, we have

$$\|x - x'\|_2^2 \le \|x - x' - \alpha (y - x')\|_2^2 = \|x - x'\|_2^2 - 2 \alpha \langle x - x', y - x' \rangle + \alpha^2 \|y - x'\|_2^2.$$

Rearranging, we obtain

$$2 \langle x - x', y - x' \rangle \le \alpha \|y - x'\|_2^2.$$

Taking the limit $\alpha \to 0$, we get that

$$\langle x - x', y - x' \rangle \le 0.$$

Therefore,

$$\|x - y\|_2^2 = \|x - x'\|_2^2 + 2 \langle x - x', x' - y \rangle + \|x' - y\|_2^2 \ge \|x' - y\|_2^2.$$

The next lemma upper bounds the suboptimality gap in one iteration.

Lemma 2. Let $V$ be a closed non-empty convex set and $f$ a convex function from $V$ to $\mathbb{R}$. Set $x_{t+1} = \Pi_V(x_t - \eta_t g_t)$, where $g_t \in \partial f(x_t)$ and $\eta_t > 0$. Then, $\forall y \in V$, the following inequality holds:

$$2 \eta_t (f(x_t) - f(y)) \le 2 \eta_t \langle g_t, x_t - y \rangle \le \|x_t - y\|_2^2 - \|x_{t+1} - y\|_2^2 + \eta_t^2 \|g_t\|_2^2.$$

*Proof:* From Proposition 1 and the definition of subgradient, we have that

$$\|x_{t+1} - y\|_2^2 \le \|x_t - \eta_t g_t - y\|_2^2 = \|x_t - y\|_2^2 - 2 \eta_t \langle g_t, x_t - y \rangle + \eta_t^2 \|g_t\|_2^2 \le \|x_t - y\|_2^2 - 2 \eta_t (f(x_t) - f(y)) + \eta_t^2 \|g_t\|_2^2.$$

Reordering, we have the stated bound.

Theorem 3. Let $V$ be a closed non-empty convex set with diameter $D$, i.e. $\max_{x, y \in V} \|x - y\|_2 \le D$. Let $f$ be a convex function from $V$ to $\mathbb{R}$. Set $x_1 \in V$ and let the stepsizes satisfy $\eta_1 \ge \eta_2 \ge \dots \ge \eta_T > 0$. Then, with $y^\star \in \operatorname{argmin}_{y \in V} f(y)$, the following convergence bounds hold:

$$f\!\left(\frac{1}{T} \sum_{t=1}^T x_t\right) - f(y^\star) \le \frac{D^2}{2 T \eta_T} + \frac{1}{2T} \sum_{t=1}^T \eta_t \|g_t\|_2^2,$$

$$f\!\left(\frac{\sum_{t=1}^T \eta_t x_t}{\sum_{t=1}^T \eta_t}\right) - f(y^\star) \le \frac{\|x_1 - y^\star\|_2^2 + \sum_{t=1}^T \eta_t^2 \|g_t\|_2^2}{2 \sum_{t=1}^T \eta_t},$$

$$\min_{1 \le t \le T} f(x_t) - f(y^\star) \le \frac{\|x_1 - y^\star\|_2^2 + \sum_{t=1}^T \eta_t^2 \|g_t\|_2^2}{2 \sum_{t=1}^T \eta_t}.$$

*Proof:* Dividing the inequality in Lemma 2 by $2 \eta_t$ and summing over $t = 1, \dots, T$, we have

$$\sum_{t=1}^T (f(x_t) - f(y)) \le \sum_{t=1}^T \left(\frac{1}{2 \eta_t} \|x_t - y\|_2^2 - \frac{1}{2 \eta_t} \|x_{t+1} - y\|_2^2\right) + \sum_{t=1}^T \frac{\eta_t}{2} \|g_t\|_2^2 \le \frac{D^2}{2 \eta_T} + \sum_{t=1}^T \frac{\eta_t}{2} \|g_t\|_2^2,$$

where the last inequality uses $\|x_t - y\|_2 \le D$ and the fact that the stepsizes are non-increasing. Dividing both sides by $T$, we have

$$\frac{1}{T} \sum_{t=1}^T (f(x_t) - f(y)) \le \frac{D^2}{2 T \eta_T} + \frac{1}{2T} \sum_{t=1}^T \eta_t \|g_t\|_2^2.$$

This is almost a convergence rate, but we need to extract one of the $x_t$ in some way. Hence, we just use Jensen’s inequality on the lhs:

$$f\!\left(\frac{1}{T} \sum_{t=1}^T x_t\right) \le \frac{1}{T} \sum_{t=1}^T f(x_t).$$

For the second statement, consider the inequality in Lemma 2, divide both sides by 2, and sum over $t = 1, \dots, T$:

$$\sum_{t=1}^T \eta_t (f(x_t) - f(y)) \le \frac{1}{2} \|x_1 - y\|_2^2 + \frac{1}{2} \sum_{t=1}^T \eta_t^2 \|g_t\|_2^2.$$

Then, again use Jensen’s inequality, this time with the weights $\eta_t / \sum_{i=1}^T \eta_i$.

For the third statement, use the fact that

$$\min_{1 \le t \le T} f(x_t) - f(y) \le \frac{\sum_{t=1}^T \eta_t (f(x_t) - f(y))}{\sum_{t=1}^T \eta_t}.$$

We can immediately observe a few things. First, the best-known convergence bound for SD (the first bound in the theorem) needs a bounded domain $V$. In most machine learning applications, this assumption is false. In general, we should be very worried when we use a theorem whose assumptions are clearly false. It is like using a new car that is guaranteed to work only when the outside temperature is above 68 degrees Fahrenheit, and in the other cases “it seems to work, but we are not sure about it”. Would you drive it?

So, for unbounded domains, we have to rely on the lesser-known second and third guarantees, see, e.g., Zhang, T. (2004).

The other observation is that the above guarantees *do not* justify the common heuristic of taking the last iterate. Indeed, as we said, SD is not a descent method, so the last iterate is not guaranteed to be the best one. In the following posts, we will see that we can still prove that the last iterate converges, even in the unbounded case, but the convergence rate gets an additional logarithmic factor.

Another important use of the convergence guarantee is to choose the stepsizes. Indeed, it is the only guideline we have: any choice that is not justified by a convergence analysis is not justified at all. A simple choice is to find the constant stepsize that minimizes the bounds for a fixed number of iterations. For simplicity, let’s assume that the domain is bounded. So, in all cases, we have to consider the expression

$$\frac{D^2}{2 \eta T} + \frac{\eta}{2 T} \sum_{t=1}^T \|g_t\|_2^2$$

and minimize it with respect to $\eta$. It is easy to see that the optimal $\eta$ is $\frac{D}{\sqrt{\sum_{t=1}^T \|g_t\|_2^2}}$. See how that gives the bound

$$\frac{D \sqrt{\sum_{t=1}^T \|g_t\|_2^2}}{T}.$$

However, we have a problem: in order to use this stepsize, we should know the future subgradients! This is exactly the core idea of adaptive algorithms: how to obtain the best possible guarantee without knowing impossible quantities?

We will see many adaptive strategies, but, for the moment, we can observe that we might be happy to minimize a looser upper bound. In particular, assuming that the function is $L$-Lipschitz, we have that $\|g_t\|_2 \le L$. Hence, we can set

$$\eta = \frac{D}{L \sqrt{T}},$$

that gives a convergence rate of $O\!\left(\frac{D L}{\sqrt{T}}\right)$.
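This stepsize can be sanity-checked numerically; a sketch on $f(x) = |x|$ over $V = [-1, 1]$, so $D = 2$, $L = 1$, and $f^\star = 0$:

```python
import math

T = 10000
D, L = 2.0, 1.0
eta = D / (L * math.sqrt(T))                  # the stepsize from the text
x, best_gap = 1.0, float("inf")
for _ in range(T):
    best_gap = min(best_gap, abs(x))          # f* = 0, so the gap is |x|
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    x = min(1.0, max(-1.0, x - eta * g))      # projection onto [-1, 1]
bound = D * L / math.sqrt(T)                  # the theoretical rate
```

The best suboptimality gap over the run indeed stays below $D L / \sqrt{T}$.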

I thought many times about opening a blog to discuss research in machine learning, but I never had enough motivation to start one. This time I just did it, before allowing my conscious mind to realize how much effort would be needed…

This blog has two aims: dissemination and advertising.

Dissemination: I would like to explain some concepts in Machine Learning and Optimization that I see misunderstood over and over again in published papers. These misunderstandings are typically based on too much trust in intuition and not enough knowledge of math. Hence, here I will try to discuss the shortcomings of intuition in thinking about math problems and the *need for mathematical formalism as a tool to solve difficult problems.* I know this sentence sounds weird to, for example, many ML practitioners, but I really believe that using mathematics makes the problems easier, not more difficult. I hope to be able to prove it in these pages.

Advertising: well, this should be obvious, I will simply talk about my research. This is what 90% of the ML blogs I have seen around actually do. So, there is no reason to lie about it: you’ll see posts on my research as well.

One note: I am Italian, so there will be English errors. On the other hand, my mathematics is usually better than my English.
