Subgradient descent (SD) is a very simple algorithm for minimizing a non-differentiable convex function. It proceeds iteratively, updating the vector $x_t$ by moving in the direction of a negative subgradient $g_t$, scaled by a positive factor $\eta_t$ called the stepsize, and then projecting onto the feasible set $V$. The pseudocode is in Algorithm 1. It is similar to the gradient descent algorithm, but it has some important conceptual differences.
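As a concrete companion to the pseudocode, here is a minimal Python sketch of projected subgradient descent; the function names and the toy example $f(x)=|x|$ over $V=[-1,1]$ are illustrative choices, not part of Algorithm 1 itself.

```python
import numpy as np

def projected_subgradient_descent(subgrad, project, x0, stepsizes):
    """Minimal sketch of projected subgradient descent.

    subgrad(x)  -> any element of the subdifferential of f at x
    project(x)  -> Euclidean projection of x onto the feasible set V
    stepsizes   -> iterable of positive stepsizes eta_t
    """
    x = np.asarray(x0, dtype=float)
    iterates = [x]
    for eta in stepsizes:
        g = subgrad(x)               # g_t in the subdifferential of f at x_t
        x = project(x - eta * g)     # x_{t+1} = Pi_V(x_t - eta_t * g_t)
        iterates.append(x)
    return iterates

# Toy example: minimize f(x) = |x| over V = [-1, 1].
etas = [1.0 / np.sqrt(t) for t in range(1, 51)]
xs = projected_subgradient_descent(
    subgrad=lambda x: np.sign(x),          # a subgradient of |x| (0 also valid at 0)
    project=lambda x: np.clip(x, -1.0, 1.0),
    x0=np.array([0.9]),
    stepsizes=etas,
)
```

Note that the best iterate, not the last one, approaches the minimum: a preview of the fact, discussed below, that SD is not a descent method.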
First, here we do not assume differentiability of the function, so we only have access to a subgradient of $f$ at any point $x$ in $V$, i.e., an element of $\partial f(x)$. Note that this situation is more common than one would think. For example, the hinge loss, $f(x) = \max(1 - y\langle z, x\rangle, 0)$, and the ReLU activation function used in neural networks, $f(x) = \max(x, 0)$, are not differentiable.
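For instance, valid subgradients for these two functions can be computed in a few lines of plain Python; the tie-breaking choices at the non-differentiable points are ours, since any element of the subdifferential works.

```python
def relu_subgrad(x):
    # For f(x) = max(x, 0): the gradient is 1 for x > 0 and 0 for x < 0;
    # at x = 0 any value in [0, 1] is a valid subgradient (we pick 0).
    return 1.0 if x > 0 else 0.0

def hinge_subgrad(w, z, y):
    # For f(w) = max(1 - y * <z, w>, 0) on a single sample (z, y):
    # a subgradient is -y*z when the margin is violated, 0 when it holds
    # strictly; at the kink both choices are valid (we pick -y*z).
    margin = 1 - y * sum(zi * wi for zi, wi in zip(z, w))
    return [-y * zi if margin >= 0 else 0.0 for zi in z]
```

One can check directly that these choices satisfy the subgradient inequality $f(v) \ge f(x) + \langle g, v - x\rangle$ defined below.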
A second difference is that SD is not a descent method, that is, it can happen that $f(x_{t+1}) > f(x_t)$. The objective function can stay the same or even increase over the iterates, no matter what stepsize is used. In fact, a common misunderstanding is that in SD methods the subgradient tells us in which direction to go in order to decrease the value of the function. This wrong intuition often comes from thinking about one-dimensional functions. This behavior also makes the choice of the stepsizes critical to obtain convergence.
Let’s analyze these issues one at a time.
Subgradient. Let’s first define formally what a subgradient is. For a function $f: \mathbb{R}^d \to (-\infty, +\infty]$, we define a subgradient of $f$ at $x \in \mathbb{R}^d$ as a vector $g \in \mathbb{R}^d$ that satisfies

$$f(y) \geq f(x) + \langle g, y - x\rangle, \quad \forall y \in \mathbb{R}^d.$$
Basically, a subgradient of $f$ at $x$ is any vector $g$ that allows us to construct a linear lower bound to $f$. Note that the subgradient is not necessarily unique, so we denote the set of subgradients of $f$ at $x$, the subdifferential, by $\partial f(x)$. If the function is convex, differentiable at $x$, and finite at $x$, the subdifferential is composed of a unique element equal to $\nabla f(x)$ (Rockafellar, R. T., 1970)[Theorem 25.1].
Now, we can take a look at the following examples, which illustrate the fact that a subgradient does not always point in a direction along which the function decreases.
Consider the function in Figure 1 and a subgradient $g$ of $f$ at the point $x$. No matter how we choose the stepsize, moving in the direction of the negative subgradient $-g$ will not decrease the objective function. An even more extreme example is the function in Figure 2: there, at the indicated point, any positive step in the direction of the negative subgradient will increase the objective function.
Another effect of the non-differentiability of the objective function is that we cannot use a constant stepsize. This is easy to visualize, even for one-dimensional functions. Take, for example, the function in Figure 3 (left). For any fixed stepsize and any number of iterations, we will never converge to the minimum of the function. Hence, if you want to converge to the minimum, you must use a decreasing stepsize. The same does not happen for smooth functions, as in Figure 3 (right), where the same constant stepsize gives rise to an automatic slowdown in the vicinity of the minimum, because the magnitude of the gradient decreases.
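We can verify this behavior numerically on a simple non-smooth function; the example $f(x) = |x|$ with subgradient $\operatorname{sign}(x)$ and the specific stepsize constants are our own illustrative choices.

```python
import numpy as np

def run(x0, stepsize, n):
    # Subgradient steps on f(x) = |x|, using the subgradient sign(x).
    x = x0
    for t in range(n):
        x -= stepsize(t) * np.sign(x)
    return x

# Constant stepsize: the iterate ends up bouncing around the minimum
# at a distance proportional to the stepsize, and never converges.
const = [run(1.0, lambda t: 0.3, n) for n in (100, 101)]

# Decreasing stepsize eta_t ~ 1/sqrt(t+1): the oscillation shrinks
# towards zero as the stepsize decreases.
dec = run(1.0, lambda t: 0.3 / np.sqrt(t + 1), 10_000)
```

With the constant stepsize the iterate oscillates forever around the minimum, while the decreasing stepsize drives it arbitrarily close.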
Reinterpreting the Subgradient Descent Algorithm. Given that SD does not work by moving towards the minimum, let’s see how it actually works. So, first we have to correctly interpret its behavior.
A way to understand how SD algorithms work is to think that they minimize a local approximation of the original objective function. This is not unusual for optimization algorithms; for example, Newton’s algorithm constructs an approximation with a Taylor expansion truncated at the second-order term. Thanks to the definition of subgradients, we can immediately build a linear lower bound to the function around $x_t$:

$$f(y) \geq f(x_t) + \langle g_t, y - x_t\rangle, \quad \forall y.$$
Unfortunately, minimizing a linear function is unlikely to give us a good optimization algorithm. Indeed, over unbounded domains the minimum of a linear function can be $-\infty$. Hence, the intuitive idea is to constrain the minimization of this lower bound to a neighborhood of $x_t$, where we know that the approximation is more precise. Encoding the neighborhood constraint as a penalty on the squared L2 distance from $x_t$, weighted by the inverse of a positive number $\eta_t$, we might think to use the following update:

$$x_{t+1} = \operatorname*{argmin}_{y \in V} \; f(x_t) + \langle g_t, y - x_t\rangle + \frac{1}{2\eta_t}\|y - x_t\|_2^2. \qquad (1)$$
This is a well-defined update scheme that hopefully moves $x_{t+1}$ closer to the optimum of $f$. See Figure 4 for a graphical representation in one dimension.
Solving the argmin and completing the square, we get

$$x_{t+1} = \operatorname*{argmin}_{y \in V} \; \frac{1}{2}\|y - (x_t - \eta_t g_t)\|_2^2 = \Pi_V(x_t - \eta_t g_t),$$

where $\Pi_V$ is the Euclidean projection onto $V$, i.e., $\Pi_V(x) = \operatorname*{argmin}_{y \in V} \|y - x\|_2$. This is exactly the update of SD in Algorithm 1.
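As a quick sanity check, we can verify numerically in one dimension that the regularized minimization and the projected step coincide; the interval $V = [0, 2]$ and the brute-force grid are illustrative choices, not part of the derivation.

```python
import numpy as np

def regularized_update(x, g, eta, lo=0.0, hi=2.0):
    # Minimize  g*y + (y - x)^2 / (2*eta)  over V = [lo, hi]
    # (the update (1) with the constant terms dropped),
    # by brute force on a fine grid: slow, but assumption-free.
    ys = np.linspace(lo, hi, 200_001)
    return ys[np.argmin(g * ys + (ys - x) ** 2 / (2 * eta))]

def projected_step(x, g, eta, lo=0.0, hi=2.0):
    # Closed form: x_{t+1} = Pi_V(x_t - eta_t * g_t).
    return min(max(x - eta * g, lo), hi)
```

Both the interior case and the case where the projection clips to the boundary give the same point up to the grid resolution.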
However, we can also obtain a different closed-form update. Ignoring the terms constant with respect to $y$, from (1) we obtain the update step of SD in Algorithm 2. This will be useful when we extend subgradient descent to norms other than L2.
Convergence Analysis. We have shown how to correctly interpret SD. Yet, this interpretation is not a proof of its convergence. Indeed, the interpretation above does not say anything about the stepsizes. So, here we will finally show its convergence guarantee.
We can now show the convergence guarantee for SD. In particular, we want to show that

$$\min_{1 \le t \le T} f(x_t) - \min_{x \in V} f(x) \le O\!\left(\frac{1}{\sqrt{T}}\right),$$

which is optimal for this setting. Keeping in mind that SD is not a descent method, the proof of convergence cannot aim at showing that at each step we decrease the function by a minimum amount. Instead, we show that the suboptimality gap at each step is upper bounded by a term whose sum grows sublinearly in $T$. This means that the average, and hence the minimum, of the suboptimality gaps decreases with the number of iterations.
First, we show the following two lemmas. The first lemma proves that Euclidean projections always decrease the distance to points inside the set: for all $x$ and all $y \in V$, $\|\Pi_V(x) - y\|_2 \le \|x - y\|_2$.
Proof: Denote $x' = \Pi_V(x)$. By the convexity of $V$, we have that $x' + \alpha(y - x') \in V$ for all $y \in V$ and $\alpha \in [0,1]$. Hence, from the optimality of $x'$, we have

$$\|x - x'\|_2^2 \le \|x - x' - \alpha(y - x')\|_2^2 = \|x - x'\|_2^2 - 2\alpha \langle x - x', y - x'\rangle + \alpha^2 \|y - x'\|_2^2.$$

Rearranging, we obtain

$$\langle x - x', y - x'\rangle \le \frac{\alpha}{2}\|y - x'\|_2^2.$$

Taking the limit $\alpha \to 0$, we get that $\langle x - x', y - x'\rangle \le 0$. Therefore,

$$\|x - y\|_2^2 = \|x - x'\|_2^2 + \|x' - y\|_2^2 + 2\langle x - x', x' - y\rangle \ge \|x' - y\|_2^2,$$

which gives the stated inequality.
The next lemma upper bounds the suboptimality gap in one iteration: for any $x^* \in V$,

$$2\eta_t \left(f(x_t) - f(x^*)\right) \le \|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2 + \eta_t^2 \|g_t\|_2^2.$$
Proof: From Lemma 1 and the definition of subgradient, we have that

$$\|x_{t+1} - x^*\|_2^2 \le \|x_t - \eta_t g_t - x^*\|_2^2 = \|x_t - x^*\|_2^2 - 2\eta_t \langle g_t, x_t - x^*\rangle + \eta_t^2 \|g_t\|_2^2 \le \|x_t - x^*\|_2^2 - 2\eta_t \left(f(x_t) - f(x^*)\right) + \eta_t^2 \|g_t\|_2^2.$$

Reordering, we have the stated bound.
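A quick numerical check of this per-step inequality, $2\eta_t\left(f(x_t) - f(x^*)\right) \le \|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2 + \eta_t^2 \|g_t\|_2^2$, on a simple instance of our choosing: $f(x) = \|x\|_1$ over $V = \mathbb{R}^3$, minimized at $x^* = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
x_star = np.zeros(3)   # minimizer of f(x) = ||x||_1 over R^3

for _ in range(1000):
    x = rng.normal(size=3)
    eta = rng.uniform(0.01, 1.0)
    g = np.sign(x)                 # a subgradient of ||.||_1 at x
    x_next = x - eta * g           # no projection needed: V = R^3
    lhs = 2 * eta * (np.sum(np.abs(x)) - np.sum(np.abs(x_star)))
    rhs = (np.sum((x - x_star) ** 2) - np.sum((x_next - x_star) ** 2)
           + eta ** 2 * np.sum(g ** 2))
    assert lhs <= rhs + 1e-9, "per-step bound violated"
```

On this particular instance the two sides actually coincide up to rounding, since the subgradient inequality for $\|\cdot\|_1$ is tight at $x^* = 0$.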
Proof: Dividing the inequality in Lemma 2 by $2\eta_t$ and summing over $t = 1, \dots, T$, we have

$$\sum_{t=1}^T \left(f(x_t) - f(x^*)\right) \le \sum_{t=1}^T \frac{\|x_t - x^*\|_2^2 - \|x_{t+1} - x^*\|_2^2}{2\eta_t} + \sum_{t=1}^T \frac{\eta_t}{2}\|g_t\|_2^2.$$
Dividing both sides by $T$, we have
This is almost a convergence rate, but we need to extract one of the $x_t$ in some way. Hence, we just use Jensen’s inequality on the lhs:

$$f\!\left(\frac{1}{T}\sum_{t=1}^T x_t\right) \le \frac{1}{T}\sum_{t=1}^T f(x_t).$$
For the second statement, consider the inequality in Lemma 2, divide both sides by 2, and sum over $t = 1, \dots, T$.
Then, again use Jensen’s inequality.
For the third statement, use the fact that the minimum is upper bounded by any weighted average:

$$\min_{1 \le t \le T} f(x_t) - f(x^*) \le \frac{\sum_{t=1}^T \eta_t \left(f(x_t) - f(x^*)\right)}{\sum_{t=1}^T \eta_t}.$$
We can immediately observe a few things. First, the best-known convergence bound for SD (the first bound in the theorem) needs a bounded domain $V$. In most machine learning applications, this assumption is false. In general, we should be very worried when we use a theorem whose assumptions are clearly false. It is like using a new car that is guaranteed to work only when the outside temperature is above 68 degrees Fahrenheit, and in the other cases “it seems to work, but we are not sure about it”. Would you drive it?
So, for unbounded domains, we have to rely on the lesser-known second and third guarantees, see, e.g., Zhang, T. (2004).
The other observation is that the above guarantees do not justify the common heuristic of taking the last iterate. Indeed, as we said, SD is not a descent method, so the last iterate is not guaranteed to be the best one. In the following posts, we will see that we can still prove that the last iterate converges, even in the unbounded case, but the convergence rate gains an additional logarithmic factor.
Another important observation is that we should use the convergence guarantee to choose the stepsizes. Indeed, it is the only guideline we have: any choice that is not justified by a convergence analysis is not justified at all. A simple choice is to find the constant stepsize $\eta$ that minimizes the bound for a fixed number of iterations $T$. For simplicity, let’s assume that the domain is bounded, with diameter $D$. So, in all cases, we have to consider the expression

$$\frac{D^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_2^2,$$

which is minimized by $\eta = \frac{D}{\sqrt{\sum_{t=1}^T \|g_t\|_2^2}}$.
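Explicitly, the calculation behind the optimal constant stepsize is a standard one, assuming a bounded domain of diameter $D$:

```latex
\frac{d}{d\eta}\left[\frac{D^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \|g_t\|_2^2\right]
= -\frac{D^2}{2\eta^2} + \frac{1}{2}\sum_{t=1}^T \|g_t\|_2^2 = 0
\quad\Longrightarrow\quad
\eta^\star = \frac{D}{\sqrt{\sum_{t=1}^T \|g_t\|_2^2}} .
```

Plugging $\eta^\star$ back in, both terms become equal and the expression evaluates to $D\sqrt{\sum_{t=1}^T \|g_t\|_2^2}$.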
However, we have a problem: in order to use this stepsize, we should know the future subgradients! This is exactly the core idea of adaptive algorithms: how to obtain the best possible guarantee without knowing impossible quantities?
We will see many adaptive strategies but, for the moment, we can observe that we might be happy minimizing a looser upper bound. In particular, assuming that the function is $L$-Lipschitz, we have that $\|g_t\|_2 \le L$ for all $t$. Hence, we can use the stepsize

$$\eta = \frac{D}{L\sqrt{T}},$$

which gives a convergence rate of $O\!\left(\frac{DL}{\sqrt{T}}\right)$.
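To close the loop, a small experiment with this tuned stepsize; the setup ($f(x) = \|x\|_1$ over $V = [-1, 1]^2$, starting point, horizons) is our own toy choice, meant only to confirm that the best suboptimality gap stays below $\frac{DL}{\sqrt{T}}$.

```python
import numpy as np

def best_gap(T):
    # f(x) = ||x||_1 over V = [-1, 1]^2, minimized at 0 with value 0.
    D = 2 * np.sqrt(2)            # diameter of [-1, 1]^2
    L = np.sqrt(2)                # Lipschitz constant of ||.||_1 w.r.t. L2
    eta = D / (L * np.sqrt(T))    # tuned constant stepsize
    x = np.array([1.0, -1.0])
    best = np.sum(np.abs(x))      # track the best iterate, not the last
    for _ in range(T):
        x = np.clip(x - eta * np.sign(x), -1.0, 1.0)
        best = min(best, np.sum(np.abs(x)))
    return best

gaps = {T: best_gap(T) for T in (100, 400, 1600)}
```

Tracking the best iterate rather than the last one matches the theorem: SD is not a descent method, so only the minimum over the iterates is guaranteed to satisfy the bound.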