Numerical Optimization#

We have seen some strategies to find a solution to optimization objectives by means of solving a system of equations. These strategies usually work fine for simple objectives, but if my optimization objective has a lot of valleys and hills, then solving this system of equations analytically will not always be possible. What can we do then?

If the minimizers cannot be computed directly/analytically, then numerical optimization can come to the rescue. The idea is that I start somewhere in my hilly landscape and then try to walk to a valley with a specified strategy. For those types of methods, it's good to know how good the strategy is. For example, it is important to ask whether I will ever arrive at my minimum if I just walk long enough, or whether I have to wander around endlessly in a bad case. And what happens if I just walk a few steps: will I have improved upon my starting position, or might I not have descended at all? We state here two very popular numerical optimization methods: coordinate descent and gradient descent. Both are presented for the optimization of an unconstrained objective, but they have extensions that incorporate constraints as well. This is beyond the scope of this course, though.
The general scheme of numerical optimization methods typically looks as follows.

Algorithm 1 (Numerical Optimization)

Input: the function \(f\) to minimize, the maximum number of steps \(t_{max}\)

  1. \(\vvec{x}_0\gets\) Initialize(\(\vvec{x}_0\))

  2. for \(t\in\{0,\ldots,t_{max}-1\}\)

    1. \(\vvec{x}_{t+1}\gets \)Update(\(\vvec{x}_t,f\))

  3. return \(\vvec{x}_{t_{max}}\)
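As a minimal Python sketch of this scheme (the names `initialize` and `update` are placeholders that a concrete method has to supply):

```python
def numerical_optimization(f, initialize, update, t_max):
    """Generic scheme: start somewhere, then repeatedly apply an update rule."""
    x = initialize()
    for _ in range(t_max):
        x = update(x, f)  # e.g. a coordinate-wise or gradient step
    return x
```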

Coordinate descent#

The coordinate descent method is promising if we cannot determine the minimum of the function analytically, but we can determine its minimum along each coordinate direction. The update is performed by cycling over all coordinates and, in each step, walking to the minimum with respect to that coordinate.

Algorithm 2 (Coordinate Descent)

Input: the function \(f\) to minimize, the maximum number of steps \(t_{max}\)

  1. \(\vvec{x}^{(0)}\gets\) Initialize(\(\vvec{x}^{(0)}\))

  2. for \(t\in\{0,\ldots,t_{max}-1\}\)

    1. for \(i\in\{1,\ldots,d\}\) do

      1. \(\displaystyle {x_i}^{(t+1)}\leftarrow \argmin_{x_i} f({x_1}^{(t+1)},\ldots ,{x_{i-1}}^{(t+1)}, x_i,{x_{i+1}}^{(t)},\ldots ,{x_d}^{(t)})\)

  3. return \(\vvec{x}^{(t_{max})}\)

The figure below shows the level sets of a function, where every ring indicates the points in the domain of the function that attain a specified function value \(\{\vvec{x}\mid f(\vvec{x})=c\}\). The plotted function has its minimum at the center of the ellipses. We start at \(\vvec{x}_0\) and move to the minimum along the vertical coordinate. That is, we move to the smallest ellipse (smallest diameter) we can touch in the direction of the vertical coordinate. Then we move to the minimum along the horizontal coordinate, and we have already arrived at the minimum, where the method stops.

(Figure made with TikZ: level sets of a function with the coordinate descent trajectory.)

Coordinate descent never increases the function value from one step to the next:

\[ f(\vvec{x}^{(0)})\geq f(\vvec{x}^{(1)})\geq f(\vvec{x}^{(2)})\geq\ldots\]
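Algorithm 2 fits in a few lines of Python. A minimal sketch, where the helper `argmin_coord` is our own device that a concrete problem has to supply:

```python
import numpy as np

def coordinate_descent(argmin_coord, x0, t_max):
    """Cycle over the coordinates; argmin_coord(i, x) returns the minimizer
    of f along coordinate i while all other coordinates of x are held fixed."""
    x = np.asarray(x0, dtype=float)
    for _ in range(t_max):
        for i in range(len(x)):
            x[i] = argmin_coord(i, x)
    return x

# Example: f(x) = (x_1 - 1)^2 + (x_2 + 2)^2 separates over the coordinates,
# so the coordinate-wise minimizer is a constant for each coordinate.
targets = np.array([1.0, -2.0])
print(coordinate_descent(lambda i, x: targets[i], x0=[0.0, 0.0], t_max=3))
```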

Example: Rosenbrock Function#

Let’s try to apply coordinate descent to find the minimum of the Rosenbrock function. From Example 8 we know the partial derivatives of the function

\[\begin{align*} f(\vvec{x})&= 100(x_2-x_1^2)^2 +(1-x_1)^2.\\ \frac{\partial}{\partial x_1}f(\vvec{x})&= 400x_1(x_1^2-x_2) +2(x_1-1)\\ \frac{\partial}{\partial x_2}f(\vvec{x})&= 200(x_2-x_1^2). \end{align*}\]

We compute the minima of the function along the coordinate directions as we would normally compute the minima of a polynomial: set the derivative to zero. Unfortunately, this doesn't work well for the partial derivative with respect to \(x_1\). Symbolic solvers like sympy give three roots of this partial derivative, of which two are complex numbers. The one real-valued root is only defined for \(x_2\) no larger than a constant that is close to zero. Hence, this example doesn't work well for coordinate descent; we nevertheless follow the trajectory until the real root is no longer defined. The minimizer of the function with respect to coordinate \(x_2\) is easily computed by setting the partial derivative to zero:

\[\begin{align*} \frac{\partial}{\partial x_2}f(\vvec{x})&=200(x_2-x_1^2)=0 &\Leftrightarrow x_2 =x_1^2 \end{align*}\]

Hence, we have update rules:

\[\begin{align*} \argmin_{x_1\in\mathbb{R}} f(x_1,x_2) &=\text{magic from sympy} \\ \argmin_{x_2\in\mathbb{R}} f(x_1,x_2) &=x_1^2 \end{align*}\]
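The \(x_1\)-update can be reproduced with sympy. A minimal sketch (the filtering for the real-valued root is our own device):

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = 100 * (x2 - x1**2)**2 + (1 - x1)**2

# sympy returns the three (Cardano) roots of the cubic equation df/dx1 = 0.
roots = sp.solve(sp.diff(f, x1), x1)

def argmin_x1(x2_val):
    """Evaluate the roots at x2_val and keep the (numerically) real one."""
    vals = [r.evalf(subs={x2: x2_val}) for r in roots]
    real = [sp.re(v) for v in vals if abs(sp.im(v)) < 1e-9]
    return float(real[0])

print(argmin_x1(-1.0))  # the x1-update at the starting point (-1.5, -1)
```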

The figure below shows the result of these update rules when starting at \((-1.5,-1)\). We see that in the beginning we make big steps towards the minimizer \((1,1)\), but those steps get smaller and smaller once we are in the valley.

(Figure: coordinate descent trajectory on the Rosenbrock function; _images/optimization_numerical_3_0.png)

Gradient Descent#

If we can't solve the system of equations given by the FONC, not even coordinate-wise, but the function is differentiable, then we can apply gradient descent. Gradient descent is the strategy of taking, in each step, a step in the direction that goes downhill most steeply from where you stand.

Definition 13

The directional derivative of a real-valued function \(f:\mathbb{R}^d\rightarrow \mathbb{R}\) along the direction given by a vector \(\vvec{v}\in\mathbb{R}^d\) is the function

\[\lim_{h\rightarrow 0}\frac{f(\vvec{x}+h\vvec{v})-f(\vvec{x})}{h\lVert\vvec{v}\rVert}.\]

The directional derivative generalizes the concept of the derivative from one dimension to multiple dimensions. The plot below shows a 3D function and a direction vector \(\vvec{v}\) (red) at the point \(\vvec{x}_0=(1,1)\). The directional derivative is then the slope of the tangent at the point \(\vvec{x}_0\) in the direction of \(\vvec{v}\) (plotted on the left).

(Figure: the directional derivative as the slope of the tangent at \(\vvec{x}_0\) in the direction \(\vvec{v}\); _images/optimization_numerical_5_0.png)

We can also visualize the directional derivative as the slope of the tangent at the point \(\vvec{x}_0\) when we project the graph of the 3D function onto the direction given by \(\vvec{v}\). The red vector on the right plot indicates the direction \(\vvec{v}\), and the slope of the orange tangent on the left is the directional derivative in the direction of \(\vvec{v}\).

Also in multiple dimensions, the derivative is a linear function. In fact, the directional derivative can be computed as the inner product of the (normalized) direction vector with the gradient.

Theorem 8

Let \(f:\mathbb{R}^d\rightarrow \mathbb{R}\) be a real-valued function whose partial derivatives exist and are continuous. Then the directional derivative in the direction of a vector \(\vvec{v}\) can be computed as the dot product with the gradient:

\[\lim_{h\rightarrow 0}\frac{f(\vvec{x}+h\vvec{v})-f(\vvec{x})}{h\lVert\vvec{v}\rVert} = \nabla f(\vvec{x})^\top \frac{\vvec{v}}{\lVert\vvec{v}\rVert}\]

Proof. We show the result first for \(d=2\) dimensions.

\[\begin{align*} \frac{f(\vvec{x}+h\vvec{v})-f(\vvec{x})}{h\lVert\vvec{v}\rVert} &=\frac{f(x_1+hv_1,x_2 +hv_2)-f(x_1+hv_1,x_2)+f(x_1+hv_1,x_2)-f(\vvec{x})}{h\lVert\vvec{v}\rVert}\\ &= \frac{f(x_1+hv_1,x_2 +hv_2)-f(x_1+hv_1,x_2)}{h\lVert\vvec{v}\rVert}+\frac{f(x_1+hv_1,x_2)-f(\vvec{x})}{h\lVert\vvec{v}\rVert} \end{align*}\]

The term on the right can be transformed into

\[\begin{align*} \frac{f(x_1+hv_1,x_2)-f(\vvec{x})}{h\lVert\vvec{v}\rVert} \frac{v_1}{v_1} & = \frac{f(x_1+hv_1,x_2)-f(\vvec{x})}{hv_1} \frac{v_1}{\lVert\vvec{v}\rVert} \rightarrow \frac{\partial f(\vvec{x})}{\partial x_1} \frac{v_1}{\lVert\vvec{v}\rVert} \end{align*}\]

for \(h\rightarrow 0\). Now, we still need to transform the first part. To do so, we apply the mean value theorem, stating that there exists some \(0<t<h\) such that

\[\begin{align*} \frac{\partial}{\partial x_2}f(x_1+hv_1,x_2 + tv_2)\frac{v_2}{\lVert\vvec{v}\rVert}=\frac{f(x_1+hv_1,x_2 +hv_2)-f(x_1+hv_1,x_2)}{hv_2} \frac{v_2}{\lVert\vvec{v}\rVert} \end{align*}\]

If we now let \(h\rightarrow 0\), then we also have \(t\rightarrow 0\) and hence \((x_1+hv_1,x_2 + tv_2)\rightarrow (x_1,x_2)\). Since the partial derivatives are continuous, we have for \(h\rightarrow 0\)

\[\frac{\partial}{\partial x_2}f(x_1+hv_1,x_2 + tv_2) \rightarrow \frac{\partial}{\partial x_2}f(x_1,x_2).\]
As a result we have

\[\begin{align*} \frac{f(\vvec{x}+h\vvec{v})-f(\vvec{x})}{h\lVert\vvec{v}\rVert} &\rightarrow \frac{\partial f(\vvec{x})}{\partial x_2} \frac{v_2}{\lVert\vvec{v}\rVert} + \frac{\partial f(\vvec{x})}{\partial x_1} \frac{v_1}{\lVert\vvec{v}\rVert} = \nabla f(\vvec{x})^\top \frac{\vvec{v}}{\lVert\vvec{v}\rVert}. \end{align*}\]

If we have more than two dimensions, then we can repeat this proof structure, inductively splitting off one partial derivative from the directional derivative at a time.
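Theorem 8 is easy to check numerically. A small sketch, with an example function of our own choosing:

```python
import numpy as np

# Example function f(x) = x_1^2 + 3 x_2 with gradient (2 x_1, 3).
f = lambda x: x[0]**2 + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0], 3.0])

x = np.array([1.0, 1.0])
v = np.array([2.0, 1.0])
h = 1e-6

# Difference quotient from Definition 13 vs. inner product from Theorem 8.
lhs = (f(x + h * v) - f(x)) / (h * np.linalg.norm(v))
rhs = grad_f(x) @ (v / np.linalg.norm(v))
print(lhs, rhs)  # both approximately 7 / sqrt(5)
```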

The inner-product form of the directional derivative now implies that the gradient points in the direction of steepest ascent.

Theorem 9

The gradient of a real-valued function points in the direction of steepest ascent.

Proof. We want to find the direction in which the directional derivative is largest. For any direction \(\vvec{v}\), we have

\[\begin{align*} \lim_{h\rightarrow 0}\frac{f(\vvec{x}+h\vvec{v})-f(\vvec{x})}{h\lVert\vvec{v}\rVert} &= \nabla f(\vvec{x})^\top \frac{\vvec{v}}{\lVert\vvec{v}\rVert}\\ &= \cos(\sphericalangle (\vvec{v},\nabla f(\vvec{x}))) \lVert \nabla f(\vvec{x})\rVert \left\lVert \frac{\vvec{v}}{\lVert\vvec{v}\rVert}\right\rVert\\ &= \cos(\sphericalangle (\vvec{v},\nabla f(\vvec{x}))) \lVert \nabla f(\vvec{x})\rVert\\ &\leq \lVert \nabla f(\vvec{x})\rVert, \end{align*}\]

where the inequality stems from the fact that the cosine is at most one, and equality is attained if \(\vvec{v}\) points in the same direction as \(\nabla f(\vvec{x})\).

From the proof of the theorem above, it also follows that the negative gradient points in the direction of steepest descent. This is because the cosine of the angle between a vector \(\vvec{v}\) and \(\nabla f(\vvec{x})\) is minimized if we choose \(\vvec{v}=-\nabla f(\vvec{x})\).

Corollary 1

The negative gradient of a real-valued function points in the direction of steepest descent.
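A quick numerical illustration of the corollary (the gradient vector is an arbitrary example): among many random directions, none descends more steeply, per unit length, than the negative gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([2.0, -1.0])  # an arbitrary example gradient

# Directional derivatives per unit length for many random directions.
dirs = rng.normal(size=(1000, 2))
slopes = dirs @ grad / np.linalg.norm(dirs, axis=1)

# The steepest sampled descent approaches -||grad||, the slope along -grad.
print(slopes.min(), -np.linalg.norm(grad))
```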

We have a look at the gradient direction in our 3D example.

(Figure: the gradient direction in the 3D example; _images/optimization_numerical_9_0.png)

We now define the gradient descent optimization method: in every iteration, we take a step in the direction of steepest descent.

Algorithm 3 (Gradient Descent)

Input: the function \(f\) to minimize, step-size \(\eta\), the maximum number of steps \(t_{max}\)

  1. \(\vvec{x}_0\gets\) Initialize(\(\vvec{x}_0\))

  2. for \(t\in\{0,\ldots,t_{max}-1\}\)

    1. \(\vvec{x}_{t+1}\leftarrow \vvec{x}_t - \eta \nabla f(\vvec{x}_t)\)

  3. return \(\vvec{x}_{t_{max}}\)
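A minimal NumPy sketch of Algorithm 3 (the quadratic test function is our own example):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, t_max):
    """Repeatedly step against the gradient, scaled by the step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(t_max):
        x = x - eta * grad_f(x)
    return x

# Example: f(x) = ||x||^2 has gradient 2x and minimizer (0, 0).
print(gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0], eta=0.1, t_max=100))
```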

The parameter \(\eta\) doesn't have to be a constant; it might also be a function that returns the step size depending on the number of steps that have already been performed. Setting the step size well is often a difficult task. The figure below shows how gradient descent makes its updates based on local information.

(Figure made with TikZ: gradient descent updates along the negative gradient.)
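Coming back to the step size: a schedule is just a function of the iteration counter. A tiny, hypothetical example:

```python
def eta_schedule(t, eta0=0.1, decay=0.01):
    """Hypothetical decaying schedule: larger steps early, smaller steps late."""
    return eta0 / (1 + decay * t)
```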

Gradient descent decreases the function value in every step if the step size is small enough (because the negative gradient points in the direction of steepest descent):

\[f(\vvec{x}_0)\geq f(\vvec{x}_1)\geq f(\vvec{x}_2)\geq\ldots.\]
However, decreasing the function value in every step is not necessarily desirable in practice. In particular at the beginning of the optimization, it's useful to take larger steps to survey the landscape before converging to a local minimum.

Example: Rosenbrock Function#

We illustrate the effect of the step-size by means of the Rosenbrock function. We start at the point \((-1.5,1.5)\) and allow a maximum of 500 iterations. For our first run, we choose a small step-size \(\eta =0.0015\). In the beginning, this step size seems too large, since we can see the trajectory zig-zagging, but then it lands in the valley where the gradient norm is small, and we make only very tiny steps. Within the 500 iterations, we do not reach the minimum at \((1,1)\).
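A sketch of this run (using the partial derivatives computed in the coordinate descent example; plotting is omitted):

```python
import numpy as np

def grad_rosenbrock(x):
    # Partial derivatives of the Rosenbrock function from above.
    return np.array([400 * x[0] * (x[0]**2 - x[1]) + 2 * (x[0] - 1),
                     200 * (x[1] - x[0]**2)])

x = np.array([-1.5, 1.5])
for _ in range(500):
    x = x - 0.0015 * grad_rosenbrock(x)
print(x)  # after 500 steps, still short of the minimizer (1, 1)
```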

(Figure: gradient descent trajectory with \(\eta=0.0015\); _images/optimization_numerical_11_0.png)

We now increase the step-size to \(\eta=0.0035\). We see how the method makes big steps in the beginning, which become smaller when the iterates land near the valley, but the zig-zagging is still strong and we don't make good progress towards the minimizer. For the Rosenbrock function, we need a lot of steps to converge to the minimizer, whereas the analytic computation of the stationary points gives us the optimum directly. The problem with the Rosenbrock function is that gradient descent

  • oscillates in directions with steep gradients (anything that is not in the U-shaped valley of the Rosenbrock function)

  • converges slowly in directions with small gradients (when the iterates are in the valley)

(Figure: gradient descent trajectory with \(\eta=0.0035\); _images/optimization_numerical_13_0.png)

In practice, a convergence behaviour like that on the Rosenbrock function is problematic, since we apply numerical optimization precisely when we can't compute the optimum analytically. If we don't know where the optimum is, we might give up after 500 steps in which not much is happening, because the function value has not decreased by much.

Gradient Descent with Momentum#

We have seen how gradient descent can be slow and inefficient, especially in loss landscapes with narrow canyons, such as in the Rosenbrock function. The idea of momentum is to accelerate learning by accumulating past gradients, building up speed in consistent directions and dampening oscillations.

A common analogy used to explain the idea of momentum is a ball rolling down a hilly landscape. Without momentum, the ball stops at every bump, because we act only according to the gradient, which adjusts locally. With momentum, it builds up speed in downhill directions and glides over small bumps.

Technically, momentum introduces only a small change. Instead of the gradient, we now subtract a velocity vector \(\vvec{v}_{t+1}\), a moving average of the past gradients.

Algorithm 4 (Gradient Descent with Momentum)

Input: the function \(f\) to minimize, step-size \(\eta\), momentum coefficient \(\gamma\), the maximum number of steps \(t_{max}\)

  1. \(\vvec{x}_0\gets\) Initialize(\(\vvec{x}_0\))

  2. \(\vvec{v}_0\gets \vvec{0}\)

  3. for \(t\in\{0,\ldots,t_{max}-1\}\)

    1. \(\vvec{v}_{t+1} \leftarrow \gamma \vvec{v}_{t} + \eta \nabla f(\vvec{x}_t)\)

    2. \(\vvec{x}_{t+1} \leftarrow \vvec{x}_t - \vvec{v}_{t+1}\)

  4. return \(\vvec{x}_{t_{max}}\)

Gradient descent with momentum is generally much faster than vanilla gradient descent. Setting the parameter \(\gamma\) in practice is not very challenging, since \(\gamma\in\{0.9,0.99\}\) usually works well. If you are interested in a deeper dive into why momentum works, I recommend looking at this paper.

Example: Rosenbrock Function#

We plot below the trajectory of gradient descent with momentum, using a step-size of \(\eta = 0.0015\) and \(\gamma=0.9\). We again use 500 iteration steps and see that these now suffice to converge to the minimizer \((1,1)\).
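A minimal sketch of this run, following Algorithm 4 (plotting is omitted; the gradient is again the one from the Rosenbrock example):

```python
import numpy as np

def grad_rosenbrock(x):
    return np.array([400 * x[0] * (x[0]**2 - x[1]) + 2 * (x[0] - 1),
                     200 * (x[1] - x[0]**2)])

x = np.array([-1.5, 1.5])
v = np.zeros(2)
for _ in range(500):
    v = 0.9 * v + 0.0015 * grad_rosenbrock(x)  # velocity: decaying gradient sum
    x = x - v
print(x)  # approaches the minimizer (1, 1)
```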

(Figure: trajectory of gradient descent with momentum, \(\eta=0.0015\), \(\gamma=0.9\); _images/optimization_numerical_15_0.png)