Convexity and Gradient Descent
Convexity
Convexity plays a vital role in the design of optimization algorithms. This is largely due to the fact that it is much easier to analyze and test algorithms in such a context.
- In other words, if the algorithm performs poorly even in the convex setting, typically we should not hope to see great results otherwise. Furthermore, even though the optimization problems in deep learning are generally nonconvex, they often exhibit some properties of convex ones near local minima.
Definition
Before diving into convex analysis, we need to define convex sets and convex functions. These lead to mathematical tools that are commonly applied to machine learning.
Convex Sets
Sets are the basis of convexity. Simply put, a set $\mathcal{X}$ in a vector space is *convex* if, for any $a, b \in \mathcal{X}$, the line segment connecting $a$ and $b$ is also contained in $\mathcal{X}$; that is, $\lambda a + (1-\lambda) b \in \mathcal{X}$ for all $\lambda \in [0, 1]$.
Something useful:
- Assume that $\mathcal{X}$ and $\mathcal{Y}$ are convex sets. Then $\mathcal{X} \cap \mathcal{Y}$ is also convex.
- We can strengthen this result with little effort: given convex sets $\mathcal{X}_i$, their intersection $\bigcap_i \mathcal{X}_i$ is convex.
- To see that the converse is not true, consider two disjoint sets $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Now pick $a \in \mathcal{X}$ and $b \in \mathcal{Y}$. The line segment in fig. 3 connecting $a$ and $b$ needs to contain some part that is neither in $\mathcal{X}$ nor in $\mathcal{Y}$, since we assumed that $\mathcal{X} \cap \mathcal{Y} = \emptyset$. Hence the line segment is not in $\mathcal{X} \cup \mathcal{Y}$ either, thus proving that in general unions of convex sets need not be convex.
Typically the problems in deep learning are defined on convex sets. For instance, $\mathbb{R}^d$, the set of $d$-dimensional vectors of real numbers, is a convex set: the line between any two points in $\mathbb{R}^d$ remains in $\mathbb{R}^d$. So are balls of bounded radius, such as $\{\mathbf{x} \mid \|\mathbf{x}\| \leq r\}$.
Convex Functions
Now that we have convex sets we can introduce convex functions $f$. Given a convex set $\mathcal{X}$, a function $f: \mathcal{X} \to \mathbb{R}$ is *convex* if for all $x, x' \in \mathcal{X}$ and for all $\lambda \in [0, 1]$ we have
$$\lambda f(x) + (1-\lambda) f(x') \geq f(\lambda x + (1-\lambda) x').$$
To illustrate this let’s plot a few functions and check which ones satisfy the requirement. Below we define a few functions, both convex and nonconvex.
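As a rough illustration (the specific functions and the random spot-check below are our own choices, not a definitive test of convexity), we can sample random pairs of points and verify the defining inequality:

```python
import numpy as np

# Candidate functions: f and h are convex, g = cos(pi * x) is not.
f = lambda x: 0.5 * x**2
g = lambda x: np.cos(np.pi * x)
h = lambda x: np.exp(0.5 * x)

def convexity_spot_check(fn, lo=-2.0, hi=2.0, trials=10_000, seed=0):
    """Check lam*fn(a) + (1-lam)*fn(b) >= fn(lam*a + (1-lam)*b) on random samples."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(lo, hi, trials)
    b = rng.uniform(lo, hi, trials)
    lam = rng.uniform(0.0, 1.0, trials)
    lhs = lam * fn(a) + (1 - lam) * fn(b)
    rhs = fn(lam * a + (1 - lam) * b)
    return bool(np.all(lhs >= rhs - 1e-12))

for name, fn in [("0.5*x^2", f), ("cos(pi*x)", g), ("exp(0.5*x)", h)]:
    print(name, "passes:", convexity_spot_check(fn))
```

The parabola and the exponential pass for every sampled triple, while the cosine fails for some, matching the definition. (Passing a random check does not prove convexity, but failing it does disprove it.)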
Jensen’s Inequality
Given a convex function $f$, one of the most useful mathematical tools is Jensen's inequality. It amounts to a generalization of the definition of convexity:
$$\sum_i \alpha_i f(x_i) \geq f\Big(\sum_i \alpha_i x_i\Big) \quad \text{and} \quad E_X[f(X)] \geq f(E_X[X]),$$
where the $\alpha_i$ are nonnegative real numbers such that $\sum_i \alpha_i = 1$ and $X$ is a random variable.
One of the common applications of Jensen’s inequality is to bound a more complicated expression by a simpler one.
- For example, Jensen's inequality can be applied to the log-likelihood of partially observed random variables. That is, we use
  $$E_{Y \sim P(Y)}[-\log P(X \mid Y)] \geq -\log P(X),$$
  since $\int P(Y) P(X \mid Y) \, dY = P(X)$. This can be used in variational methods.
- Here $Y$ is typically the unobserved random variable, $P(Y)$ is the best guess of how it might be distributed, and $P(X)$ is the distribution with $Y$ integrated out. For instance, in clustering $Y$ might be the cluster labels and $P(X \mid Y)$ is the generative model when applying cluster labels.
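As a quick numerical illustration of the inequality (the Gaussian samples and the choice $f = \exp$ are ours, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)   # samples of a random variable X ~ N(0, 1)
f = np.exp                       # a convex function

print(f(x).mean())   # E[f(X)] ~ e^(1/2) ~ 1.65
print(f(x.mean()))   # f(E[X]) ~ e^0 = 1.0, so E[f(X)] >= f(E[X]) as Jensen predicts
```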
Properties
Convex functions have many useful properties. We describe a few commonly-used ones below.
Local Minima Are Global Minima
This can be proved by contradiction:
Consider a convex function $f$ defined on a convex set $\mathcal{X}$.
- Suppose that $x^* \in \mathcal{X}$ is a local minimum: there exists a small positive value $p$ so that for $x \in \mathcal{X}$ that satisfies $0 < |x - x^*| \leq p$ we have $f(x^*) < f(x)$.

Now, let's make an assumption:
Assume that the local minimum $x^*$ is not the global minimum of $f$: there exists $x' \in \mathcal{X}$ for which $f(x') < f(x^*)$.

Since $f(x') < f(x^*)$, we know $x' \neq x^*$, so the distance between $x'$ and $x^*$ satisfies $|x^* - x'| > 0$.
To ensure that a convex combination $\lambda x^* + (1-\lambda) x'$ lies within distance $p$ of $x^*$, note that $|\lambda x^* + (1-\lambda) x' - x^*| = (1-\lambda)|x^* - x'|$.
Solving for $\lambda$ in $(1-\lambda)|x^* - x'| = p$ gives $\lambda = 1 - \frac{p}{|x^* - x'|} \in [0, 1)$ (assuming $p \leq |x^* - x'|$; otherwise $x'$ itself lies in the neighborhood, an immediate contradiction), so that $0 < |\lambda x^* + (1-\lambda) x' - x^*| \leq p$ and hence $f(x^*) < f(\lambda x^* + (1-\lambda) x')$.

However, according to the definition of convex functions, we have
$$f(\lambda x^* + (1-\lambda) x') \leq \lambda f(x^*) + (1-\lambda) f(x') < \lambda f(x^*) + (1-\lambda) f(x^*) = f(x^*),$$
which contradicts our statement that $f(x^*) < f(\lambda x^* + (1-\lambda) x')$. Therefore the assumption is false: a local minimum $x^*$ is also the global minimum of $f$.
Below Sets of Convex Functions Are Convex
We can conveniently define convex sets via below sets (sublevel sets) of convex functions. Concretely, given a convex function $f$ defined on a convex set $\mathcal{X}$, its below set
$$\mathcal{S}_b := \{x \mid x \in \mathcal{X} \text{ and } f(x) \leq b\}$$
is convex.

Proof:
Take any two points $x, x' \in \mathcal{S}_b$, so that $f(x) \leq b$ and $f(x') \leq b$.
Consider any $\lambda \in [0, 1]$; since $\mathcal{X}$ is convex, $\lambda x + (1-\lambda) x' \in \mathcal{X}$.
Using the convexity of $f$:
$$f(\lambda x + (1-\lambda) x') \leq \lambda f(x) + (1-\lambda) f(x') \leq \lambda b + (1-\lambda) b = b.$$
Therefore, $\lambda x + (1-\lambda) x' \in \mathcal{S}_b$.
Since any convex combination of points in $\mathcal{S}_b$ remains in $\mathcal{S}_b$, the set $\mathcal{S}_b$ is convex.
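A small sanity check of this result (the function $f(\mathbf{x}) = \|\mathbf{x}\|^2$, the threshold $b = 1$, and the sampling scheme are illustrative assumptions):

```python
import numpy as np

f = lambda x: np.sum(x**2, axis=-1)   # convex; its below set {f <= 1} is the unit disk
b = 1.0

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(20_000, 2))
pts = pts[f(pts) <= b]                # rejection-sample points inside S_b

n = len(pts) // 2
lam = rng.uniform(0, 1, size=(n, 1))
combos = lam * pts[:n] + (1 - lam) * pts[n:2 * n]   # convex combinations of members
print("all combinations stay in S_b:", bool(np.all(f(combos) <= b + 1e-12)))
```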
Convexity and Second Derivatives
Whenever the second derivative of a function $f: \mathbb{R}^n \to \mathbb{R}$ exists, it is very easy to check for convexity: all we need to do is verify that the Hessian is positive semidefinite, i.e., $\nabla^2 f \succeq 0$, meaning $\mathbf{x}^\top \mathbf{H} \mathbf{x} \geq 0$ for all $\mathbf{x} \in \mathbb{R}^n$, where $\mathbf{H} = \nabla^2 f$.
For instance, the function $f(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2$ is convex since $\nabla^2 f = \mathbf{I}$, the identity matrix.
Formally, a twice-differentiable one-dimensional function $f: \mathbb{R} \to \mathbb{R}$ is convex if and only if $f''(x) \geq 0$; a twice-differentiable multidimensional function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if and only if its Hessian satisfies $\nabla^2 f \succeq 0$.
Proof:
First, we prove the one-dimensional case.
Assume that $f$ is convex. Taking $\lambda = \frac{1}{2}$ in the definition with the points $x + \epsilon$ and $x - \epsilon$:
$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f(x).$$
This inequality follows from the definition of convexity.
Since the second derivative is given by the limit over finite differences, it follows that:
$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2 f(x)}{\epsilon^2} \geq 0.$$
Thus, $f'' \geq 0$.

Conversely, suppose that $f'' \geq 0$; then $f'$ is monotonically nondecreasing.
Let $a < x < b$ be three points in $\mathbb{R}$, where $x = (1 - \lambda) a + \lambda b$ for some $\lambda \in (0, 1)$.
By the mean value theorem, there exist $\alpha \in [a, x]$ and $\beta \in [x, b]$ such that
$$f'(\alpha) = \frac{f(x) - f(a)}{x - a} \quad \text{and} \quad f'(\beta) = \frac{f(b) - f(x)}{b - x}.$$
Since $f'$ is nondecreasing, $f'(\beta) \geq f'(\alpha)$. Therefore:
$$\frac{f(b) - f(x)}{b - x} \geq \frac{f(x) - f(a)}{x - a}.$$
Cross-multiplying (both denominators are positive) and carefully expanding:
$$(x - a) f(b) + (b - x) f(a) \geq (b - a) f(x).$$
Rewriting, since $x = (1 - \lambda) a + \lambda b$ we have $x - a = \lambda (b - a)$ and $b - x = (1 - \lambda)(b - a)$. Dividing by $b - a$, this implies:
$$\lambda f(b) + (1 - \lambda) f(a) \geq f\big((1 - \lambda) a + \lambda b\big).$$
Thus, $f$ is convex.
Next, we introduce a lemma before proving the multidimensional case.
Lemma: A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if and only if, for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, the one-dimensional function
$$g(z) \stackrel{\textrm{def}}{=} f(z \mathbf{x} + (1 - z) \mathbf{y}), \quad z \in [0, 1],$$
is convex.

To show that convexity of $f$ implies convexity of $g$, note that for all $a, b, \lambda \in [0, 1]$:
$$g(\lambda a + (1 - \lambda) b) = f\big(\lambda (a \mathbf{x} + (1 - a) \mathbf{y}) + (1 - \lambda)(b \mathbf{x} + (1 - b) \mathbf{y})\big) \leq \lambda g(a) + (1 - \lambda) g(b).$$
Therefore, $g$ is convex.

Conversely, suppose $g$ is convex for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$. Then for all $\lambda \in [0, 1]$:
$$f(\lambda \mathbf{x} + (1 - \lambda) \mathbf{y}) = g(\lambda \cdot 1 + (1 - \lambda) \cdot 0) \leq \lambda g(1) + (1 - \lambda) g(0) = \lambda f(\mathbf{x}) + (1 - \lambda) f(\mathbf{y}).$$
Thus, $f$ is convex.
Finally, using the lemma and the one-dimensional result, we can prove the multidimensional case.
By the lemma, a multidimensional function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if and only if, for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, the function $g(z) = f(z \mathbf{x} + (1 - z) \mathbf{y})$ is convex.
According to the one-dimensional case, this holds if and only if:
$$g''(z) = (\mathbf{x} - \mathbf{y})^\top \mathbf{H} (\mathbf{x} - \mathbf{y}) \geq 0, \quad \text{where } \mathbf{H} = \nabla^2 f,$$
for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$.
This condition is equivalent to $\mathbf{H} \succeq 0$, since every vector $\mathbf{v} \in \mathbb{R}^n$ can be written as $\mathbf{v} = \mathbf{x} - \mathbf{y}$.
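To connect this criterion to code, here is a sketch (the finite-difference Hessian and the test function are our own illustrative choices):

```python
import numpy as np

def hessian(fn, x, eps=1e-4):
    """Numerical Hessian of fn at x via central finite differences."""
    d = len(x)
    H = np.zeros((d, d))
    I = np.eye(d) * eps
    for i in range(d):
        for j in range(d):
            H[i, j] = (fn(x + I[i] + I[j]) - fn(x + I[i] - I[j])
                       - fn(x - I[i] + I[j]) + fn(x - I[i] - I[j])) / (4 * eps**2)
    return H

f = lambda x: 0.5 * np.dot(x, x)   # convex: the true Hessian is the identity
H = hessian(f, np.array([1.0, -2.0]))
print(np.linalg.eigvalsh(H))       # all eigenvalues >= 0, so H is positive semidefinite
```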
Constraints
One of the nice properties of convex optimization is that it allows us to handle constraints efficiently. That is, it allows us to solve constrained optimization problems of the form:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to } c_i(\mathbf{x}) \leq 0 \text{ for all } i \in \{1, \ldots, n\},$$
where $f$ is the objective and the functions $c_i$ are constraint functions.
Question: Imagine a unit ball. How would you express the constraint that a solution $\mathbf{x}$ must stay inside it? One option is the constraint function $c(\mathbf{x}) = \|\mathbf{x}\|_2 - 1 \leq 0$.
Lagrangian
Consider a ball inside a box: the ball will roll to the place that is lowest, and the forces of gravity will be balanced out by the forces that the sides of the box can impose on the ball.
- The gradient of the objective function (gravity) will be offset by the gradient of the constraint function (the ball needs to remain inside the box by virtue of the walls "pushing back").
The above reasoning can be expressed via the following saddle point optimization problem:
$$L(\mathbf{x}, \alpha_1, \ldots, \alpha_n) = f(\mathbf{x}) + \sum_{i=1}^n \alpha_i c_i(\mathbf{x}), \quad \text{where } \alpha_i \geq 0,$$
which we maximize over the $\alpha_i$ and minimize over $\mathbf{x}$.
- The variables $\alpha_i$ are Lagrange multipliers that ensure that constraints are properly enforced.
What Are Lagrange Multipliers?
If the constraint $c_i(\mathbf{x}) \leq 0$ is inactive at the optimum (i.e., $c_i(\mathbf{x}) < 0$), the saddle point problem drives the corresponding multiplier to $\alpha_i = 0$, and the constraint has no effect. If the constraint is active ($c_i(\mathbf{x}) = 0$), a positive $\alpha_i$ acts like a force pushing the solution back into the feasible region, much like the box wall pushing back on the ball.
Example
Suppose you have a function $f(x) = x^2$ that you want to minimize, subject to the constraint $x \geq 1$.
- If you ignore the constraint, the minimum would be at $x = 0$, but this violates the constraint $x \geq 1$.
- The Lagrangian would add a term $\alpha c(x) = \alpha (1 - x)$, where $\alpha \geq 0$ and the constraint is written as $c(x) = 1 - x \leq 0$.
- The solution to the Lagrangian will find $x = 1$ (the smallest value of $x$ that satisfies the constraint $x \geq 1$) and a corresponding value of $\alpha$ that ensures the constraint is properly enforced. A sketch of this calculation appears below.
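Here is that sketch in code (using `scipy.optimize.minimize` with an inequality constraint is our choice of solver, not part of the original derivation):

```python
from scipy.optimize import minimize

# min x^2  subject to  x >= 1, i.e. c(x) = 1 - x <= 0.
# By hand: L(x, alpha) = x**2 + alpha * (1 - x); stationarity gives alpha = 2x,
# and complementary slackness with x >= 1 forces x = 1, alpha = 2.
res = minimize(lambda x: x[0]**2, x0=[3.0],
               constraints=[{"type": "ineq", "fun": lambda x: x[0] - 1.0}])
print(res.x)   # ~[1.0], matching the hand calculation
```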
Penalties
Rather than satisfying $c_i(\mathbf{x}) \leq 0$ exactly, we can simply add the penalty $\alpha_i c_i(\mathbf{x})$ to the objective for some fixed $\alpha_i$. This is what weight decay does: adding $\frac{\lambda}{2} \|\mathbf{w}\|^2$ to the objective ensures that $\mathbf{w}$ does not grow too large.
In general, adding penalties is a good way of ensuring approximate constraint satisfaction. In practice, this turns out to be much more robust than exact satisfaction. Furthermore, for nonconvex problems, many of the properties that make the exact approach so appealing in the convex case (e.g., optimality) no longer hold.
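For instance, a hinge-style penalty for the example above might look like this (the penalty weight and the grid search are illustrative assumptions):

```python
import numpy as np

def penalized(x, alpha=10.0):
    """min x^2 with x >= 1 enforced approximately: penalize violations of 1 - x <= 0."""
    violation = np.maximum(0.0, 1.0 - x)
    return x**2 + alpha * violation

xs = np.linspace(-2.0, 2.0, 4001)
print(xs[np.argmin(penalized(xs))])   # ~1.0: the penalty pushes the minimum to the boundary
```

For smaller values of `alpha` the minimizer drifts into the infeasible region, which is exactly the approximate-satisfaction trade-off described above.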
Gradient Descent
Consider some continuously differentiable real-valued function $f: \mathbb{R} \to \mathbb{R}$. Using a Taylor expansion we obtain:
$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2).$$
It is not unreasonable to assume that for small $\epsilon$, moving in the direction of the negative gradient will decrease $f$. Taking $\epsilon = -\eta f'(x)$ for some fixed step size $\eta > 0$ yields:
$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x)).$$
Observations
- If the derivative $f'(x)$ does not vanish, we make progress since $\eta f'^2(x) > 0$.
- A small enough $\eta$ will make the higher-order terms become irrelevant, hence
$$f(x - \eta f'(x)) \lessapprox f(x).$$

This means that if we use $x \leftarrow x - \eta f'(x)$ to iterate $x$, the value of $f(x)$ declines (or at least does not increase); a minimal sketch follows below.
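Here is the sketch promised above (the choice $f(x) = x^2$ and the particular learning rates are illustrative):

```python
def gd(eta, x0=10.0, steps=10):
    """Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x."""
    x = x0
    for _ in range(steps):
        x -= eta * 2 * x   # x <- x - eta * f'(x)
    return x

print(gd(eta=0.2))   # ~0.06: steady progress toward the minimum at 0
print(gd(eta=1.1))   # ~62: the step size is too large, so the iterates oscillate and diverge
```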
Multivariate Gradient Descent
Let $\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$. The objective is now a function $f: \mathbb{R}^d \to \mathbb{R}$, and its gradient is a vector of $d$ partial derivatives:
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\right]^\top.$$
Each partial derivative element $\partial f(\mathbf{x}) / \partial x_i$ indicates the rate of change of $f$ at $\mathbf{x}$ with respect to the input $x_i$. As in the univariate case, the Taylor expansion $f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2)$ motivates the update
$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x}).$$
Choosing a suitable learning rate $\eta > 0$ is tricky: if it is too small, we make little progress per step; if it is too large, the iterates can oscillate or even diverge.
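A two-dimensional sketch (the quadratic objective and learning rate are again illustrative choices):

```python
import numpy as np

# f(x) = x1^2 + 2 * x2^2, with gradient [2 * x1, 4 * x2].
grad = lambda x: np.array([2 * x[0], 4 * x[1]])

x, eta = np.array([-5.0, -2.0]), 0.1
for _ in range(20):
    x = x - eta * grad(x)   # x <- x - eta * grad f(x)
print(x)                    # approaches the minimizer [0, 0]
```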
Adaptive Methods
Picking a suitable learning rate $\eta$ is tricky: make it too small and progress is slow; make it too large and the solution diverges. What if we could determine $\eta$ automatically, or get rid of the learning rate altogether? Second-order methods, which look not only at the value and gradient of the objective but also at its curvature, can help.
Please note that while these methods cannot be applied to deep learning directly due to the computational cost, they provide useful intuition into designing advanced optimization algorithms that mimic many of the desirable properties of these algorithms.
Newton’s Method
Reviewing the Taylor expansion of some function $f: \mathbb{R}^d \to \mathbb{R}$, there is no need to stop after the first term:
$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3).$$
To avoid cumbersome notation, we define $\mathbf{H} \stackrel{\textrm{def}}{=} \nabla^2 f(\mathbf{x})$ to be the Hessian of $f$, which is a $d \times d$ matrix.
Problems with this dimensionality? For small $d$ and simple problems, $\mathbf{H}$ is easy to compute. For deep neural networks, on the other hand, $\mathbf{H}$ may be prohibitively large, since storing it costs $\mathcal{O}(d^2)$ entries.
Now we have:
$$f(\mathbf{x} + \boldsymbol{\epsilon}) \approx f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \mathbf{H} \boldsymbol{\epsilon}.$$
Objective: Find the $\boldsymbol{\epsilon}$ that minimizes this quadratic approximation of $f$.
We want the value of $\boldsymbol{\epsilon}$ at which the approximation is stationary. Taking the derivative with respect to $\boldsymbol{\epsilon}$ and setting it to zero, we have:
$$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0 \quad \Rightarrow \quad \boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x}).$$
Why are we going through all this trouble?
Let's say $f(x) = \frac{1}{2} x^2$, so $f'(x) = x$ and $f''(x) = 1$. Then, for any starting point $x$, a single Newton step $\epsilon = -x$ lands exactly at the minimum $x = 0$, with no learning rate tuning required. A quick sketch appears below.
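The one-step behavior on this quadratic is easy to verify (a minimal sketch, using the derivatives computed above):

```python
# Newton's method on f(x) = 0.5 * x^2: f'(x) = x and f''(x) = 1,
# so the Newton step eps = -f'(x) / f''(x) = -x jumps straight to the minimum.
f_grad = lambda x: x
f_hess = lambda x: 1.0

x = 15.0
x = x - f_grad(x) / f_hess(x)
print(x)   # 0.0 after a single step, for any starting point
```

For non-quadratic objectives the quadratic model is only an approximation, so several iterations are typically needed; the next subsection quantifies how fast they converge.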
Convergence Analysis
Let's apply convergence analysis to Newton's method: specifically, how quickly the error $e^{(k)} \stackrel{\textrm{def}}{=} x^{(k)} - x^*$ shrinks, where $x^*$ is the minimizer, so that $f'(x^*) = 0$. We assume $f$ is convex and three times differentiable with $f''(x) > 0$.
- Newton's method updates the current point $x^{(k)}$ using the rule:
$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})}.$$

Expanding $f'(x^*) = f'(x^{(k)} - e^{(k)})$ in a Taylor series around $x^{(k)}$ gives:
$$0 = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \frac{1}{2} \big(e^{(k)}\big)^2 f'''(\xi^{(k)}),$$
for some $\xi^{(k)}$ between $x^*$ and $x^{(k)}$.
This equation expresses the derivative at the optimum in terms of two corrections:
- $e^{(k)} f''(x^{(k)})$: first-order correction using the second derivative.
- $\frac{1}{2} \big(e^{(k)}\big)^2 f'''(\xi^{(k)})$: second-order correction using the third derivative.

Here, $\xi^{(k)}$ is the intermediate point guaranteed by Taylor's theorem. Dividing the expansion by $f''(x^{(k)})$ and rearranging yields:
$$e^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} = \frac{1}{2} \big(e^{(k)}\big)^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}.$$
This equation tells us how far the next iterate lands from the optimum. The error at iteration $k+1$ is
$$e^{(k+1)} = x^{(k+1)} - x^* = e^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})},$$
which is exactly the left-hand side above.
Substitution and simplification give:
$$e^{(k+1)} = \frac{1}{2} \big(e^{(k)}\big)^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}.$$
We can ignore the sign and the constants for intuition: the new error is proportional to the square of the old one.
- Initially, the linear term $e^{(k)} f''(x^{(k)})$ drives the approximation.
- As the error gets small, Newton's method updates the solution in such a way that the remaining error becomes dominated by the quadratic term.

In other words, even though the quadratic term is negligible at first, it determines the asymptotic behavior: each step roughly squares the error.
We want to analyze the magnitude regardless of the sign, therefore taking the absolute value of both sides, we get:
$$\big|e^{(k+1)}\big| = \frac{1}{2} \big(e^{(k)}\big)^2 \frac{\big|f'''(\xi^{(k)})\big|}{f''(x^{(k)})}.$$
With a bounded condition $\frac{|f'''(\xi^{(k)})|}{2 f''(x^{(k)})} \leq c$, we obtain
$$\big|e^{(k+1)}\big| \leq c \big(e^{(k)}\big)^2,$$
i.e., quadratic convergence: once the iterate is close to $x^*$, the number of accurate digits roughly doubles at every step.
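We can watch this squaring happen numerically (the objective $f(x) = x - \log x$, which is convex on $x > 0$ with minimizer $x^* = 1$, is our illustrative choice):

```python
# Newton's method on f(x) = x - log(x): f'(x) = 1 - 1/x, f''(x) = 1/x^2.
x = 1.5
for k in range(5):
    x = x - (1 - 1 / x) / (1 / x**2)   # Newton update
    print(k + 1, abs(x - 1.0))         # error |e^(k)| relative to x* = 1
# Errors: 0.25, 0.0625, ~3.9e-3, ~1.5e-5, ~2.3e-10 -- each is the square of the last.
```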
Preconditioning
Preconditioning provides a cheaper alternative to using the full Hessian matrix. Instead of computing and storing the entire Hessian, preconditioning only uses the diagonal entries of the Hessian matrix. This dramatically simplifies the process while still providing useful information about the curvature of the function.
The update equation in this preconditioned version is:
$$\mathbf{x} \leftarrow \mathbf{x} - \eta \, \mathrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x}),$$
where:
- $\mathrm{diag}(\mathbf{H})$ is the diagonal matrix built from the diagonal entries of the Hessian, and
- $\mathrm{diag}(\mathbf{H})^{-1}$ is the inverse of this diagonal matrix.
Real-World Analogy: Units Mismatch
Imagine you are optimizing a function where one variable is height in millimeters and another variable is height in kilometers.
If you try to use the same learning rate (step size) for both variables, it will create problems because the scales of the two variables are wildly different. This mismatch in units (millimeters vs. kilometers) will lead to slow convergence because one variable might need very small updates, while the other requires larger updates.
Preconditioning solves this problem by effectively allowing for different learning rates for each variable. It adjusts the step size based on the curvature (as estimated by the diagonal of the Hessian) so that each variable gets an update that is appropriate for its scale.
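A minimal sketch of the idea (the badly scaled quadratic and the learning rate are assumptions made for illustration):

```python
import numpy as np

# f(x) = 0.5 * (a * x1^2 + b * x2^2) with wildly different curvatures per coordinate,
# mimicking a millimeters-vs-kilometers mismatch. The Hessian diagonal is [a, b].
a, b = 1e-4, 1e4
grad = lambda x: np.array([a * x[0], b * x[1]])
precond = 1.0 / np.array([a, b])      # diag(H)^{-1}

x = np.array([1000.0, 0.001])
for _ in range(10):
    x = x - 0.5 * precond * grad(x)   # x <- x - eta * diag(H)^{-1} * grad f(x)
print(x)   # both coordinates shrink at the same geometric rate despite the scale gap
```

Plain gradient descent on this problem would need $\eta \leq 2 / b = 2 \times 10^{-4}$ just to stay stable, leaving the first coordinate essentially frozen; the diagonal preconditioner removes that mismatch.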