Convex Function

Definition

If $f$ is differentiable on a convex domain, then

  • $f$ is convex $\iff$ $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$ in the domain
  • if $f$ is twice differentiable, $f$ is convex $\iff$ $\nabla^2 f(x) \succeq 0$ for all $x$ in the domain
Link to original

Convex Set

Definition

A set of points $C$ is convex if it contains every line segment between two points in the set: for all $x, y \in C$ and $\theta \in [0, 1]$, $\theta x + (1 - \theta) y \in C$.

Link to original

Feasible Region

Definition

The set of all points of an optimization problem that satisfy the problem’s constraints.

Link to original

Convex Optimization

Definition

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad h_j(x) = 0$$

An optimization problem in which the objective function $f$ is a Convex Function and the feasible set is a Convex Set.

Facts

Every local minimum is a global minimum

The convex feasible set condition is equivalent to the following conditions

  • the inequality constraint functions $g_i$ are Convex Functions
  • the equality constraint functions $h_j$ are affine functions
Link to original

Gradient Descent

Definition

An iterative optimization algorithm for finding a local minimum of a differentiable function $f$

$$x_{k+1} = x_k - \alpha \nabla f(x_k)$$

where $\alpha$ is a learning rate

Examples

Solution of a linear system

Solve $Ax = b$ with an MSE loss

The cost function is $f(x) = \frac{1}{2} \| Ax - b \|^2$ and its gradient is

$$\nabla f(x) = A^\top (Ax - b)$$

Then, the solution is obtained by iterating

$$x_{k+1} = x_k - \alpha A^\top (A x_k - b)$$
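
The update above can be sketched in a few lines; the matrix, learning rate, and iteration count below are illustrative choices, not part of the original note.

```python
import numpy as np

def gradient_descent_lsq(A, b, lr=0.01, iters=5000):
    """Minimize f(x) = 0.5 * ||Ax - b||^2 by gradient descent."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)   # gradient of the MSE cost
        x = x - lr * grad          # descent step with fixed learning rate
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = gradient_descent_lsq(A, b)
# converges to the solution of Ax = b, here (0.8, 1.4)
```

With a fixed learning rate, convergence requires $\alpha < 2 / \lambda_{\max}(A^\top A)$; the value used here satisfies that for this matrix.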

Link to original

Newton's Method

Definition

An iterative algorithm for finding the roots of a differentiable function $f$, which are the solutions to the equation $f(x) = 0$

Algorithm

Find the next point $x_{k+1}$ such that the Taylor series at the given point is 0

First-order Taylor approximation:

$$f(x) \approx f(x_k) + f'(x_k)(x - x_k)$$

The point such that the Taylor series is 0:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$$

multivariate version:

$$x_{k+1} = x_k - J_f(x_k)^{-1} f(x_k)$$

In convex optimization,

Find the minimum point^[its derivative is 0] of the Taylor quadratic approximation.

Taylor quadratic approximation:

$$f(x) \approx f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2} f''(x_k)(x - x_k)^2$$

The derivative of the quadratic approximation:

$$f'(x_k) + f''(x_k)(x - x_k)$$

The minimum point of the quadratic approximation^[the point such that the derivative of the quadratic approximation is 0]:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$

multivariate version:

$$x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)$$

Examples

Solution of a linear system

Solve $Ax = b$ with an MSE loss

The cost function is $f(x) = \frac{1}{2} \| Ax - b \|^2$, and its gradient and Hessian are

$$\nabla f(x) = A^\top (Ax - b), \qquad \nabla^2 f(x) = A^\top A$$

Then, the solution of a single Newton step is

$$x_1 = x_0 - (A^\top A)^{-1} A^\top (A x_0 - b) = (A^\top A)^{-1} A^\top b$$

If $A^\top A$ is invertible, $x_1 = (A^\top A)^{-1} A^\top b$ is a Least Square solution.
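
Because the MSE cost is quadratic, a single Newton step from any starting point reaches the least-squares solution. A minimal sketch; the matrix and vector are made-up data:

```python
import numpy as np

def newton_lsq(A, b, x0):
    """One Newton step on f(x) = 0.5*||Ax - b||^2: the quadratic Taylor
    model is exact for a quadratic cost, so one step suffices."""
    grad = A.T @ (A @ x0 - b)          # gradient at x0
    H = A.T @ A                        # constant Hessian
    return x0 - np.linalg.solve(H, grad)

A = np.array([[2.0, 1.0], [1.0, 3.0], [1.0, 1.0]])  # overdetermined system
b = np.array([3.0, 5.0, 2.2])
x = newton_lsq(A, b, np.zeros(2))
# x equals the least-squares solution (A^T A)^{-1} A^T b
```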

Link to original

Quasi-Newton Method

Definition

Secant equation

$$B_{k+1} s_k = y_k \qquad \text{where } s_k = x_{k+1} - x_k, \; y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

Methods

  • Broyden’s Method
  • Symmetric Rank-One Method
  • Davidon–Fletcher–Powell Formula
  • Broyden–Fletcher–Goldfarb–Shanno Algorithm

Facts

DFP and BFGS are dual to each other: each update is obtained from the other by exchanging $B_k \leftrightarrow H_k$ and $s_k \leftrightarrow y_k$

Link to original

Broyden's Method

Definition

Solve $f(x) = 0$ using Newton’s Method with an approximation of the Jacobian Matrix

Algorithm

The essence of Broyden’s method is to approximate the Jacobian Matrix of Newton’s Method using the average rate of change

Replace the derivative or gradient with an approximation $B_k \approx J_f(x_k)$. Then,

$$x_{k+1} = x_k - B_k^{-1} f(x_k)$$

Define

$$s_k = x_{k+1} - x_k, \qquad y_k = f(x_{k+1}) - f(x_k)$$

Then, we want to find $B_{k+1}$ satisfying the following expression by the above approximation

$$B_{k+1} s_k = y_k$$

To ensure stability of the update, take a minimum modification of $B_k$. In other words, solve the minimization problem using the method of Lagrange multipliers

$$\min_{B_{k+1}} \| B_{k+1} - B_k \|_F \quad \text{subject to} \quad B_{k+1} s_k = y_k$$

Then, the solution is

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^\top}{s_k^\top s_k}$$

Now the updating formula becomes

$$x_{k+1} = x_k - B_k^{-1} f(x_k)$$

where $B_{k+1}^{-1}$ can be easily calculated by the Sherman–Morrison Formula

$$B_{k+1}^{-1} = B_k^{-1} + \frac{(s_k - B_k^{-1} y_k)\, s_k^\top B_k^{-1}}{s_k^\top B_k^{-1} y_k}$$
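
A minimal sketch of the method on a two-dimensional root-finding problem; the test system and the choice of the exact Jacobian at the start point as $B_0$ are illustrative assumptions:

```python
import numpy as np

def broyden(f, x0, B0, iters=50, tol=1e-12):
    """Broyden's method: root finding with a rank-one secant update
    of the Jacobian approximation B instead of recomputing it."""
    x = np.asarray(x0, dtype=float)
    B = np.asarray(B0, dtype=float).copy()
    fx = f(x)
    for _ in range(iters):
        s = np.linalg.solve(B, -fx)            # Newton-like step: B s = -f(x)
        x = x + s
        fx_new = f(x)
        y = fx_new - fx
        B += np.outer(y - B @ s, s) / (s @ s)  # secant update: B_new s = y
        fx = fx_new
        if np.linalg.norm(fx) < tol:
            break
    return x

# solve x0^2 + x1^2 = 4 and x0 = x1, starting from (1, 2)
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])
J0 = np.array([[2.0, 4.0], [1.0, -1.0]])       # Jacobian at the start point
sol = broyden(f, [1.0, 2.0], J0)
# sol approaches (sqrt(2), sqrt(2))
```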

Link to original

Symmetric Rank-One Method

Definition

The Quasi-Newton Method with a symmetry condition on the Hessian approximation

Algorithm

Define

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

Then, the secant equation is

$$B_{k+1} s_k = y_k$$

To assure the symmetry of the approximation of the Hessian matrix, let the perturbation matrix be a symmetric rank-one matrix

$$B_{k+1} = B_k + \sigma v v^\top$$

Then,

$$\sigma v v^\top s_k = y_k - B_k s_k$$

Now we can express $\sigma v (v^\top s_k) = y_k - B_k s_k$. Since the $v$ satisfying this is unique up to scaling (because $v$ is a scaling of $y_k - B_k s_k$), we can get $\sigma (v^\top s_k)^2$ by multiplying $s_k^\top$ on both sides

$$\sigma (v^\top s_k)^2 = s_k^\top (y_k - B_k s_k)$$

By substituting into the original expression, and since the denominator is a scalar,

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^\top}{(y_k - B_k s_k)^\top s_k}$$
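
The final formula can be checked directly. The sketch below (with made-up $s$ and $y$) applies one SR1 update and includes the standard safeguard of skipping the update when the denominator is near zero, an assumption not stated in the note:

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """One symmetric rank-one update of the Hessian approximation B."""
    r = y - B @ s                         # residual of the secant equation
    denom = r @ s
    if abs(denom) < eps * np.linalg.norm(r) * np.linalg.norm(s):
        return B                          # skip a numerically unstable update
    return B + np.outer(r, r) / denom

B = np.eye(2)
s = np.array([1.0, 0.5])
y = np.array([2.0, 1.5])
B_new = sr1_update(B, s, y)
# B_new is symmetric and satisfies the secant equation B_new @ s == y
```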

Link to original

Broyden–Fletcher–Goldfarb–Shanno Algorithm

Definition

The quasi-Newton method with symmetric and positive-definite conditions on the Hessian approximation. It updates the inverse of the Hessian directly.

Algorithm

Define

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

Then, the secant equation is

$$B_{k+1} s_k = y_k$$

We want to find the inverse $H_{k+1} = B_{k+1}^{-1}$ satisfying the following expression with the symmetric and positive-definite condition, by the definition

$$H_{k+1} y_k = s_k$$

So, solve the minimization problem with the symmetric and positive-definite conditions using the method of Lagrange multipliers

$$\min_{H} \| H - H_k \|_W \quad \text{subject to} \quad H = H^\top \ \text{and} \ H y_k = s_k$$

where $\| \cdot \|_W$ is a weighted Frobenius norm

Then, the solution is

$$H_{k+1} = (I - \rho_k s_k y_k^\top) H_k (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top \qquad \text{where} \quad \rho_k = \frac{1}{y_k^\top s_k}$$

Therefore, the updating formula is

$$x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$$

And by the Sherman–Morrison Formula, the corresponding update of the Hessian approximation is

$$B_{k+1} = B_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k}$$
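
A compact sketch of the full loop, pairing the inverse-Hessian update with a backtracking (Armijo) line search; the quadratic test problem and the curvature safeguard `s @ y > 0` are conventional choices assumed here:

```python
import numpy as np

def bfgs(f, grad, x0, iters=100, tol=1e-8):
    """BFGS: maintain an inverse-Hessian approximation H and update it
    with the rank-two formula after each accepted step."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)
    g = grad(x)
    for _ in range(iters):
        p = -H @ g                        # quasi-Newton search direction
        alpha = 1.0                       # backtracking (Armijo) line search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if s @ y > 1e-12:                 # curvature condition keeps H positive-definite
            rho = 1.0 / (s @ y)
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# minimize f(x) = 0.5 x^T Q x - b^T x; the minimizer solves Q x = b
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
xmin = bfgs(lambda x: 0.5 * x @ Q @ x - b @ x, lambda x: Q @ x - b, np.zeros(2))
# xmin approaches the solution of Q x = b, here (0, 2)
```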

Link to original

Davidon–Fletcher–Powell Formula

Definition

The quasi-Newton method with symmetric and positive-definite conditions. It updates the Hessian approximation and finds its inverse using the Sherman–Morrison Formula.

Algorithm

Define

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

Then, the secant equation is

$$B_{k+1} s_k = y_k$$

We want to find $B_{k+1}$ satisfying the secant equation with the symmetric and positive-definite condition

So, solve the minimization problem with the symmetric and positive-definite conditions using the method of Lagrange multipliers

$$\min_{B} \| B - B_k \|_W \quad \text{subject to} \quad B = B^\top \ \text{and} \ B s_k = y_k$$

where $\| \cdot \|_W$ is a weighted Frobenius norm

Then, the solution is

$$B_{k+1} = (I - \rho_k y_k s_k^\top) B_k (I - \rho_k s_k y_k^\top) + \rho_k y_k y_k^\top \qquad \text{where} \quad \rho_k = \frac{1}{y_k^\top s_k}$$

Therefore, the updating formula is

$$x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)$$

And by the Sherman–Morrison Formula, the inverse $H_k = B_k^{-1}$ is updated directly by

$$H_{k+1} = H_k + \frac{s_k s_k^\top}{s_k^\top y_k} - \frac{H_k y_k y_k^\top H_k}{y_k^\top H_k y_k}$$
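
The inverse update can be verified to satisfy the secant equation in its inverse form $H_{k+1} y_k = s_k$; a small sketch with made-up vectors:

```python
import numpy as np

def dfp_inverse_update(H, s, y):
    """DFP update of the inverse-Hessian approximation H."""
    Hy = H @ y
    return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)

H = np.eye(2)
s = np.array([0.5, 1.0])
y = np.array([1.0, 3.0])
H_new = dfp_inverse_update(H, s, y)
# H_new is symmetric and satisfies H_new @ y == s
```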

Link to original

Broyden Family Method

Definition

The Broyden family update is obtained by combining the BFGS and the DFP updates

Algorithm

$$B_{k+1}^{\phi} = (1 - \phi)\, B_{k+1}^{\mathrm{BFGS}} + \phi\, B_{k+1}^{\mathrm{DFP}}$$

where $\phi$ is a scalar parameter

Facts

If the parameter $\phi$ of the Broyden Family takes the particular value $\phi_k = \frac{s_k^\top y_k}{s_k^\top y_k - s_k^\top B_k s_k}$, then the update is equal to the Symmetric Rank-One Method

Link to original

Method of Lagrange Multipliers

Definition

The method of Lagrange multipliers is a method for finding the local maxima and minima of a function subject to equality constraints

Solution

Given the general problem

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0$$

In order to find the maximum or minimum of a function $f$ subject to the equality constraint $g(x) = 0$, find the stationary points of

$$\mathcal{L}(x, \lambda) = f(x) + \lambda\, g(x)$$

considered as a function of $x$ and the Lagrange multiplier $\lambda$. In other words, all partial derivatives should be zero

$$\nabla_x \mathcal{L} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = 0$$

The solution of the constrained optimization problem is always a Saddle Point of the Lagrangian function $\mathcal{L}$.

Facts

At the solution, the gradient of the constraint and the gradient of the function should be in the same or opposite directions: $\nabla f(x) = -\lambda \nabla g(x)$.

Proof

If $\nabla f$ and $\nabla g$ don’t have the same or opposite directions, then there is a direction decreasing $f$ subject to $g(x) = 0$ (a direction along the constraint surface with a negative component along $\nabla f$)

If $\nabla f$ and $\nabla g$ have the same or opposite direction, then there is no direction decreasing $f$ subject to $g(x) = 0$
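
For a quadratic objective with a linear constraint, the stationarity conditions form a linear system; a worked sketch (the particular $f$ and $g$ are made up for illustration):

```python
import numpy as np

# minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0
# L(x, y, lam) = x^2 + y^2 + lam*(x + y - 1); setting all partials to zero:
#   2x + lam = 0,  2y + lam = 0,  x + y = 1
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = np.linalg.solve(K, rhs)
# stationary point: x = y = 0.5 with multiplier lam = -1
```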

Link to original

Infeasible Start Newton's Method

Definition

A variant of Newton’s Method for equality-constrained minimization whose iterates need not start from a feasible point.

Algorithm

Solve the minimization problem with the equality condition

$$\min_x f(x) \quad \text{subject to} \quad Ax = b$$

Use a Newton’s Method to find the optimal point satisfying the above conditions

The quadratic Taylor approximation of $f$ at the point $x$ is

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \frac{1}{2} \Delta x^\top \nabla^2 f(x)\, \Delta x$$

The minimization problem is

$$\min_{\Delta x} f(x) + \nabla f(x)^\top \Delta x + \frac{1}{2} \Delta x^\top \nabla^2 f(x)\, \Delta x \quad \text{subject to} \quad A(x + \Delta x) = b$$

The Lagrangian function is

$$\mathcal{L}(\Delta x, w) = f(x) + \nabla f(x)^\top \Delta x + \frac{1}{2} \Delta x^\top \nabla^2 f(x)\, \Delta x + w^\top (A(x + \Delta x) - b)$$

where $w$ is a Lagrangian multiplier

The optimality condition is

$$\nabla^2 f(x)\, \Delta x + \nabla f(x) + A^\top w = 0, \qquad A(x + \Delta x) = b$$

In a matrix form,

$$\begin{bmatrix} \nabla^2 f(x) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x) \\ b - Ax \end{bmatrix}$$

we solve this linear equation to find the Newton direction $\Delta x$
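
For a quadratic objective, the KKT system above is exact, so one step reaches the constrained optimum even from an infeasible start; a sketch with an assumed toy problem:

```python
import numpy as np

def infeasible_newton_step(hess, grad, A, b, x):
    """One infeasible-start Newton step: solve the KKT system
    [[H, A^T], [A, 0]] [dx, w] = [-grad, b - A x]."""
    n, m = len(x), A.shape[0]
    K = np.block([[hess(x), A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-grad(x), b - A @ x])
    return x + np.linalg.solve(K, rhs)[:n]

# minimize 0.5*||x||^2 subject to x0 + x1 = 1, from an infeasible start
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x1 = infeasible_newton_step(lambda x: np.eye(2), lambda x: x,
                            A, b, np.array([3.0, -1.0]))
# one step lands on the constrained optimum (0.5, 0.5)
```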

Link to original

Backtracking Line Search

Definition

A method to determine the amount to move along a given search direction (the learning rate).

Algorithm

We want to find a learning rate $\alpha$ minimizing the cost function along the search direction $\Delta x$

$$\min_{\alpha > 0} f(x + \alpha \Delta x)$$

The optimal $\alpha$ could be obtained by using grid search over an interval. However, it causes substantial computational costs.

Instead, identify a value of $\alpha$ that provides a reasonable amount of improvement in the objective function, rather than finding the actual minimizing value

The backtracking line search starts with a large $\alpha$ and iteratively shrinks it until the value is small enough to provide a sufficient decrease in the objective function.

Calculate the gradient of the objective function with respect to the learning rate $\alpha$ at the point $\alpha = 0$

$$\frac{d}{d\alpha} f(x + \alpha \Delta x) \Big|_{\alpha = 0} = \nabla f(x)^\top \Delta x$$

and define a parameter $c \in (0, 1)$ which modifies the slope of this tangent line

Now, the line $\ell(\alpha) = f(x) + c\, \alpha\, \nabla f(x)^\top \Delta x$ is used to check whether the decrease in the objective function is enough

Start from a large $\alpha$, usually $\alpha = 1$, and iteratively shrink it by multiplying by $\beta \in (0, 1)$ until the value is lower than the line

$$f(x + \alpha \Delta x) \le f(x) + c\, \alpha\, \nabla f(x)^\top \Delta x$$

If the condition is satisfied, then use the resulting $\alpha$ as the learning rate
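
The loop above in a few lines; the quadratic test function and the parameter values $c = 10^{-4}$, $\beta = 0.5$ are conventional defaults assumed here:

```python
import numpy as np

def backtracking(f, grad_x, x, direction, alpha=1.0, c=1e-4, beta=0.5):
    """Shrink alpha until the sufficient-decrease (Armijo) condition
    f(x + a*d) <= f(x) + c*a*grad(x)^T d holds."""
    fx = f(x)
    slope = grad_x @ direction            # directional derivative at x
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= beta                     # shrink the step
    return alpha

f = lambda x: (x ** 2).sum()
x = np.array([2.0, 0.0])
g = 2 * x                                 # gradient of f at x
a = backtracking(f, g, x, -g)
# alpha = 1 overshoots past the minimum, so one halving is needed: a == 0.5
```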

Link to original

Duality

Definition

Optimization problems may be viewed from either the primal problem or the dual problem.

Properties

Consider the following optimization problem

The original problem is called a Primal problem

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad h_j(x) = 0$$

Let $\mathcal{D}$ be the domain of the problem and $\mathcal{F}$ be its feasible set

The Lagrangian function corresponding to the primal problem is called a Lagrangian primal function

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x), \qquad \lambda_i \ge 0$$

where the variable $x$ is called a Primal variable, and $\lambda, \nu$ are called Dual variables

Since $\lambda_i g_i(x) \le 0$ and $h_j(x) = 0$ by the constraints, the Lagrangian primal function is always lower than or equal to $f(x)$ for all $x$ in the feasible set

$$\mathcal{L}(x, \lambda, \nu) \le f(x)$$

Let a primal optimal value $p^* = \min_{x \in \mathcal{F}} f(x)$ be the feasible optimal value of the primal problem

Let a Lagrange dual function be the minimum of the Lagrangian primal function for all $x$ in the domain

$$d(\lambda, \nu) = \min_{x \in \mathcal{D}} \mathcal{L}(x, \lambda, \nu)$$

We say another optimization problem for the Lagrange dual function a Dual problem

$$\max_{\lambda, \nu} d(\lambda, \nu) \quad \text{subject to} \quad \lambda \ge 0$$

Since the Lagrange dual function minimizes the Lagrangian primal function over $x$, it is always lower than or equal to the primal optimal value

$$d(\lambda, \nu) \le \mathcal{L}(x^*, \lambda, \nu) \le f(x^*) = p^*$$

Let a dual optimal value $d^*$ be the maximum of the dual problem

Then, by the above inequality, $d^*$ becomes the lower bound of $p^*$. This property is called a Weak Duality

$$d^* \le p^*$$

and the difference $p^* - d^*$ between the primal optimal value and the dual optimal value is called a Duality Gap

If the dual optimal value and the primal optimal value are the same ($d^* = p^*$), then this property is called a Strong Duality

Solution for the Problem Satisfying Strong Duality

  • Find the Lagrange dual function $d(\lambda, \nu)$ by using the gradient condition $\nabla_x \mathcal{L} = 0$
  • Find the dual optimal variables $\lambda^*, \nu^*$ from the dual problem by using the gradient condition
  • Obtain the primal optimal variable $x^*$ using the dual optimal variables

Visual Explanation of Duality

Consider an optimization problem

$$\min_x f(x) \quad \text{subject to} \quad g(x) \le 0$$

Let a set $G = \{ (u, t) : u = g(x),\ t = f(x),\ x \in \mathcal{D} \}$

And let $p^*$ be the primal optimal value

By definition of the Lagrange dual function,

$$d(\lambda) = \min \{\, t + \lambda u : (u, t) \in G \,\}$$

Consider the $(u, t)$ plane, where the feasible region corresponds to the part of $G$ with $u \le 0$

The optimal value $p^*$ is given by the tangent horizontal line that indicates the minimum value of $t$ when $u \le 0$

Since $\lambda \ge 0$, the line $t + \lambda u = d(\lambda)$ always has non-positive slope and supports $G$ from below. So, for a given $\lambda$, the line provides a lower bound of the objective function

The figure shows the hyperplanes for different values of $\lambda$. We see that $\lambda^*$ gives the best lower bound $d^*$; we also see that Strong Duality does not hold in this example because $d^* < p^*$

Optimality Property

If we have feasible solutions to the primal and dual problems such that their respective objective functions are equal, then these solutions are optimal to their respective problems

Proof: Let a primal variable $\hat{x}$ and dual variables $(\hat{\lambda}, \hat{\nu})$ be feasible values satisfying

$$f(\hat{x}) = d(\hat{\lambda}, \hat{\nu})$$

Since the primal optimal value minimizes $f$ over the feasible set, the value with the primal variable $\hat{x}$ is always greater than or equal to $p^*$, and with weak duality

$$d(\hat{\lambda}, \hat{\nu}) \le d^* \le p^* \le f(\hat{x})$$

And by the assumption, the first term and the last term of the inequality are the same, so every inequality becomes an equality.

Then, by the equality $d(\hat{\lambda}, \hat{\nu}) = d^*$, the dual variables are the dual optimal, and by the equality $f(\hat{x}) = p^*$, the primal variable is the primal optimal. And the two optimal values are the same.

Therefore, the values are optimal.

Facts

If the primal is a minimization problem, then the dual is a maximization problem, and vice versa

The solution to the primal is an upper bound to the solution of the dual, and the solution of the dual is a lower bound to the solution of the primal

For the minimization primal problem, the dual optimal is a lower bound of the primal optimal
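
Weak duality can be seen numerically on a one-dimensional example; the problem min $x^2$ subject to $1 - x \le 0$ (with $p^* = 1$) is a made-up illustration:

```python
import numpy as np

# primal: min x^2 subject to 1 - x <= 0, so p* = 1 at x = 1
# Lagrangian: L(x, lam) = x^2 + lam*(1 - x)
# dual function: d(lam) = min_x L(x, lam) = lam - lam^2/4 (attained at x = lam/2)
lams = np.linspace(0.0, 4.0, 401)
d = lams - lams ** 2 / 4.0
p_star = 1.0
d_star = d.max()            # dual optimal value, attained at lam = 2
# every d(lam) is a lower bound on p*, and here the gap closes: d* == p* == 1
```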

Link to original

Duality Gap

Definition

The difference between the primal and dual solutions

Link to original

Weak Duality

Definition

For a minimization problem, the solution to the primal problem is always greater than or equal to the solution to the dual problem: $p^* \ge d^*$

The Duality Gap of an optimization problem is always greater than or equal to zero

Link to original

Strong Duality

Definition

The primal optimal objective and the dual optimal objective are equal. The Duality Gap of an optimization problem is zero.

Facts

If strong duality holds, the dual optimal value equals the primal optimal value: $d^* = p^*$

Link to original

Slater's Condition

Definition

For a convex problem, Strong Duality holds if there exists a strictly feasible point $x$ such that $g_i(x) < 0$ for all $i$ and $h_j(x) = 0$ for all $j$

Link to original

Karush–Kuhn–Tucker Conditions

Definition

By allowing inequality constraints, the KKT conditions generalize the method of Lagrange multipliers, which allows only equality constraints.

Given the general problem

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad h_j(x) = 0$$

The KKT conditions are

  • Stationarity: $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) + \sum_j \nu_j \nabla h_j(x) = 0$
  • Primal feasibility: $g_i(x) \le 0, \; h_j(x) = 0$
  • Dual feasibility: $\lambda_i \ge 0$
  • Complementary slackness: $\lambda_i g_i(x) = 0$

Facts

At the optimum of an active constraint, the gradient of the inequality constraint and the gradient of the function should be in opposite directions: $\nabla f(x) = -\lambda \nabla g(x)$ with $\lambda \ge 0$.

If $\nabla f$ and $\nabla g$ don’t point in opposite directions, then there is a direction decreasing both $f$ and $g$

For a problem with Strong Duality, $(x, \lambda, \nu)$ are optimal $\iff$ they satisfy the KKT conditions

For a convex optimization, $(x, \lambda, \nu)$ satisfying the KKT conditions $\Rightarrow$ Strong Duality holds and the values are optimal

Proof: Since $f$ and $g_i$ are all convex functions and $h_j$ are affine, the Lagrangian primal function $\mathcal{L}(x, \lambda, \nu)$ is also convex in $x$ for fixed $\lambda \ge 0, \nu$

Let $(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$ be the values satisfying the KKT conditions. By the stationarity condition, $\nabla_x \mathcal{L}(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) = 0$. So, $\tilde{x}$ is a local minimum of $\mathcal{L}$, and by the convexity, it is also a global minimum. Therefore, $d(\tilde{\lambda}, \tilde{\nu}) = \mathcal{L}(\tilde{x}, \tilde{\lambda}, \tilde{\nu})$^[Lagrange dual function]

Also, by the complementary slackness and primal feasibility conditions, $\tilde{\lambda}_i g_i(\tilde{x}) = 0$ and $h_j(\tilde{x}) = 0$. So, $\mathcal{L}(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) = f(\tilde{x})$. Therefore, the dual optimal value and the primal optimal value are the same. In other words, the values satisfy the Strong Duality condition

$$d(\tilde{\lambda}, \tilde{\nu}) = f(\tilde{x})$$

and by the optimality property above, the values are optimal

For a problem with Slater’s Condition (ensuring Strong Duality) and convexity, $(x, \lambda, \nu)$ satisfying the KKT conditions $\iff$ the values are optimal

Proof: Let $x^*$ and $(\lambda^*, \nu^*)$ be the primal and dual solutions

Since $(\lambda^*, \nu^*)$ are the dual optimal, the dual optimal value can be expressed as

$$d^* = d(\lambda^*, \nu^*) = \min_x \mathcal{L}(x, \lambda^*, \nu^*)$$

Since $x^*$ is a feasible value, the Lagrangian at $x^*$ is always greater than or equal to its minimum over $x$, and by the constraints of the Lagrangian primal function, $\lambda_i^* g_i(x^*) \le 0$ and $h_j(x^*) = 0$. So,

$$\min_x \mathcal{L}(x, \lambda^*, \nu^*) \le \mathcal{L}(x^*, \lambda^*, \nu^*) \le f(x^*) = p^*$$

In summary

$$d^* \le \mathcal{L}(x^*, \lambda^*, \nu^*) \le p^*$$

Since Slater’s Condition ensures Strong Duality $d^* = p^*$, the inequalities become equalities. Therefore, $\sum_i \lambda_i^* g_i(x^*) = 0$, so the complementary slackness condition is satisfied. Also, $x^*$ minimizes $\mathcal{L}(x, \lambda^*, \nu^*)$, so $\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0$ and the stationarity condition is satisfied. Finally, since $x^*$ and $(\lambda^*, \nu^*)$ are the primal and dual solutions, they satisfy the primal and dual feasibility conditions by the constraints of the Lagrangian primal and dual problems
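
The conditions can be checked mechanically for a candidate point; the small projection problem below is a made-up illustration:

```python
import numpy as np

# min (x1 - 1)^2 + (x2 - 2)^2  subject to  g(x) = x1 + x2 - 2 <= 0
# the unconstrained minimum (1, 2) is infeasible, so the constraint is active;
# candidate from stationarity + the active constraint: x = (0.5, 1.5), lam = 1
x = np.array([0.5, 1.5])
lam = 1.0
g = x[0] + x[1] - 2.0                       # primal feasibility: g <= 0
grad_f = 2.0 * (x - np.array([1.0, 2.0]))
grad_g = np.array([1.0, 1.0])
stationarity = grad_f + lam * grad_g        # should vanish at the optimum
```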

Link to original

Barrier Method

Definition

An interior point optimization technique used to solve constrained optimization problems. It involves adding a barrier or penalty term to the objective function that penalizes solutions for being outside the feasible region.

Algorithm

Consider the optimization problem with inequality and equality constraints

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0, \quad Ax = b$$

The KKT Conditions of the problem are the following

  • $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) + A^\top \nu = 0$^[Stationarity]
  • $g_i(x) \le 0, \; Ax = b$^[Primal feasibility]
  • $\lambda_i \ge 0$^[Dual feasibility]
  • $\lambda_i g_i(x) = 0$^[Complementary slackness]

Interior point methods solve the problem whose complementary slackness condition is relaxed

$$\lambda_i g_i(x) = -\mu$$

where $\mu > 0$ is a positive parameter

Substituting $\lambda_i = \mu / (-g_i(x))$, the stationarity condition becomes

$$\nabla f(x) + \mu \nabla \phi(x) + A^\top \nu = 0$$

where $\phi(x) = -\sum_i \log(-g_i(x))$ is called a logarithmic barrier function

As $\mu \to 0$, the scaled logarithmic barrier function $\mu\, \phi(x)$ functions like an indicator function of the feasible region.

Now, by the definition of the log function, $g_i(x)$ should be negative, so the primal feasibility condition for the inequality constraints holds. Also, by the relaxed complementary slackness condition, $\lambda_i = \mu / (-g_i(x))$ is positive, so the dual feasibility condition holds

Now the conditions that have to be satisfied are the following

$$\nabla f(x) + \mu \nabla \phi(x) + A^\top \nu = 0, \qquad Ax = b$$

and the problem with inequality constraints becomes the problem with only the equality constraint

$$\min_x f(x) + \mu\, \phi(x) \quad \text{subject to} \quad Ax = b$$

It can be solved using Newton’s Method

Newton’s Method’s approximation becomes unstable if $\mu$ is too small and the current point is far from optimal. So, we initially choose a large $\mu$ and iteratively shrink it.
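
A one-dimensional sketch of this path-following scheme for min $x$ subject to $1 - x \le 0$ (a made-up problem with no equality constraint); the damping that halves a step until the iterate stays strictly feasible is a standard safeguard assumed here:

```python
def centering(mu, x0, iters=100, tol=1e-12):
    """Damped Newton on the barrier objective F(x) = x - mu*log(x - 1)
    for the problem: min x subject to 1 - x <= 0."""
    x = x0
    for _ in range(iters):
        g = 1.0 - mu / (x - 1.0)        # F'(x)
        if abs(g) < tol:
            break
        h = mu / (x - 1.0) ** 2         # F''(x) > 0
        step = -g / h
        while x + step <= 1.0:          # keep the iterate strictly feasible
            step *= 0.5
        x += step
    return x

# path following: start with a large mu and shrink it geometrically,
# warm-starting each centering from the previous central point
x, mu = 2.0, 1.0
for _ in range(20):
    x = centering(mu, x)
    mu *= 0.5
# x tracks the central path x = 1 + mu toward the optimum x* = 1
```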

Analytic Central Point of Feasible Region

If $\mu$ is very large, then the optimization problem becomes the following.

$$\min_x -\sum_i \log(-g_i(x)) \quad \text{subject to} \quad Ax = b$$

Since $\log$ is a monotonically increasing function, the problem can be transformed as

$$\max_x \prod_i (-g_i(x)) \quad \text{subject to} \quad Ax = b$$

So, it finds the analytic central point of the feasible region (the center of gravity of the feasible region)

Phase 1 method

In the first iteration, we have to find an initial point satisfying the strict inequality constraints $g_i(x) < 0$

To ensure the condition, transform the optimization problem only at the initial stage.

$$\min_{x, s} s \quad \text{subject to} \quad g_i(x) \le s, \quad Ax = b$$

If the optimal value satisfies $s < 0$, the corresponding $x$ is strictly feasible

Now the problem can be solved using Infeasible Start Newton’s Method

Link to original

Primal-Dual Interior-Point Method

Definition

An interior point optimization technique used to solve constrained optimization problems. It converges faster and achieves higher accuracy than the Barrier Method

Algorithm

Consider the optimization problem with an inequality constraint

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0$$

The KKT Conditions of the problem are the following

  • $\nabla f(x) + \sum_i \lambda_i \nabla g_i(x) = 0$^[Stationarity]
  • $g_i(x) \le 0$^[Primal feasibility]
  • $\lambda_i \ge 0$^[Dual feasibility]
  • $\lambda_i g_i(x) = 0$^[Complementary slackness]

Interior point methods solve the problem whose complementary slackness condition is relaxed

$$\lambda_i g_i(x) = -\mu$$

where $\mu > 0$

Instead of substituting $\lambda_i = \mu / (-g_i(x))$ into the stationarity condition like the Barrier Method, the primal-dual interior-point method directly solves the relaxed system with Newton’s Method

Now, the conditions that have to be satisfied are the following

$$r_\mu(x, \lambda) = \begin{bmatrix} \nabla f(x) + \sum_i \lambda_i \nabla g_i(x) \\ -\operatorname{diag}(\lambda)\, g(x) - \mu \mathbf{1} \end{bmatrix} = 0, \qquad \lambda > 0, \; g(x) < 0$$

and the problem with inequality constraints becomes the root-finding problem with equality

$$r_\mu(x, \lambda) = 0$$

It can be solved using Newton’s Method

Link to original