# Multivariate optimization

## Inhaltsverzeichnis

## Introduction

In this thesis we want to continue with optimization problematic, which appears in many real situation. So in the beginning we recall optimization problem. We have to arrange our notation.

### Notation

Use the following throughout the course:

- gradient of
- Hessian matrix of
- is a local minimizer
- is linear model we will mention it later
- where is some appropriate matrix

### Recall optimization problem

, where is domain of definition,

subject to .

- : denote optimization variable.
- : objective function.
- : constraint functions.

**Optimal solution** has the smallest value
of among all vectors
that satisfy the constraints.

### Minimization X **maximization**

We concentrate only on the problem how to minimize our function. Why? We can suppose that we would like to solve this problem

subject to If we put and then we solve the minimization problem for function we have the result for maximization problem (maximum respectively minimum is in the same point).

### Which method should we use

We have to know which kind of problem we want to solve. We divide optimization problem into many classes according our problem looks like e.g.

- Which kind of function we have (discreet or continuous)?
- Can we use derivations (is our function smooth enough)?
- Do we want to find global or local minimum?
- Have we any constraints?

Amount of possible methods is very high. We can concentrate only on some of them, because to show all methods in one work is impossible. But in practice we should find the most optimal one. This tree can help us to find the best method.

### Cholesky decomposition

In other text we will need Cholesky decomposition in our algorithms. We note only that there is good theory and in which cases we can use this method. If is symmetrical and positive definite, then can be decomposed as

where
is a lower triangular matrix with strictly positive
diagonal entries. This is the Cholesky decomposition.
The Cholesky decomposition is unique. Cholesky
factorizations for positive **semidefinite** matrices are not unique
in general.
There is an algorithm, which can compute Cholesky decomposition!

## Methods and algorithms

In this part we will describe some methods. But we will talk only about some chosen examples. This will not be self-contained theory.

### Generic linesearch method

This will be only basic idea the further description we will see below. We would like to reduce Given an initial guess , let until convergence:

- Find a descent direction at .
- Compute a stepsize using a backtracking-Armijo (later) linesearch along .
- Set and increase a by 1.

We will occupy with method how to find very briefly, but we focus on looking for descent direction

#### Backtracking linesearch

Here is the **Backtracking-Armijo** linesearch method to find , where :

- 1. Given (e.g., ), let and

- 2. Until
- set ,
- where (e.g., )
- and increase by 1.

- 3. Set .

We will not mention why we use this method, but we can mention that this method is very good in empirical tests.

#### Method of steepest descent

The search direction:

called steepest-descent direction (It follows from the theory about gradient). is a descent direction and solves the problem

Any method that uses the steepest-descent direction is a method of steepest descent.

Some properties of steepest descent:

- From the theory follows that this method converges.
- Many other methods resort to steepest descent in bad cases.
- Unfortunately this method is not scale invariant, as re-scaling variables can lead to widely different directions.
- Convergence is usually very (very!) slow (linear).
- Numerically often not convergent at all.

On the picture on the right side we can see contours for the objective function and the iterations generated by the Generic Linesearch steepest-descent method.

#### More general descent methods

Because we saw that method of steepest descent have many imperfections, we will find other possibilities how to solve our problem. We will investigate a generalization of descent methods.

Let be a symmetric, positive definite matrix, and define the search direction so that

Then is a descent direction and solves the problem

If the Hessian is positive definite, and then this method calls Newton's method. General descent methods may be viewed as scaled steepest descent. Convergence is often faster than steepest descent. These methods can be made scale invariant.

On the right picture can we notice contours for the objective function and the iterations generated by the Generic Linesearch Newton method. We see that algorithm converges really faster than in the previous case.

#### Modified Newton methods

If is indefinite, it is usual to solve instead

where

- is chosen so that is sufficiently positive definite.
- when is itself sufficiently positive definite.

Now we show some ways to find , or replace .

- If has the spectral decomposition then

This will shift all the insufficiently small eigenvalues by as little as needed to make the matrix positive definite.

- This is cheaper to compute.
- Modified Cholesky decomposition (We attempt a Cholesky factorization of and alter factors if there is an evidence that the factorization will otherwise fail) :

#### Quasi-Newton methods

Normally we do not know matrix . Now we show various attempts to approximate :

- Finite-difference approximations:

for some small scalar

- Secant approximations: try to ensure the secant condition where and . There are two methods, which try to solve the problem under secant condition.

- 1. Symmetric Rank-1 method (but may be indefinite or even fail):

- 2. BFGS method: (symmetric and positive definite if ):

### Trust region methods

Now we focus on trust region methods. For better understanding we compare it with linear search method. We describe the basic ideas of both algorithms. We start with linear search method:

- Pick descent direction
- Pick stepsize to reduce

And now we are describing Trust-region methods

- Pick step to reduce model of We are looking only for one parameter.
- Accept if decrease in model inherited by
- Otherwise set , refine model. We are not able to find better approach so we have to change our model. Thus, while a linesearch method recovers from a poor step by retreating along a parametric (usually linear) curve, a trust-region method recovers by reconsidering the whole step-finding procedure.

#### Trust region model

To work with should be very complicated, for this we will apply approximation. We can use different models for , but in this work we will use these two, which are built with help of Taylor's polynomial:

- linear model

- quadratic model where have to be symmetric

But if we apply these models it appears some difficulties. They may not resemble if is large models and may be unbounded from below. Linear model is unbounded from below always unless and quadratic model always if is indefinite, possibly if is only positive semi-definite.

#### The trust region

We prevent model from unboundedness by imposing a trust-region constraint for some suitable scalar radius

trust-region subproblem

in theory does not depend on norm in practice it might! Before we have not any constraints on our minimization problem but now we have. For simplicity, concentrate on the second-order (Newton-like) model

and the Euclid's region norm Note:

- is allowed if is positive definite.
- Analysis for other trust-region norms simply adds extra constants in following results.

#### Basic trust-region method

And now we are going to describe a basic algorithm connecting with trust region method.

Given and , until convergence do:

Build the second-order model of Solve the trust-region subproblem to find for which and , and define

- If then our step is
**very successful**().

- Set and where

- Otherwise if then our step is
**successful**, where

- Set and

- Otherwise is our model
**unsuccessful**.

- Set and where

#### Trust region subproblem

From the algorithm above we saw that the main important question will be how to solve trust region subproblem or how to determine for which and . At this point we make only some comments about it and we will pursue it later. We want to solve new aim. It means to achieve as much reduction in the model as would an iteration of steepest descent. We introduce Cauchy point:

Minimize quadratic on line segment very easy! We shall require that

In practice we can do far better than this.

#### Example

We have a function The contours of the original function are shown as dotted line and the contours of trust-region model appear as solid lines. The trust region is black circle. The first plot shows quadratic model with positive definite Hessian. We see that in this case the model fits the function very good. It means the minimum of the model is very near to minimum of our function. The second plot shows linear model about the same point. The third plot shows a quadratic model with indefinite Hessian. The final plot is a quadratic model with positive definite Hessian whose minimizer lies outside the trust region.

#### Some theoretical results

We can guarantee reasonable reduction in the model at the Cauchy point.

**Theorem 1.:** If is the second-order
model and is its Cauchy point within the
trust-region

Since our step does at least as well as Cauchy point (we choose it so) we get this consequence.

**Corollary 2.:** If is the
second-order model, and is an improvement
on the Cauchy point within the trust-region

**Lemma 3.:** Suppose that and that
the true and model Hessians satisfy the bounds for all and for all and some
and Then

where for all .

This lemma shows the maximal difference between model and function. Since we are using a second-order model for which the first two terms are exactly from the Taylor's approximation this imply that difference between model and function will vary like the square of the norm of . So this result is consequence of theory about Taylor's polynomial.

**Lemma 4.:** Suppose that and that
the true and model Hessians satisfy the bounds for all and for all and some
and and that Suppose furthermore that and that

Then iteration is very successful and

This result is fairly intuitive, since when the radius is small the model looks more and more like its first order Taylor's expansion. From this follows that we make good progress because we approximate our function well.

**Lemma 5.:** Suppose that and that
the true and model Hessians satisfy the bounds for all and for all and some
and and that Suppose furthermore that there exists a
constant such that for all . Then

for all .

This result says that the radius uniformly bounded away from zero if the same is true of the sequence of gradients (non-optimal points).

**Lemma 6.:** Suppose that and that
both the true and model Hessian remain bounded for all
Then either

- or

- or

In conclusion, we have seen that trust region methods have a very rich underlying convergence. We now turn to the outstanding practical issue.

#### Trust region subproblem

Now we remind trust region subproblem and we will concern with it. So we want to approximately subject to Our aim is to find so that

We might solve it exactly with use of Newton-like method, or approximately then we apply steepest descent method.

**Theorem 7.:** Any global minimizer of
subject to satisfies the equation

where is positive semi-definite, and If is positive definite, is unique.

These optimality conditions suggest how we might solve the problem. We divide our problem into two cases:

- is positive-semidefinite and the solution to satisfies It means that we already have found the solution. We must not interest in this case further.
- is indefinite or satisfies In this case satisfies and We have non-linear (quadratic) system of equation in and We will concentrate on this case. We should mention that all these results are easy consequences of theorem 7. For instance in the second case is Why is ? In case that is indefinite is everything clear. In case that satisfies we have to find another solution then (for ) from this follows that

Suppose has spectral decomposition

where is matrix of eigenvectors (orthonormal) and diagonal matrix of ranked eigenvalues: We require positive semi-definite Let

From the notes before follows that we require

Define Next formula follows from linear algebra.

If we solve the equation we find the solution of trust region subproblem.

##### Example of solution

A plot of as varies from -8 to 6. Note the poles at the negatives are the eigenvalues of The dashed vertical component corresponds to interior solutions that occur for all larger than roughly 1,15, while the remaining segment indicates boundary solutions.

##### The secular equation

Instead of solve the secular equation

We use the secular equation because it has no poles, takes the smallest values at eigenvalues, it is an analytic function ideal for Newton method and it guarantees global convergence but we need to safeguard to protect Newton from the interior solution cases.

Now we will concentrate on finding solution of the secular equation with use of Newton method:

Let and be given.

Until convergence do:

- Factorize
- Solve
- Solve
- Replace by

#### Conjugate gradient method

We just introduce another algorithm to minimize , which is appropriate when the number of dimension of our function is high.

Given set and

until breakdown or small, iterate:

and increase by 1.

Breakdown is intended to cover the fatal case when (or in practice is close to zero), for which the iteration is undefined, and non-fatal case when for which is unbounded from below.

## Examples

### Maple 8 and Mathematica 4.1

This software use methods mentioned in this presentation (not only).

On the first picture have we polynomial where In Mathematica I used command FindMinimum and this program found the point (0,0), or the point very near from this. In Mathematica is important the point from which we start. In Maple I used command minimize (only on the region, which is shown on the picture). And this program found exact solution.

On the second picture have we function where Mathematica found, like the minimum, the point (0,0) with the value 2, but it started from this point. It wrote: it may be a maximum or saddle point. I tried alternative methods as well, but it stopped, because it came out of the domain of definition (it did not show any point). Maple found all correct solutions (4) on the domain of definition.

Now we have the function where When I used the same command in Mathematica like in the previous cases. It found nothing. It wrote: could not symbolically find the gradient. Maple did not find the solution as well. It wrote: location = false. But when I tested the similar cases in univariate case Maple was more successful.

In the last case have we function where
I tested this function only in
Maple, but the results were
surprisingly good. Maple said
that the minimal value is -1
and that this is when , only
for it did not found anything,
but there is infinitely many where

### Generalized linear model

Now we show other algorithm, where we will solve optimization problem and then we draw some graphs. A generalized linear model (GLM) is a regression model of the form

where is the expected value of the dependent variable is a matrix of explanatory variables, unknown parameter vector, and known link function. We have to find maximum of likelihood function of with respect to for maximum likelihood estimation suppose we, in our case, that the is normal distributed and are i.i.d. will be in our case exponential, quadratic and linear function. It is known that the least squares estimator in the classical linear model is also the maximum-likelihood estimator for normally distributed errors and in this model. In our case is log-likelihood function

but we will use only this function

where

Let we set

We have to solve

We will use Newton-Raphson algorithm. Until convergence do

where is the Hessian of the with respect to all components of is suppose to be positive definite (then algorithm converges).

#### Data

In this part we show previous problematic on a pollution data. We will work with 60 observations from the years 1959 to 1961 from United States. Every observation represents one town. My source was MD BASE. The first explanatory variable describes Average January temperature in degrees of F. The second denotes population per sq. mile in urbanized areas in 1960. The third says how many percent of families has income lower then 3000 $. Dependent variable is Relative hydrocarbon pollution potential.

On axis is dependent variable and on axis where is the matrix of explanatory variables. Now we have (linear regression) (how good is our regression model ) is 0.2924.

In the second case we have , and is 0.5353, so this model is better than previous.

In the last case we have then and is 0.858, so this model is the best one from all our models. It appears that this because of outliers, that is possible, but if we want to know why tends some town to be very polluted we have to use these outliers too. Of course would be better to make analysis without outliers too.

## References

- N.I.M. Gould, S. Leyer, An introduction to algorithms for nonlinear optimization, Frontiers in Numerical Analysis (Durham 2002)
- S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press 2004
- P. Lachout, Matematické programování
- B. P. Flannery, W.H. Press, S. A. Teukolsky, W.T. Vetterling, Numerical Recipes in C,The Art of Scientific Computinion,Second Edition, Cambridge University Press 1988, 1992
- Härdle, W., Klinke, S. and Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag Berlin Heidelberg
- Wikipedia different pages