Remarks on 'A comparison of some heuristic optimization methods'

This note is a reply to http://www.portfolioprobe.com/2012/07/23/a-comparison-of-some-heuristic-optimization-methods/

Principles

Pat's example showed that there are apparently sizeable differences between the results of the different heuristics. Typically, the specific settings and choices (such as how the solution is represented, or for how many iterations an algorithm is run) matter a lot for such methods. Thus, we could be tempted to argue that the authors of the packages that were used should have 'tried harder' to incorporate better default settings. (Note that I am one of these authors.) But I think this is illusory. It would be much better to educate people in how to use such methods properly.

Two principles should guide the use of heuristics: (i) the application matters, and (ii) go experiment. (Actually these principles are valid for any optimisation technique.)

(i) The application matters in the sense that how good a solution is needs to be judged with respect to what we actually want to achieve. Quantities need to be measured in meaningful units. The example is small and artificial, so it is difficult to evaluate the importance of these errors. Nevertheless, we need constant reminding that we cannot say whether an error of 0.2 matters as long as we do not know what we actually measure. (One may argue that the post was only about numerical aspects, and so this principle is off-topic. But I think it is just as off-topic as Morgenstern's 'On the Accuracy of Economic Observations' is off-topic to what econometricians do.)1

(ii) There is no way to configure a general-purpose heuristic such that it works out-of-the-box for all cases. (For a specialised implementation, such settings may well exist.) The second principle states that this is not a problem: users simply need to give up the idea that they can run a method once with default settings and be done. Rather, users need to run small-scale experiments to see which settings are appropriate.

Default settings

When I started working on the NMOF package, I talked to Manfred Gilli about whether the methods should have default settings. My preference was No; but Manfred suggested adding 'some numbers' to make it easier for people to get started with these methods. But clearly, to get started, not to use them 'as is'. It is easier to modify – to play around with – a programme that actually runs.

The trouble with heuristics is that it is difficult to come up with generic settings that 'simply work'. This difficulty is worst for the older methods, like Simulated Annealing or Tabu Search, for which the most critical part is the neighbourhood. It is a lesser problem for more precisely defined techniques like Differential Evolution (DE). With DE, for instance, there is essentially one question: how long should we run the algorithm?

We cannot use the rules of classical optimisation. Even if there has been no improvement in the best objective function value for quite some time, we cannot be sure that the algorithm should be stopped: heuristics were specifically designed to be able to walk away from local minima, and this may take time.

(Well, there are at least two exceptional cases that I can think of. First, if the objective function is some kind of distance for which the best possible value is zero, we can stop the algorithm when we are at zero. Or rather 'sufficiently close to zero'. What is sufficiently close? That totally depends on the problem. Second, sometimes the mechanics of a technique tell us that we can stop the algorithm. For instance, in the standard DE, new solutions are linear combinations of old solutions. Thus, if the population has converged such that all members are the same, we can stop. Again: 'the same' means 'roughly the same'; and what that means depends on your problem.2)
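
To make the second case more concrete, here is a sketch (the names converged, P and tol are made up for this illustration): suppose the population is stored as a matrix P with one solution per column; we may consider it converged when, for every decision variable, the spread across all members is below some tolerance tol – which, again, depends on your problem.

## a sketch: consider the population 'P' (one solution per
## column) converged when, for each decision variable, the
## spread across all members is below 'tol'
converged <- function(P, tol = 1e-6)
    all(apply(P, 1, function(p) max(p) - min(p)) < tol)

## example: a population whose members are all (roughly) the same
P <- matrix(1:5, nrow = 5, ncol = 20) + 1e-8 * runif(100)
converged(P)  ## TRUE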

So how long to run the algorithm?

Actually, it is quite simple. We set up the problem, fix the number of iterations to I1, and run the method, say, 20 times. For each run, we store the solution (most easily, we look at the objective function value). Now we increase the iterations to I2, and again run DE 20 times. For each setting, we check the obtained solutions (eg, we look at the distribution function of the objective function values); a sketch of such an experiment follows. This is not a formal method; but it gives us an idea of what the algorithm delivers for specific settings, and thus we can easily find settings that are good enough.
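
As a minimal sketch of such an experiment, the pattern could look like the following. It uses DEopt and restartOpt from the NMOF package; the objective function OFtest is made up for this illustration, and the settings in algo are only placeholders.

require("NMOF")

## a made-up objective function, only for illustration
OFtest <- function(x)
    sum(x * x) + sum(cos(10 * x))

algo <- list(nP = 20L,           ## population size
             nG = 100L,          ## number of generations ('I1')
             min = rep(-5, 10),  ## lower bounds
             max = rep( 5, 10),  ## upper bounds
             printDetail = FALSE, printBar = FALSE)

runsI1 <- restartOpt(DEopt, n = 20, OF = OFtest, algo = algo)

algo$nG <- 400L                  ## more generations ('I2')
runsI2 <- restartOpt(DEopt, n = 20, OF = OFtest, algo = algo)

## compare the distributions of the final objective
## function values
summary(sapply(runsI1, `[[`, "OFvalue"))
summary(sapply(runsI2, `[[`, "OFvalue"))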

John Tukey is often quoted for having said something like 'the best thing is when the data give you so clear a message that you cannot ignore it'. In my experience, this is exactly what happens when you run such experiments; furthermore, the obtained settings typically work well for the given model even when used with new data.

Solving Pat's example

The exact solution

Pat's example is to maximise a linear combination of portfolio return and portfolio variance, subject to a cardinality constraint. For such a small model, we can write down all possibilities and solve exactly.

We will try all possible combinations of assets, and for each combination solve the resulting quadratic programme with the quadprog package.
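
With a cardinality of at most 5 out of the 10 assets, there are only a few hundred subsets to check:

sum(choose(10, 1:5))  ## 637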

require("quadprog")
require("NMOF")

## 'qp' solves the problem for a given vector
## of return forecasts and a given matrix of
## variance forecasts
qp <- function(mv, cm) {
    d <- mv
    D <- 2*cm
    A <- rbind(1, diag(length(d)))
    bvec <- c(1, rep(0, length(d)))
    solve.QP(D, d, t(A), bvec, meq = 1L)$solution
}

## 'OF' gives the objective function value for a
## portfolio 'w'
OF <- function(w, mv, cm)
    sum(w * mv) - w %*% cm %*% w

We look at the A10s example, which seems to be the trickier one.

ofVals <- numeric(0L)  ## storing the results
assets <- NULL; weights <- NULL
mv <- A10s  ## mean return forecasts
cm <- V10   ## covariance matrix
nA <- 10L   ## number of assets

card <- 1L; sel <- combn(nA, card)
tmpA <- split(sel, rep(seq_len(ncol(sel)),
                       each = nrow(sel)))
tmpW <- vector("list", length = ncol(sel))
names(tmpW) <- names(tmpA) <- paste("c", card, "_",
                                    seq_len(ncol(sel)),
                                    sep = "")
tmpV <- numeric(ncol(sel))
for (i in seq_along(tmpV)) {
    ai <- sel[ ,i]
    tmpW[[i]] <- 1  ## a single asset gets full weight
    tmpV[i] <- OF(tmpW[[i]],
                  mv = as.vector(mv[ai]), cm = cm[ai,ai])
}
assets <-  c(assets, tmpA)
ofVals <-  c(ofVals, tmpV)
weights <- c(weights, tmpW)
cards <- 2:5
for (card in cards) {
    sel <- combn(nA, card)
    tmpA <- split(sel, rep(seq_len(ncol(sel)),
                           each = nrow(sel)))
    tmpW <- vector("list", length = ncol(sel))
    names(tmpW) <- names(tmpA) <- paste("c", card, "_",
                                        seq_len(ncol(sel)),
                                        sep = "")
    tmpV <- numeric(ncol(sel))

    for (i in seq_along(tmpV)) {
        ai <- sel[ ,i]
        tmpW[[i]] <- qp(mv = as.vector(mv[ai]),
                        cm = cm[ai,ai])
        tmpV[i] <- OF(tmpW[[i]],
                      mv = as.vector(mv[ai]),
                      cm = cm[ai,ai])
    }
    assets <-  c(assets, tmpA)
    ofVals <-  c(ofVals, tmpV)
    weights <- c(weights, tmpW)
}

The best objective function value:

round(max(ofVals), 4)
[1] -1.3897

The selected assets:

assets[[which.max(ofVals)]]
[1]  1  2  3  8 10

The weights:

round(weights[[which.max(ofVals)]], 2)
[1] 0.17 0.29 0.27 0.20 0.08

Thus, we have a benchmark solution. Now we can see how other methods fare.

Using Threshold Accepting

We use Threshold Accepting. First, we collect all pieces of information in a list to have them in one place (which makes it easier to pass them explicitly to the optimisation function; it is bad practice to rely on a function finding objects in the global environment).

Data <- list(wmax = 1,  ## the maximal weight
             Kmax = 5,  ## max cardinality
             eps = 0.5/100, ## the step size
             resample = function(x, ...)
                 x[sample.int(length(x), ...)],
             na = 10L,  ## number of assets
             cm = cm,   ## covariance matrix
             mv = mv)   ## mean return forecasts

Since there are 10 assets, we represent the solution as a numeric vector of length 10. We define a neighbourhood function: it changes a given portfolio slightly, but such that the new portfolio remains feasible.

neighbour <- function(w, Data) {
    tol <- 1e-12
    J <- sum(w > tol)      ## number of assets currently held
    if (J == Data$Kmax)    ## at max cardinality: only add to
        toBuy <- w > tol & w < Data$wmax  ## assets already held
    else
        toBuy <- w < Data$wmax
    toSell <- w > tol
    i <- Data$resample(which(toSell), size = 1L)
    j <- Data$resample(which(toBuy),  size = 1L)
    eps <- runif(1) * Data$eps
    eps <- min(w[i], Data$wmax - w[j], eps)  ## keep weights feasible
    w[i] <- w[i] - eps     ## shift weight from asset i to asset j
    w[j] <- w[j] + eps
    w
}

We create a random feasible solution.

makex0 <- function() {
    k <- sample.int(5, 1)        ## random cardinality (at most Kmax = 5)
    x0 <- numeric(10)
    x0[sample(1:10, k)] <- 1/k   ## k random assets, equally weighted
    x0
}
x0 <- makex0()

As an example, we move from the random x0 to a neighbour solution x1, and then from x1 to x2.

x1 <- neighbour(x0, Data)
x2 <- neighbour(x1, Data)
data.frame(x0 = round(x0, 4),
           x1 = round(x1, 4),
           x2 = round(x2, 4))

   x0     x1     x2
1   0 0.0000 0.0027
2   0 0.0000 0.0000
3   0 0.0000 0.0000
4   0 0.0000 0.0000
5   0 0.0000 0.0000
6   0 0.0000 0.0000
7   1 0.9952 0.9925
8   0 0.0048 0.0048
9   0 0.0000 0.0000
10  0 0.0000 0.0000

We run TAopt. Note that we redefine the objective function so that it takes Data as an argument and switches the sign (ie, we minimise). The run takes less than a second.

TAsettings <- list(neighbour = neighbour,
                   x0 = makex0, q = 0.9,
                   printBar = FALSE,
                   printDetail = FALSE)

OFTA <-  function (w, Data)
    -sum(w * Data$mv) + w %*% Data$cm %*% w

system.time(sol <- TAopt(OFTA,
                         algo = TAsettings, Data = Data))
   user  system elapsed
   0.64    0.00    0.64

The final value:

round(sol$OFvalue,4)
1.3897

We run the algorithm 100 times. (To speed up the process, we distribute these restarts.)

restarts <- restartOpt(TAopt, n = 100, OFTA,
                       algo = TAsettings, Data = Data,
                       method = "snow", cl = 4)
summary(sapply(restarts, `[[`, "OFvalue"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.390   1.390   1.390   1.422   1.468   1.608

Indeed, the solutions differ (even though more than half the runs return the optimum).

As said initially, heuristics can walk away from local minima; so we increase the number of iterations.

TAsettings$nS <- 2000  ## default is 1000
restarts <- restartOpt(TAopt, n = 100, OFTA,
                       algo = TAsettings, Data = Data,
                       method = "snow", cl = 4)
summary(sapply(restarts, `[[`, "OFvalue"))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.390   1.390   1.390   1.403   1.390   1.502

That looks better already. In particular, look at the mean value.

TAsettings$nS <- 4000  ## default is 1000
restarts <- restartOpt(TAopt, n = 100, OFTA,
                       algo = TAsettings, Data = Data,
                       method = "snow", cl = 4)
summary( sapply(restarts, `[[`, "OFvalue"))
quantile(sapply(restarts, `[[`, "OFvalue"), c(0.5, 0.90))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.390   1.390   1.390   1.390   1.390   1.401
     50%      90%
1.389657 1.389657

Now more than 90% of all runs return the optimum. We could increase the number of iterations further, but a better strategy is to restart the algorithm a few times and to keep only the best solution; see Gilli/Maringer/Schumann, 2011, for a discussion and many more details.

Such a best-of strategy can easily be applied by setting the best.only argument of restartOpt to TRUE.

TAsettings$nS <- 1000  ## default is 1000
bestof3 <- restartOpt(TAopt, n = 3, OFTA,
                       algo = TAsettings, Data = Data,
                       best.only = TRUE)
bestof3$OFvalue
         [,1]
[1,] 1.389657

Other stuff

DEopt

DEopt implements box constraints (you need to set minmaxConstr to TRUE). But there is a difference between how the population is initialised and how a solution is constrained.
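
As a sketch only (not a complete solution of Pat's problem): the box constraints on the weights could be passed to DEopt as follows. The budget and cardinality constraints would still need to be handled separately, for instance through DEopt's pen or repair arguments; OFDE is a made-up name for such a penalised objective function.

algoDE <- list(nP = 100L, nG = 1000L,  ## placeholder settings
               min = rep(0, Data$na),  ## lower bounds on weights
               max = rep(1, Data$na),  ## upper bounds on weights
               minmaxConstr = TRUE,    ## enforce min/max during the run
               printDetail = FALSE, printBar = FALSE)
## solDE <- DEopt(OF = OFDE, algo = algoDE, Data = Data)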

Measuring time

I don't like Windows either :-) (but the performance of sorting algorithms typically varies strongly with the 'sortedness' of the input).

References

Gilli, M., D. Maringer and E. Schumann (2011). Numerical Methods and Optimization in Finance. Academic Press.

Footnotes:

1 What would be wrong with unnecessary numeric precision? Two things: it will typically cost us in terms of computing time or time-to-develop; worse still, too much precision may give us a sense of exactitude that is simply not warranted by financial models.

2 I do not talk about numeric precision here, but the precision that is adequate for a given problem. I am not aware of a financial model that needs to be solved to six decimals.
