Backtests with time-varying asset universes

In this note we'll see how to deal with a particular case of missing values: when certain assets are available only at certain times.

We first get some data: time-series of industry portfolios from Kenneth French's website at https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/. The dataset comprises 30 series of daily data, and we use a subset that starts in January 1990.

library("NMOF")
library("zoo")
P <- French(dest.dir = tempdir(),
              "30_Industry_Portfolios_daily_CSV.zip",
              price.series = TRUE,            
              na.rm = TRUE)

P <- zoo(P, as.Date(row.names(P)))
P <- window(P, start = as.Date("1990-1-1"))
str(P)
‘zoo’ series from 1990-01-02 to 2020-08-31
  Data: num [1:7727, 1:30] 808 803 797 791 791 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:7727] "1990-01-02" "1990-01-03" "1990-01-04" ...
  ..$ : chr [1:30] "Food" "Beer" "Smoke" "Games" ...
  Index:  Date[1:7727], format: "1990-01-02" "1990-01-03" ...

Actually, the data are complete: there are no missing values.

any(is.na(P))
[1] FALSE

So let us make them incomplete: in series 16 to 30, we remove all data before January 2000.

window(P[, 16:30], end = as.Date("1999-12-31")) <- NA
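
A quick way to confirm which series are incomplete is to count the missing values per column. The following is a minimal sketch on a toy matrix M (an assumption, made up for illustration); with the real data, the same calls could be applied to coredata(P).

```r
## toy data: 4 observations of 3 series; 'B' and 'C' start late
M <- cbind(A = c(1, 2, 3, 4),
           B = c(NA, NA, 3, 4),
           C = c(NA, 2, 3, 4))

## number of missing values per series
n.missing <- colSums(is.na(M))
n.missing

## row number of the first available observation per series
first.obs <- apply(!is.na(M), 2, function(x) which(x)[1])
first.obs
```

Series with a zero count are complete; for the others, first.obs tells us where the data start.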

The key feature of btest for handling such data is this: if an asset is not selected (i.e. has a zero position), it is not required for valuing the position, and so it can be missing. Suppose we wanted to simulate a 50/50 investment in only the first two series (which, we know, are complete). With btest, we could do it as follows.

library("PMwR")
bt <- btest(prices = list(coredata(P)),
            timestamp = index(P),
            signal = function() {
              w <- numeric(ncol(Close()))
              w[1:2] <- c(0.5, 0.5)
              w
            },
            do.signal = "lastofquarter",
            convert.weights = TRUE,
            initial.cash = 100)
head(journal(bt), n = 10, by = FALSE)

As you can see, the function does not complain. If you check the journal, you'll find that all transactions have been in Food and Beer, the first two industries.

    instrument   timestamp         amount      price
1         Food  1990-03-30   0.0659531693   759.8573
2         Beer  1990-03-30   0.0335054119  1481.5517
3         Food  1990-06-29   0.0026870029   843.8351
4         Beer  1990-06-29  -0.0011305346  1775.1047
5         Food  1990-09-28  -0.0014578071   775.9197
6         Beer  1990-09-28   0.0007077629  1575.3859
7         Food  1990-12-31   0.0008239410   882.5049
8         Beer  1990-12-31  -0.0003957095  1824.9844
9         Food  1991-03-28  -0.0004120411  1081.1665
10        Beer  1991-03-28   0.0001984854  2237.6230

10 transactions
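
Why a missing price does no harm as long as the position is zero can be seen in a few lines of base R. This is only a sketch of the idea, with made-up positions and prices; it is not meant to describe how btest is implemented internally.

```r
## made-up holdings and prices; 'Oil' has no price, but is not held
position <- c(Food = 0.05, Beer = 0.02, Oil = 0)
price    <- c(Food = 1000, Beer = 2500, Oil = NA)

## naive valuation fails: 0 * NA is NA in R
naive <- sum(position * price)
naive   ## NA

## value only those assets that are actually held
held  <- position != 0
value <- sum(position[held] * price[held])
value   ## 100
```

So the missing Oil price only becomes a problem once we want to hold Oil.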

Now we can start the actual example. The aim in this exercise is to compute a minimum-variance portfolio over all available assets. We begin by defining when certain assets were available, and placing this information in a data-frame active.

active <- data.frame(instrument = colnames(P),
                     start = c(rep(as.Date("1990-1-1"), 15),
                               rep(as.Date("2001-1-1"), 15)),
                     end = tail(index(P), 1))
active
   instrument      start        end
1        Food 1990-01-01 2020-08-31
2        Beer 1990-01-01 2020-08-31
3       Smoke 1990-01-01 2020-08-31
4       Games 1990-01-01 2020-08-31
5       Books 1990-01-01 2020-08-31
6       Hshld 1990-01-01 2020-08-31
7       Clths 1990-01-01 2020-08-31
8        Hlth 1990-01-01 2020-08-31
9       Chems 1990-01-01 2020-08-31
10      Txtls 1990-01-01 2020-08-31
11      Cnstr 1990-01-01 2020-08-31
12      Steel 1990-01-01 2020-08-31
13      FabPr 1990-01-01 2020-08-31
14      ElcEq 1990-01-01 2020-08-31
15      Autos 1990-01-01 2020-08-31
16      Carry 2001-01-01 2020-08-31
17      Mines 2001-01-01 2020-08-31
18       Coal 2001-01-01 2020-08-31
19        Oil 2001-01-01 2020-08-31
20       Util 2001-01-01 2020-08-31
21      Telcm 2001-01-01 2020-08-31
22      Servs 2001-01-01 2020-08-31
23      BusEq 2001-01-01 2020-08-31
24      Paper 2001-01-01 2020-08-31
25      Trans 2001-01-01 2020-08-31
26      Whlsl 2001-01-01 2020-08-31
27      Rtail 2001-01-01 2020-08-31
28      Meals 2001-01-01 2020-08-31
29        Fin 2001-01-01 2020-08-31
30      Other 2001-01-01 2020-08-31

Note that we set start to 2001, not 2000. You'll see why shortly.

Now for the signal function. It receives active as an argument.

mv <- function(active) {

  ## find those assets that are active
  ## ==> 'j' is a logical vector that
  ##         indicates the active assets
  j <- Timestamp() >= active[["start"]] &
       Timestamp() <= active[["end"]]


  ## get last 260 prices of active assets and compute
  ## variance--covariance matrix
  P.j <- Close(n = 260)[, j]
  R.j <- returns(P.j)
  S <- cov(R.j)


  ## compute minimum-variance weights
  w.j <- NMOF::minvar(S, wmin = 0, wmax = 0.10)


  ## create a zero-vector with length equal to number
  ## of total assets and assign the weights at
  ## appropriate positions
  w <- numeric(length(j))
  w[j] <- w.j
  w  
}
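
The two bookkeeping steps in mv (finding the active assets, and placing their weights into a vector that covers all assets) can be tried outside of btest. In the sketch below, a fixed date stands in for Timestamp(), equal weights stand in for the result of minvar, and the helpers is_active and scatter are hypothetical names introduced only for this illustration.

```r
## a small version of 'active' with three assets
active <- data.frame(instrument = c("Food", "Beer", "Oil"),
                     start = as.Date(c("1990-01-01",
                                       "1990-01-01",
                                       "2001-01-01")),
                     end = as.Date("2020-08-31"),
                     stringsAsFactors = FALSE)

## hypothetical helper: which assets are active on date 'when'?
is_active <- function(when, active)
  when >= active[["start"]] & when <= active[["end"]]

## hypothetical helper: put weights of the active assets
## into a vector with one element per asset
scatter <- function(w.j, j) {
  w <- numeric(length(j))
  w[j] <- w.j
  w
}

## in 1995, only Food and Beer are active
j <- is_active(as.Date("1995-06-30"), active)
w.1995 <- scatter(rep(1/sum(j), sum(j)), j)
w.1995

## in 2005, all three assets are active
j <- is_active(as.Date("2005-06-30"), active)
w.2005 <- scatter(rep(1/sum(j), sum(j)), j)
w.2005
```

Inactive assets always receive a weight of zero, which is what allows their prices to be missing.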

Now you see why we used 2001 as the start date for series 16 to 30: we'll use one year of historical data to compute the variance-covariance matrix. (Note that there are better ways to come up with forecasts of the variance-covariance matrix, e.g. methods that apply shrinkage. But the purpose of this note is to show how to handle missing values in btest, not to discuss empirical methods.)
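
The date arithmetic can be checked with a rough sketch. We assume here that weekdays (Monday to Friday) are a good-enough stand-in for trading days; the actual calendar has holidays, so the true window reaches back a little further. The helper window_start is a hypothetical name, introduced only for this illustration.

```r
## all weekdays (Mon-Fri) in the relevant period;
## an approximation of trading days (holidays are ignored)
days <- seq(as.Date("1999-01-01"), as.Date("2001-03-31"), by = "day")
weekdays.only <- days[as.POSIXlt(days)$wday %in% 1:5]

## hypothetical helper: first day of a window of 'n' weekdays
## that ends on 'signal.date'
window_start <- function(signal.date, n = 260) {
  upto <- weekdays.only[weekdays.only <= signal.date]
  upto[length(upto) - n + 1]
}

## a signal at the end of March 2000 would need data from 1999,
## which we removed for series 16 to 30
start.2000 <- window_start(as.Date("2000-03-31"))
start.2000 < as.Date("2000-01-01")    ## TRUE

## a signal at the end of March 2001 only needs data from 2000
start.2001 <- window_start(as.Date("2001-03-30"))
start.2001 >= as.Date("2000-01-01")   ## TRUE
```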

We call btest.

bt.mv <- btest(prices = list(coredata(P)),
               timestamp = index(P),
               signal = mv,
               do.signal = "lastofquarter",
               convert.weights = TRUE,
               initial.cash = 100,
               active = active,
               b = 260)
bt.mv
initial wealth 100  =>  final wealth  1652.74 
Total return   1552.7%

The backtest runs without problems. As an example, let us check trades in industry Oil.

head(journal(bt.mv)["Oil"], 5)
   instrument   timestamp         amount     price
1         Oil  2001-03-30   0.0104934366  2656.871
2         Oil  2001-06-29  -0.0003607878  2709.119
3         Oil  2001-09-28   0.0011873853  2383.685
4         Oil  2001-12-31  -0.0043576713  2549.018
5         Oil  2002-03-28  -0.0037902744  2807.207

5 transactions

As expected, the first trades occur only in 2001.

A final remark: we would not have needed to prepare active upfront. Instead, we could have checked for missing values in the signal function.

mv_with_NA_check <- function() {

  ## fetch data and check for missing values
  P <- Close(n = 260)
  j <- !apply(P, 2, anyNA)

  ## keep only the assets without missing values in the
  ## last 260 prices and compute variance--covariance matrix
  P.j <- P[, j]
  R.j <- returns(P.j)
  S <- cov(R.j)

  ## compute minimum-variance weights
  w.j <- NMOF::minvar(S, wmin = 0, wmax = 0.10)

  ## create a zero-vector with length equal to number
  ## of total assets and assign the weights at
  ## appropriate positions
  w <- numeric(length(j))
  w[j] <- w.j
  w  
}
bt.mv2 <- btest(prices = list(coredata(P)),
                timestamp = index(P),
                signal = mv_with_NA_check,
                do.signal = "lastofquarter",
                convert.weights = TRUE,
                initial.cash = 100,
                b = 260)
bt.mv2
head(journal(bt.mv2)["Oil"], 5)
initial wealth 100  =>  final wealth  1652.74 
Total return   1552.7%

   instrument   timestamp         amount     price
1         Oil  2001-03-30   0.0104934366  2656.871
2         Oil  2001-06-29  -0.0003607878  2709.119
3         Oil  2001-09-28   0.0011873853  2383.685
4         Oil  2001-12-31  -0.0043576713  2549.018
5         Oil  2002-03-28  -0.0037902744  2807.207

5 transactions

We get the same results. But defining an explicit list is more, well, explicit, which is often a good thing when analysing data, notably because it sets an expectation that those active time-series don't have missing values.
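
Such an expectation can also be checked. The following sketch uses toy data and a hypothetical helper, complete_when_active, which is not part of PMwR; it reports for every instrument whether its prices are complete between start and end.

```r
## toy prices: 6 days, 2 assets; 'B' is missing on the first 3 days
dates  <- as.Date("2000-01-03") + 0:5
prices <- cbind(A = c(100, 101, 102, 103, 104, 105),
                B = c( NA,  NA,  NA,  50,  51,  52))

active <- data.frame(instrument = c("A", "B"),
                     start = c(dates[1], dates[4]),
                     end = dates[6],
                     stringsAsFactors = FALSE)

## hypothetical helper: TRUE for each instrument whose prices
## have no missing values between its 'start' and 'end'
complete_when_active <- function(prices, dates, active) {
  ans <- logical(nrow(active))
  names(ans) <- active[["instrument"]]
  for (i in seq_len(nrow(active))) {
    in.range <- dates >= active[["start"]][i] &
                dates <= active[["end"]][i]
    ans[i] <- !anyNA(prices[in.range, active[["instrument"]][i]])
  }
  ans
}

complete_when_active(prices, dates, active)
```

If any element were FALSE, either the data or the entries in active would need another look before running the backtest.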