Sampling with a rank-deficient variance–covariance matrix

We wish to create m random-samples of n Gaussian variates that follow a specific variance–covariance matrix Σ. The canonical way to do that (and, unfortunately, often the only one that is described in textbooks) is to

  1. create an m × n matrix X of standard Gaussian variates without any specific correlation,
  2. compute the Cholesky factor of the desired variance–covariance matrix, and
  3. post-multiply X by this factor.

If Σ is rank-deficient, we can use a factorisation of Σ that does not require full-rank, such as the eigenvalue decomposition (see GMS).

The downside of an alternative decomposition is that it might be expensive. Also, since we want to sample, we would create more random numbers than we actually need.

Let us start with an example. We call our original data-matrix M.

nc <- 3
nr <- 10
M <- array(rnorm(nr*nc), dim = c(nr, nc))

C <- array(0.5, dim = c(nc,nc))
diag(C) <- 1
M <- M %*% chol(C)
M <- M[ ,c(1,1,1,2,3)]
head(M, 3)
 
      [,1]   [,2]   [,3]   [,4]  [,5]
[1,] -0.694 -0.694 -0.694 -0.261 0.446
[2,] -1.808 -1.808 -1.808  1.381 1.581
[3,]  0.687  0.687  0.687  1.316 2.186

M is rank-deficient by construction, since columns 2 and 3 are actually copies of column 1. If we were to compute a correlation matrix from these columns, we would only need a three by three matrix.

The function colSubset will do all the work (this function is incluced in the NMOF package).

colSubset <- function(x) {

    nr <- dim(x)[1L]
    nc <- dim(x)[2L]    
    QR <- qr(x)

    if ((qrank <- QR$rank) == nc) {
        ans <- list(columns = seq_len(nc),
                    multiplier = diag(nc))
    } else {
        cols  <- QR$pivot[seq_len(qrank)]
        cols_ <- QR$pivot[(qrank + 1L):nc]
        D <- array(0, dim = c(qrank, nc))
        D[cbind(seq_along(cols), cols)] <- 1
        D[ ,cols_] <- qr.solve(x[ ,cols], x[ ,cols_])
        ans <- list(columns = cols,
                    multiplier = D)
    }
    ans
}
(tmp <- colSubset(M))
$columns
[1] 1 4 5

$multiplier
     [,1]      [,2]      [,3] [,4] [,5]
[1,]    1  1.00e+00  1.00e+00    0    0
[2,]    0 -1.22e-17 -1.22e-17    1    0
[3,]    0 -5.83e-17 -5.83e-17    0    1

It returns a list of two components: a full-rank subset of columns that spans the column space of the original matrix and a matrix multiplier.

To create random numbers that match the correlation structure of M, we create random numbers as in the textbook version, but we use a correlation matrix that uses only the column subset. Then we postmultiply by multiplier.

C <- cor(M[ ,tmp$columns])  
nc <- ncol(C)
nr <- 100
X <- array(rnorm(nr*nc), dim = c(nr, nc))
X <- X %*% chol(C)
X <- X %*% tmp$multiplier
round(cor(X), 1)
 
     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0  1.0  1.0 -0.1  0.3
[2,]  1.0  1.0  1.0 -0.1  0.3
[3,]  1.0  1.0  1.0 -0.1  0.3
[4,] -0.1 -0.1 -0.1  1.0  0.3
[5,]  0.3  0.3  0.3  0.3  1.0

(the tangled R-code)

return to main page...

return to R page...