Sampling with a rank-deficient variance–covariance matrix
We wish to create m random-samples of n Gaussian variates that follow a specific variance–covariance matrix Σ. The canonical way to do that (and, unfortunately, often the only one that is described in textbooks) is to
- create an m × n matrix X of standard Gaussian variates without any specific correlation,
- compute the Cholesky factor of the desired variance–covariance matrix, and
- post-multiply X by this factor.
If Σ is rank-deficient, we can use a factorisation of Σ that does not require full-rank, such as the eigenvalue decomposition (see GMS).
The downside of an alternative decomposition is that it might be expensive. Also, since we want to sample, we would create more random numbers than we actually need.
Let us start with an example. We call our original data-matrix M.
nc <- 3 nr <- 10 M <- array(rnorm(nr*nc), dim = c(nr, nc)) C <- array(0.5, dim = c(nc,nc)) diag(C) <- 1 M <- M %*% chol(C) M <- M[ ,c(1,1,1,2,3)] head(M, 3)
[,1] [,2] [,3] [,4] [,5] [1,] -0.694 -0.694 -0.694 -0.261 0.446 [2,] -1.808 -1.808 -1.808 1.381 1.581 [3,] 0.687 0.687 0.687 1.316 2.186
M is rank-deficient by construction, since columns 2 and 3 are actually copies of column 1. If we were to compute a correlation matrix from these columns, we would only need a three by three matrix.
The function colSubset
will do all the work (this function is
incluced in the NMOF package).
colSubset <- function(x) { nr <- dim(x)[1L] nc <- dim(x)[2L] QR <- qr(x) if ((qrank <- QR$rank) == nc) { ans <- list(columns = seq_len(nc), multiplier = diag(nc)) } else { cols <- QR$pivot[seq_len(qrank)] cols_ <- QR$pivot[(qrank + 1L):nc] D <- array(0, dim = c(qrank, nc)) D[cbind(seq_along(cols), cols)] <- 1 D[ ,cols_] <- qr.solve(x[ ,cols], x[ ,cols_]) ans <- list(columns = cols, multiplier = D) } ans }
(tmp <- colSubset(M))
$columns [1] 1 4 5 $multiplier [,1] [,2] [,3] [,4] [,5] [1,] 1 1.00e+00 1.00e+00 0 0 [2,] 0 -1.22e-17 -1.22e-17 1 0 [3,] 0 -5.83e-17 -5.83e-17 0 1
It returns a list of two components: a full-rank subset of columns
that spans the column space of the original matrix and a matrix
multiplier
.
To create random numbers that match the correlation structure of M,
we create random numbers as in the textbook version, but we use a
correlation matrix that uses only the column subset. Then we
postmultiply by multiplier
.
C <- cor(M[ ,tmp$columns]) nc <- ncol(C) nr <- 100 X <- array(rnorm(nr*nc), dim = c(nr, nc)) X <- X %*% chol(C) X <- X %*% tmp$multiplier round(cor(X), 1)
[,1] [,2] [,3] [,4] [,5] [1,] 1.0 1.0 1.0 -0.1 0.3 [2,] 1.0 1.0 1.0 -0.1 0.3 [3,] 1.0 1.0 1.0 -0.1 0.3 [4,] -0.1 -0.1 -0.1 1.0 0.3 [5,] 0.3 0.3 0.3 0.3 1.0