Group uncorrelated variables into subsets using correlation matrix

Question

I am trying to group uncorrelated variables into subsets. Using the correlation matrix, I check each variable against the variables already placed in each subset. If its correlation with any member of a subset is at or above a threshold, I move on to the next subset; otherwise I add it to that subset. If it fits none of the existing subsets, I create a new one. At the end, no two variables in the same subset are correlated. I have written the code below and it works fine. However, when the number of variables is high (> 20,000), it takes more than two hours to run. Is there any way to make it faster, or to do some of the operations in parallel?

corr <- matrix(c(1,    0.9, 0,   0.83, 0.9,
                 0.9,  1,   0.2, 0.9,  0.1,
                 0,    0.2, 1,   0.1,  0.9,
                 0.83, 0.9, 0.1, 1,    0.9,
                 0.9,  0.1, 0.9, 0.9,  1),
               5, 5, byrow = TRUE)

rownames(corr) <- colnames(corr) <- LETTERS[1:5]
#corr <- cor(t(dataset)) %>% abs()
vars <- rownames(corr)

list_data <- list() # must exist before the first assignment
list_data[[1]] <- vars[1]
for(i in 2:length(vars)){
  message(vars[i])
  added <- 1
  for(j in 1:length(list_data)){
    cur_list <- list_data[[j]]
    flag <- 1
    for(k in 1:length(cur_list)){
      corr_data <- corr[vars[i], cur_list[k]]
      if(corr_data >= 0.8){
        flag <- 0
        break
      }
    }
    if(flag == 0) next
    else {
      list_data[[j]] <- c(cur_list, vars[i])
      added <- 0
      break
    }
  }
  if(added == 1) list_data[[j+1]] <- vars[i]
}

I have added example input data with five variables. In my real data the number of variables is around 21,000, which makes the code really slow.

minem · Accepted Answer · 2023-01-31 08:51:04Z

rownames(corr) <- colnames(corr) <- 1:ncol(corr)
vars <- rownames(corr)
vars <- as.integer(vars)

list_data2 <- list()
list_data2[[1]] <- vars[1]

t1 <- proc.time()
for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- corr[vars[i], ]
  for (j in 1:length(list_data2)) {
    cur_list <- list_data2[[j]]
    flag <- 1L
    for (k in 1:length(cur_list)) {
      corr_data <- corr2[cur_list[k]]
      if (corr_data >= 0.8) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data2[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data2[[j + 1L]] <- vars[i]
}

Don't use column/row names to subset the matrix; use integers (the positions of the columns/rows).
We can subset the row once in the outer loop (line: corr2 <- corr[vars[i], ]).
Afterwards we can get the names back from the indexes, if needed:

your_names <- paste0('v', seq_len(ncol(corr))) # example names, one per variable
name_list <- lapply(list_data2, function(x) your_names[x])
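
As a small worked example of that mapping, the sketch below uses the five-variable matrix from the question. The index groups are the ones the greedy loop yields for that matrix with the 0.8 threshold. The helper is_valid_group is a hypothetical name introduced here, not part of the answer; it only checks that no pair inside a group reaches the threshold.

```r
# Example matrix from the question (variables A..E)
corr <- matrix(c(1,    0.9, 0,   0.83, 0.9,
                 0.9,  1,   0.2, 0.9,  0.1,
                 0,    0.2, 1,   0.1,  0.9,
                 0.83, 0.9, 0.1, 1,    0.9,
                 0.9,  0.1, 0.9, 0.9,  1),
               5, 5, byrow = TRUE)
var_names <- LETTERS[1:5]

# Integer index groups that the greedy loop produces for this matrix
index_groups <- list(c(1L, 3L), c(2L, 5L), 4L)

# Map positions back to variable names
name_list <- lapply(index_groups, function(idx) var_names[idx])

# Hypothetical helper: TRUE if no off-diagonal pair in the group
# has a correlation at or above the threshold
is_valid_group <- function(idx, corr, threshold = 0.8) {
  sub <- corr[idx, idx, drop = FALSE]
  diag(sub) <- 0 # ignore each variable's correlation with itself
  all(sub < threshold)
}

all_valid <- all(vapply(index_groups, is_valid_group, logical(1), corr = corr))
```

Here name_list holds the groups A, C; B, E; and D, and all_valid is TRUE. On a 20k-variable matrix a check like this is a cheap way to confirm that a faster rewrite still produces valid groups.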

Update

Another huge improvement is to do the comparison once, outside the loop, and to remove the names from the resulting matrix, because subsetting unnamed matrices/vectors is much faster.

vars <- 1:ncol(corr)

list_data3 <- list()
list_data3[[1]] <- vars[1]

t1 <- proc.time()
compar <- unname(corr) >= 0.8 # do comparison outside loop

for (i in 2:length(vars)) {
  added <- 1L
  corr2 <- compar[vars[i], ]
  for (j in 1:length(list_data3)) {
    cur_list <- list_data3[[j]]
    flag <- 1L
    for (k in seq_along(cur_list)) { # little bit faster
      corr_data <- corr2[cur_list[k]]
      if (corr_data) {
        flag <- 0L
        break
      }
    }
    if (flag == 0L) next
    else {
      list_data3[[j]] <- c(cur_list, vars[i])
      added <- 0L
      break
    }
  }
  if (added == 1L) list_data3[[j + 1L]] <- vars[i]
}
t2 <- proc.time()
(t2 - t1)[3] # ~10 sec for 20k*20k symmetric matrix
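
The innermost k loop only asks whether any member of the current group is flagged against variable i, so it could also be written as a single any() call on the precomputed logical row. The following is a minimal sketch under that assumption, not the code from the answer: the function name group_uncorrelated and its threshold argument are hypothetical, and it has not been benchmarked against the timing above.

```r
# Greedy grouping: each variable joins the first group in which it is
# not flagged (correlation >= threshold) against any existing member.
group_uncorrelated <- function(corr, threshold = 0.8) {
  compar <- unname(corr) >= threshold # comparison done once, names dropped
  n <- ncol(compar)
  if (n == 0L) return(list())

  groups <- list(1L) # the first variable starts the first group
  for (i in seq_len(n)[-1]) {
    row_i <- compar[i, ]
    placed <- FALSE
    for (j in seq_along(groups)) {
      # any() replaces the explicit loop over group members
      if (!any(row_i[groups[[j]]])) {
        groups[[j]] <- c(groups[[j]], i)
        placed <- TRUE
        break
      }
    }
    if (!placed) groups[[length(groups) + 1L]] <- i
  }
  groups
}

# Usage with the five-variable example from the question
corr <- matrix(c(1,    0.9, 0,   0.83, 0.9,
                 0.9,  1,   0.2, 0.9,  0.1,
                 0,    0.2, 1,   0.1,  0.9,
                 0.83, 0.9, 0.1, 1,    0.9,
                 0.9,  0.1, 0.9, 0.9,  1),
               5, 5, byrow = TRUE)
groups <- group_uncorrelated(corr)
```

For the example matrix this gives the index groups 1, 3; 2, 5; and 4, the same grouping as the explicit loops. Whether any() beats the early-exit k loop on real data depends on how large the groups grow, so it is worth timing both on a subset first.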
