I am trying to group uncorrelated variables into subsets. Using the correlation matrix, I compare each variable against the members of each existing list: if its correlation with any member is at or above a threshold (0.8 here), I try the next list, and if no list fits I create a new one; otherwise I add the variable to that list. At the end, the variables within each subset are not correlated with one another. I have written the code below and it works fine. However, when the number of variables is high (> 20,000), it takes more than two hours to run. Is there any way to make it faster, or to do some of the operations in parallel?
corr <- matrix(c(1,0.9,0,0.83,0.9,0.9,1,0.2,0.9,0.1,0,0.2,1,0.1,0.9,0.83,0.9,0.1,1,0.9,0.9,0.1,0.9,0.9,1), 5,5, byrow = TRUE)
rownames(corr) <- colnames(corr) <- LETTERS[1:5]
#corr <- cor(t(dataset)) %>% abs()
vars <- rownames(corr)
list_data <- list()  # must exist before it can be assigned into
list_data[[1]] <- vars[1]
for(i in 2:length(vars)){
  message(vars[i])
  added <- 1
  for(j in 1:length(list_data)){
    cur_list <- list_data[[j]]
    flag <- 1
    for(k in 1:length(cur_list)){
      corr_data <- corr[vars[i], cur_list[k]]
      if(corr_data >= 0.8){
        flag <- 0
        break
      }
    }
    if(flag == 0) next
    else {
      list_data[[j]] <- c(cur_list, vars[i])
      added <- 0
      break
    }
  }
  if(added == 1) list_data[[j+1]] <- vars[i]
}
I have added example input data with five variables. In my real data, the number of variables is around 21,000, which makes the code really slow.
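For reference, the same greedy grouping can be written more compactly by replacing the innermost loop over `k` with a single vectorized comparison and by indexing the matrix with integers instead of names. This is only a minimal sketch: the function name `group_uncorrelated` and its `threshold` argument are made up for illustration, and whether it is fast enough for around 21,000 variables has not been measured.

```r
# Example matrix from the question
corr <- matrix(c(1,0.9,0,0.83,0.9,
                 0.9,1,0.2,0.9,0.1,
                 0,0.2,1,0.1,0.9,
                 0.83,0.9,0.1,1,0.9,
                 0.9,0.1,0.9,0.9,1), 5, 5, byrow = TRUE)
rownames(corr) <- colnames(corr) <- LETTERS[1:5]

# Hypothetical helper (not from any package): greedy first-fit grouping.
# Each variable joins the first group in which it is below `threshold`
# against every current member; otherwise it starts a new group.
group_uncorrelated <- function(corr, threshold = 0.8) {
  n <- nrow(corr)
  groups <- list(1L)                      # groups hold integer row indices
  for (i in seq_len(n)[-1]) {
    placed <- FALSE
    for (j in seq_along(groups)) {
      # One vectorized check replaces the loop over group members
      if (!any(corr[i, groups[[j]]] >= threshold)) {
        groups[[j]] <- c(groups[[j]], i)
        placed <- TRUE
        break
      }
    }
    if (!placed) groups[[length(groups) + 1L]] <- i
  }
  # Translate indices back to variable names
  lapply(groups, function(idx) rownames(corr)[idx])
}

result <- group_uncorrelated(corr)
# result is list(c("A", "C"), c("B", "E"), "D")
```

On the five-variable example this produces the same three groups as the original loops. The loop over `i` is inherently sequential, because each placement depends on the groups built so far, so it is the per-group check rather than the outer loop that lends itself to vectorization.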