I have a large dataset organized in listcolumns (20Gb).
1-Is there any other tricks I can use to speed up the following code? Data.table seems to provide a two-fold increase, but I assume there are other tricks that can be used in similar scenarios.
2-Apart from saveRDS - is there any other faster file-libraries (vroom, fst and fwrite do not seem to support listcols) supporting listcolumns?
3-I tried dt[,.(test=tib_sort[tib_sort[, .I[.N]], stringi::stri_sub(dt$ch, length = 5)],by=id)]but it throws an incorrect number of dimensions error. Is there an option to do by using a listcolumn and automatically have it setDT and key/index?
library(dplyr)
library(purrr)
library(data.table)
library(tictoc)
Toy data
set.seed(123)
tib <-
tibble(id = 1:20000) %>% mutate(k = map_int(id, ~ sample(c(10:30), 1)))
tib <-
tib %>% mutate(tib_df = map(k, ~ tibble(
ch = replicate(.x, paste0(
sample(letters[1:24],
size = sample(c(10:20), 1)),
collapse = ""
)),
num = sample(1:1e10, .x,replace = F)
)))
Dplyr
help <- function(df) {
df <- df %>% top_n(1, num) %>% select(ch)
stringi::stri_sub(df, length = 5)
}
tic("purrr")
tib <- tib %>% mutate(result = map_chr(tib_df, help))
toc(log = T, quiet = T)
Data.table
dt <- copy(tib)
setDT(dt)
tic("setDT w key")
dt[, tib_df := lapply(tib_df, setDT)]
dt[, tib_sort := lapply(tib_df, function(x)
setindex(x, "num"))]
toc(log = T, quiet = T)
tic("dt w key")
dt[, result_dt_key := sapply(tib_df, function(x) {
x[x[, .I[.N]], stringi::stri_sub(ch, length = 5)]
})]
toc(log=T, quiet = T)
Timing
tic.log(format = T)
[[1]]
[1] "purrr: 25.499 sec elapsed"
[[2]]
[1] "setDT w key: 4.875 sec elapsed"
[[3]]
[1] "dt w key: 12.077 sec elapsed"
EDITS and update after also including unnested versions from both dplyr and data.table
1 purrr: 25.824 sec elapsed
2 setDT wo key: 2.97 sec elapsed
3 dt wo key: 13.724 sec elapsed
4 setDT w key: 1.778 sec elapsed
5 dt w key: 11.489 sec elapsed
6 dplyr,unnest: 1.496 sec elapsed
7 dt,I,unnest: 0.329 sec elapsed
8 dt, join, unnest: 0.325 sec elapsed
tic("dt, join, unnest")
b <- unnest(tib)
setDT(b)
unnest.J <- b[b[, .(num=max(num)), by = 'id'], on=c('id','num')][,r2:=stringi::stri_sub(ch,length=5)][]
toc(log=T)
res <- list(unnest.J$r2,tib2$result2,dt$result_dt_key,dt$result_dt,tib$result)
sapply(res,identical,unnest.I$r2)
[1] TRUE TRUE TRUE TRUE TRUE
So- I guess the lesson learnt is that although listcolumns look deceivingly tempting as data structures for analysis, they are much slooower.

unnest.