3

I have a large dataset organized in listcolumns (20Gb).

1-Is there any other tricks I can use to speed up the following code? Data.table seems to provide a two-fold increase, but I assume there are other tricks that can be used in similar scenarios.

2-Apart from saveRDS - is there any other faster file-libraries (vroom, fst and fwrite do not seem to support listcols) supporting listcolumns?

3-I tried dt[,.(test=tib_sort[tib_sort[, .I[.N]], stringi::stri_sub(dt$ch, length = 5)],by=id)]but it throws an incorrect number of dimensions error. Is there an option to do by using a listcolumn and automatically have it setDT and key/index?

library(dplyr)
library(purrr)
library(data.table)
library(tictoc)

Toy data

set.seed(123)
tib <-
  tibble(id = 1:20000) %>% mutate(k = map_int(id,  ~ sample(c(10:30), 1)))
tib <-
  tib %>% mutate(tib_df = map(k,  ~ tibble(
    ch = replicate(.x, paste0(
      sample(letters[1:24],
             size = sample(c(10:20), 1)),
      collapse = ""
    )),
    num = sample(1:1e10, .x,replace = F)
  )))

Dplyr

help <- function(df) {
  df <- df %>% top_n(1, num) %>% select(ch)
  stringi::stri_sub(df, length = 5)
}
tic("purrr")
tib <- tib %>% mutate(result = map_chr(tib_df, help))
toc(log = T, quiet = T)

Data.table

dt <- copy(tib)
setDT(dt)
tic("setDT w key")
dt[, tib_df := lapply(tib_df, setDT)]
dt[, tib_sort := lapply(tib_df, function(x)
  setindex(x, "num"))]
toc(log = T, quiet = T)
tic("dt w key")
dt[, result_dt_key := sapply(tib_df, function(x) {
  x[x[, .I[.N]], stringi::stri_sub(ch, length = 5)]
})]
toc(log=T, quiet = T)

Timing

    tic.log(format = T)
[[1]]
[1] "purrr: 25.499 sec elapsed"

[[2]]
[1] "setDT w key: 4.875 sec elapsed"

[[3]]
[1] "dt w key: 12.077 sec elapsed"

EDITS and update after also including unnested versions from both dplyr and data.table

1 purrr: 25.824 sec elapsed          
2 setDT wo key: 2.97 sec elapsed     
3 dt wo key: 13.724 sec elapsed      
4 setDT w key: 1.778 sec elapsed     
5 dt w key: 11.489 sec elapsed       
6 dplyr,unnest: 1.496 sec elapsed    
7 dt,I,unnest: 0.329 sec elapsed     
8 dt, join, unnest: 0.325 sec elapsed

tic("dt, join, unnest")
b <- unnest(tib)
setDT(b)
unnest.J <- b[b[, .(num=max(num)), by = 'id'], on=c('id','num')][,r2:=stringi::stri_sub(ch,length=5)][]
toc(log=T)

 res <- list(unnest.J$r2,tib2$result2,dt$result_dt_key,dt$result_dt,tib$result)
 sapply(res,identical,unnest.I$r2)
[1] TRUE TRUE TRUE TRUE TRUE

So- I guess the lesson learnt is that although listcolumns look deceivingly tempting as data structures for analysis, they are much slooower.

enter image description here

4
  • Is it necessary to keep the data structure? I think most of the time it makes more sense to unnest. Commented Aug 13, 2019 at 12:46
  • It is not absolutely necessary, but I find that the structure increases legibility for the analysis. Commented Aug 13, 2019 at 13:03
  • 2
    You might also want to check that you're getting the same results with each way. I think setindex doesn't sort the data as you're expecting (for which you'd want setkey); and if there are ties at the max num, you'll get multiple rows per df with the join way while the others only return one row. (I see you generate num by sampling without replacement, but maybe it's an issue with the real application.) Commented Aug 13, 2019 at 20:23
  • 1
    @Frank thanks for the suggestion re equal results. See the edit. Commented Aug 13, 2019 at 21:04

1 Answer 1

4

Operating on list columns tends to be slow, as the functions have to iterate over the entries and can't be vectorized. Usually it thus makes sense to unnest list columns:

tic("unnest")
tib2 <- tib %>% 
  tidyr::unnest(tib_df) %>%
  group_by(id) %>% 
  top_n(1, num) %>% 
  mutate(result = stringi::stri_sub(ch, length = 5))
toc(log = T, quiet = T)

Result:

> tic.log(format = T)
[[1]]
[1] "purrr: 39.54 sec elapsed"

[[2]]
[1] "setDT w key: 10.7 sec elapsed"

[[3]]
[1] "dt w key: 19.19 sec elapsed"

[[4]]
[1] "unnest: 1.62 sec elapsed"

With your toy data, the final object is only slightly different. But if it is necessary to get the original form back, you might want to do something like this:

tic("unnest+reattach")
tib2 <- tib %>% 
  tidyr::unnest(tib_df) %>%
  group_by(id) %>% 
  top_n(1, num) %>% 
  mutate(result = stringi::stri_sub(ch, length = 5))

tib3 <- tib %>% 
  mutate(result = tib2$result[match(id, tib2$id)])

toc(log = T, quiet = T)

tic.log(format = T)[[5]]
[1] "unnest+reattach: 1.83 sec elapsed"
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks for the input. I´ll revert to normal data structures ASAP. See my edit.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.