working with list columns in r (data.table/dplyr)

Question

I have a large dataset organized in listcolumns (20Gb).

1-Is there any other tricks I can use to speed up the following code? Data.table seems to provide a two-fold increase, but I assume there are other tricks that can be used in similar scenarios.

2-Apart from saveRDS - is there any other faster file-libraries (vroom, fst and fwrite do not seem to support listcols) supporting listcolumns?

3-I tried dt[,.(test=tib_sort[tib_sort[, .I[.N]], stringi::stri_sub(dt$ch, length = 5)],by=id)]but it throws an incorrect number of dimensions error. Is there an option to do by using a listcolumn and automatically have it setDT and key/index?

library(dplyr)
library(purrr)
library(data.table)
library(tictoc)

Toy data

set.seed(123)
tib <-
  tibble(id = 1:20000) %>% mutate(k = map_int(id,  ~ sample(c(10:30), 1)))
tib <-
  tib %>% mutate(tib_df = map(k,  ~ tibble(
    ch = replicate(.x, paste0(
      sample(letters[1:24],
             size = sample(c(10:20), 1)),
      collapse = ""
    )),
    num = sample(1:1e10, .x,replace = F)
  )))

Dplyr

help <- function(df) {
  df <- df %>% top_n(1, num) %>% select(ch)
  stringi::stri_sub(df, length = 5)
}
tic("purrr")
tib <- tib %>% mutate(result = map_chr(tib_df, help))
toc(log = T, quiet = T)

Data.table

dt <- copy(tib)
setDT(dt)
tic("setDT w key")
dt[, tib_df := lapply(tib_df, setDT)]
dt[, tib_sort := lapply(tib_df, function(x)
  setindex(x, "num"))]
toc(log = T, quiet = T)
tic("dt w key")
dt[, result_dt_key := sapply(tib_df, function(x) {
  x[x[, .I[.N]], stringi::stri_sub(ch, length = 5)]
})]
toc(log=T, quiet = T)

Timing

    tic.log(format = T)
[[1]]
[1] "purrr: 25.499 sec elapsed"

[[2]]
[1] "setDT w key: 4.875 sec elapsed"

[[3]]
[1] "dt w key: 12.077 sec elapsed"

EDITS and update after also including unnested versions from both dplyr and data.table

1 purrr: 25.824 sec elapsed          
2 setDT wo key: 2.97 sec elapsed     
3 dt wo key: 13.724 sec elapsed      
4 setDT w key: 1.778 sec elapsed     
5 dt w key: 11.489 sec elapsed       
6 dplyr,unnest: 1.496 sec elapsed    
7 dt,I,unnest: 0.329 sec elapsed     
8 dt, join, unnest: 0.325 sec elapsed

tic("dt, join, unnest")
b <- unnest(tib)
setDT(b)
unnest.J <- b[b[, .(num=max(num)), by = 'id'], on=c('id','num')][,r2:=stringi::stri_sub(ch,length=5)][]
toc(log=T)

 res <- list(unnest.J$r2,tib2$result2,dt$result_dt_key,dt$result_dt,tib$result)
 sapply(res,identical,unnest.I$r2)
[1] TRUE TRUE TRUE TRUE TRUE

So- I guess the lesson learnt is that although listcolumns look deceivingly tempting as data structures for analysis, they are much slooower.

Is it necessary to keep the data structure? I think most of the time it makes more sense to unnest. — JBGruber
– JBGruber, Commented Aug 13, 2019 at 12:46
It is not absolutely necessary, but I find that the structure increases legibility for the analysis. — Misha
– Misha, Commented Aug 13, 2019 at 13:03
You might also want to check that you're getting the same results with each way. I think setindex doesn't sort the data as you're expecting (for which you'd want setkey); and if there are ties at the max num, you'll get multiple rows per df with the join way while the others only return one row. (I see you generate num by sampling without replacement, but maybe it's an issue with the real application.) — Frank
– Frank, Commented Aug 13, 2019 at 20:23
@Frank thanks for the suggestion re equal results. See the edit. — Misha
– Misha, Commented Aug 13, 2019 at 21:04

JBGruber · Accepted Answer · 2019-08-13 14:13:10Z

4

Operating on list columns tends to be slow, as the functions have to iterate over the entries and can't be vectorized. Usually it thus makes sense to unnest list columns:

tic("unnest")
tib2 <- tib %>% 
  tidyr::unnest(tib_df) %>%
  group_by(id) %>% 
  top_n(1, num) %>% 
  mutate(result = stringi::stri_sub(ch, length = 5))
toc(log = T, quiet = T)

Result:

> tic.log(format = T)
[[1]]
[1] "purrr: 39.54 sec elapsed"

[[2]]
[1] "setDT w key: 10.7 sec elapsed"

[[3]]
[1] "dt w key: 19.19 sec elapsed"

[[4]]
[1] "unnest: 1.62 sec elapsed"

With your toy data, the final object is only slightly different. But if it is necessary to get the original form back, you might want to do something like this:

tic("unnest+reattach")
tib2 <- tib %>% 
  tidyr::unnest(tib_df) %>%
  group_by(id) %>% 
  top_n(1, num) %>% 
  mutate(result = stringi::stri_sub(ch, length = 5))

tib3 <- tib %>% 
  mutate(result = tib2$result[match(id, tib2$id)])

toc(log = T, quiet = T)

tic.log(format = T)[[5]]
[1] "unnest+reattach: 1.83 sec elapsed"

edited Aug 13, 2019 at 14:13

answered Aug 13, 2019 at 13:26

JBGruber

12.5k1 gold badge28 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Misha Over a year ago

Thanks for the input. I´ll revert to normal data structures ASAP. See my edit.

Collectives™ on Stack Overflow

working with list columns in r (data.table/dplyr)

Toy data

Dplyr

Data.table

Timing

EDITS and update after also including unnested versions from both dplyr and data.table

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Toy data

Dplyr

Data.table

Timing

EDITS and update after also including unnested versions from both dplyr and data.table

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related