3

Suppose I have a list of dataframes l. All dataframes are guaranteed to have the same shape and contain the same columns.

I would like to combine the columns of those dataframes with a column-specific element-wise operation, defined in a list of functions comb_funcs, and generate a new dataframe.

For the sake of simplicity, let's assume the list has only 2 dataframes with only 2 columns:

library(tibble)  # for tribble()

df1 <- tribble(
  ~n_students, ~age,
  100, 16,
  130, 15,
  110, 14
)

df2 <- tribble(
  ~n_students, ~age,
  150, 13,
  60, 12,
  75, 11
)

l <- list(df1, df2)

comb_funcs <- list(
  n_students = sum,
  age = median
)

In this example, the expected output is a new dataframe that contains 2 columns: n_students as the element-wise sum of the n_students columns, and age as the element-wise median of the age columns.

Here is what I tried:

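comb_dfs <- function(l, comb_funcs) {
  fin_df <- l[[1]]
  # Fold the remaining data frames into fin_df one at a time.
  # Note: pairwise folding only matches an all-at-once result for functions
  # such as sum; with more than 2 data frames, median would differ.
  for (df in l[-1]) {
    for (var in names(comb_funcs)) {
      fin_df[var] <- mapply(
        function(x, y) comb_funcs[[var]](c(x, y)),
        fin_df[[var]],
        df[[var]]
      )
    }
  }
  # Return only after every data frame has been processed
  fin_df
}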

In the example above, this returns the expected output:

> comb_dfs(l, comb_funcs)
# A tibble: 3 × 2
  n_students   age
       <dbl> <dbl>
1        250  14.5
2        190  13.5
3        185  12.5

But my function seems cumbersome.

Is there a cleaner way to do this with the tidyverse?

Please note that the MWE is written with 2 dataframes and 2 columns, but in real life there might be many dataframes with many columns. Therefore, the inputs of the algorithm must be l (the list of dataframes) and comb_funcs (a named list where the names are the columns to process and the values are the functions to use).

All dataframes in the list are guaranteed to have the same shape and the same columns. names(comb_funcs) is guaranteed to be a subset of those columns.

  • I believe the "tidy data" approach would be to have all data in one data.frame (with a row-id column and a df-id column). Then this is a simple group-by operation. Commented Oct 16 at 10:48
  • Can you elaborate more? Commented Oct 16 at 10:56
  • If age and n_students are elements of all input data frames, and you are interested in sum(n_students) and median(age), why do you believe suggested solutions fail? In other words, why so many people suggest answers which do not fit the requirements given in your question (in your opinion)? Commented Oct 16 at 11:42
  • For the vtc (not me), every day I see what appears to be stochastic or unrelated-subjective (i.e., "I'm in a bad mood") votes to close. I defend the need for anonymous voting at the same time I encourage voters to back up their vote with suggestions or commentary. Commented Oct 16 at 12:05
  • (more than two columns) with arbitrary names? How do you decide which to choose? This gets confusing! Commented Oct 16 at 13:51

7 Answers

7

Here's one approach:

library(tibble) # as_tibble
library(purrr)  # imap, map, transpose
imap(comb_funcs, function(fun, idx) {
  map(l, `[[`, idx) |>
    transpose() |>
    map(unlist) |>
    sapply(fun)
}) |>
  as_tibble()
# # A tibble: 3 × 2
#   n_students   age
#        <dbl> <dbl>
# 1        250  14.5
# 2        190  13.5
# 3        185  12.5

Note: I use sapply() here because I don't want to assume that map_dbl() is a perfect match; for instance, if you need map_int() or perhaps even map_chr(). If you want to be safer about it, the internal logic can be extended to check class(..) and call the appropriate map_*.
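As a rough illustration of the "safer" variant hinted at in the note, here is a minimal sketch that swaps sapply() for base R's vapply(), which enforces a declared result type. The helper name comb_col_strict is hypothetical, and the sketch assumes each function returns a single value of the same type as its input column.

```r
library(purrr)
library(tibble)

# Example inputs, mirroring the question
l <- list(
  tibble(n_students = c(100, 130, 110), age = c(16, 15, 14)),
  tibble(n_students = c(150, 60, 75), age = c(13, 12, 11))
)
comb_funcs <- list(n_students = sum, age = median)

# Hypothetical helper: apply `fun` to the i-th values of column `col` across
# all frames, insisting that every result is one value of the column's type.
comb_col_strict <- function(l, col, fun) {
  cols <- map(l, `[[`, col)                  # one vector per data frame
  template <- vector(typeof(cols[[1]]), 1)   # length-1 prototype, e.g. 0 for doubles
  vapply(
    seq_along(cols[[1]]),
    function(i) fun(unlist(lapply(cols, `[[`, i))),  # i-th value from every frame
    FUN.VALUE = template
  )
}

res <- imap(comb_funcs, function(fun, col) comb_col_strict(l, col, fun)) |>
  as_tibble()
res
```

With integer columns, median can return a double, in which case vapply() stops with an error instead of silently changing the column type. That strictness is the trade-off against sapply().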


Here's another, in some ways more direct:

library(dplyr)  # needs dplyr >= 1.1.0 for .by and reframe()
bind_rows(l, .id = ".id") |>
  mutate(.by = .id, .rn = row_number()) |>
  reframe(.by = .rn, across(any_of(names(comb_funcs)), ~ comb_funcs[[cur_column()]](.x))) |>
  select(-.rn)
# # A tibble: 3 × 2
#   n_students   age
#        <dbl> <dbl>
# 1        250  14.5
# 2        190  13.5
# 3        185  12.5

3 Comments

Your first solution returns just one row for me: 160 14.
The second solution works as expected.
Okay, sorry about that. There was a late edit to the code where I inadvertently used imap internally where it should have been map. I should know better than to test, copy, paste, and then edit the code. It works now. The second chunk is my preferred one, though; it is IMHO a clean approach.
5

This one-liner combines the corresponding columns and functions.

library(purrr)

map2_dfc(transpose(l), comb_funcs, ~ pmap(.x, \(...) .y(c(...))) %>% unlist)

## # A tibble: 3 × 2
##   n_students   age
##        <dbl> <dbl>
## 1        250  14.5
## 2        190  13.5
## 3        185  12.5

Notes

There are some issues regarding the actual definition of the problem. We have assumed that the tibbles in l and comb_funcs all have exactly the same names in the same order. Two further points:

  1. For each row, the values are gathered into a single vector with c(...) before the function is called, so each function in comb_funcs receives one vector argument. If every function used in practice also accepts the values as separate arguments, as sum does (sum(1:2) and sum(1, 2) both work), then the following shorter third argument to map2_dfc can be used instead. It is not safe for median, whose second positional argument is na.rm, so median(16, 13) returns 16 rather than 14.5.
   ~ pmap(.x, .y) %>% unlist
  2. Also, the above works on the inputs shown in the question, but if the inputs have extraneous columns or the columns are not in the same order as comb_funcs, then first run the following. It puts the columns in each data frame of l in the same order as comb_funcs and keeps only the columns we need. Running it with the inputs in the question does not change l, so it is not needed in that case.
   l <- map(l, ~ .x[names(comb_funcs)])
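
A quick base R check of the two calling conventions discussed in the notes may help here. It uses only sum and median; the variable names are illustrative.

```r
# Separate arguments versus one vector: identical for sum
sum_separate <- sum(1, 2)
sum_vector   <- sum(c(1, 2))

# Not identical for median: its second positional argument is na.rm,
# so the 13 below is treated as a flag rather than as data
median_vector   <- median(c(16, 13))
median_separate <- median(16, 13)

c(sum_separate, sum_vector)        # both 3
c(median_vector, median_separate)  # 14.5 and 16
```

Wrapping the call so that the function always receives one vector avoids depending on how each function treats extra positional arguments.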

Update

I have completely revised the solution.

4

I know you asked for tidyverse, but here's a data.table option:

library(data.table)

df = rbindlist(l, idcol = TRUE)  # adds a ".id" column identifying the source frame
df[, element := 1:.N, by=.(.id)]
df = df[, lapply(names(comb_funcs), \(x) comb_funcs[[x]](get(x))), 
        by = .(element)][, element := NULL]
setnames(df, new = names(comb_funcs))
df

1 Comment

df[, element := 1:.N, by=.(.id)] can be simplified to df[, element := rowid(.id)]
3

{collapse} written with ~SQL syntax

# in simple terms 
l |>
  collapse::dapply(\(i) collapse::ftransform(i, id=collapse::seq_row(i))) |>
  collapse::rowbind() |>
  collapse::fgroup_by(id) |>
  collapse::fsummarise(n_students = collapse::fsum(n_students),
                       age = collapse::fmedian(age))
# A tibble: 3 × 3
     id n_students   age
  <int>      <dbl> <dbl>
1     1        250  14.5
2     2        190  13.5
3     3        185  12.5

The base R equivalent would be along the lines of lapply, transform, do.call + rbind, Map, aggregate + reformulate + match.fun, Reduce, and merge magic.

Edit to add: also works for:

# <...>

df3 <- tribble(
  ~n_students, ~age, ~blah,
  150, 13, 3,
  60, 12, 4,
  75, 11, 8
)

l2 <- list(df1, df2, df3)

l2 |>
  collapse::dapply(\(i) collapse::ftransform(i, id=collapse::seq_row(i))) |>
  collapse::rowbind(fill=TRUE) |>
  collapse::fgroup_by(id) |> 
  collapse::fsummarise(n_students = collapse::fsum(n_students),
                       age = collapse::fmedian(age))
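
The base R route mentioned above (lapply, transform, do.call + rbind, Map, aggregate + reformulate, Reduce, merge) might look roughly like this. It is a sketch rather than a tuned implementation: it assumes plain data frames that share the columns named in comb_funcs, and the names stacked, parts and res are illustrative.

```r
df1 <- data.frame(n_students = c(100, 130, 110), age = c(16, 15, 14))
df2 <- data.frame(n_students = c(150, 60, 75), age = c(13, 12, 11))
l <- list(df1, df2)
comb_funcs <- list(n_students = sum, age = median)

# Tag each row with its position, then stack the frames
stacked <- do.call(rbind, lapply(l, \(d) transform(d, id = seq_len(nrow(d)))))

# One aggregate() per column, each with its own function (e.g. n_students ~ id)
parts <- Map(
  \(col, fun) aggregate(reformulate("id", response = col), data = stacked, FUN = fun),
  names(comb_funcs), comb_funcs
)

# Merge the per-column summaries back together on the row id
res <- Reduce(\(a, b) merge(a, b, by = "id"), parts)
res
```

Unlike the hard-coded fsummarise() call, this version is driven by names(comb_funcs), at the cost of one aggregate() pass per column.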

1

I would combine the data frames and then do a group-wise summary:

library(dplyr)
library(purrr)
library(tibble)  # rowid_to_column()

funcs <- list(sum, median)
col_names <- list("n_students", "age")

df <- l |>
  map(.x = _,
      .f = ~.x |> 
        rowid_to_column()) |> 
  bind_rows()

map2(.x = col_names,
     .y = funcs,
     .f = ~df |> 
       summarize(across(.x, .y), .by = rowid)) |> 
  reduce(left_join, by = "rowid")

which gives:

# A tibble: 3 × 3
  rowid n_students   age
  <int>      <dbl> <dbl>
1     1        250  14.5
2     2        190  13.5
3     3        185  12.5

4 Comments

Again, you can't hard code column names or operations. They are arbitrary.
It is not clear to me at which point you know which functions you want to apply to which columns. Can you please clarify?
Your inputs must be l (the list of dataframes) and comb_funcs (the functions to combine the columns)
see my updated answer
1

You could first purrr::list_transpose() the frame list to turn it "inside out":

l <- list(
  tibble::tibble(n_students = c(100, 130, 110), age = c(16, 15, 14)),
  tibble::tibble(n_students = c(150, 60, 75), age = c(13, 12, 11))
)

str(l)
#> List of 2
#>  $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ n_students: num [1:3] 100 130 110
#>   ..$ age       : num [1:3] 16 15 14
#>  $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ n_students: num [1:3] 150 60 75
#>   ..$ age       : num [1:3] 13 12 11

purrr::list_transpose(l) |> str()
#> List of 2
#>  $ n_students:List of 2
#>   ..$ : num [1:3] 100 130 110
#>   ..$ : num [1:3] 150 60 75
#>  $ age       :List of 2
#>   ..$ : num [1:3] 16 15 14
#>   ..$ : num [1:3] 13 12 11

From there you can bind each column's vectors into a matrix and apply() the function over its rows. imap() gives you access to the list names, and filtering the list with keep_at() protects against missing or extra entries in comb_funcs.

library(dplyr)
library(purrr)

comb_funcs <- list(
  n_students = sum,
  age = median
)

comb_dfs <- function(l, comb_funcs){
  list_transpose(l) |> 
    # keep only items that have corresponding function in `comb_funcs`
    keep_at(names(comb_funcs)) |>
    imap(\(cols, col_name) apply(do.call(cbind, cols), 1, comb_funcs[[col_name]])) |> 
    as_tibble()
}
comb_dfs(l, comb_funcs)
#> # A tibble: 3 × 2
#>   n_students   age
#>        <dbl> <dbl>
#> 1        250  14.5
#> 2        190  13.5
#> 3        185  12.5

# comb_funcs without `age` and with an extra function
comb_dfs(l, list(n_students = sum, foo = exp))
#> # A tibble: 3 × 1
#>   n_students
#>        <dbl>
#> 1        250
#> 2        190
#> 3        185

Created on 2025-10-16 with reprex v2.1.1

1

Here is a base R approach

data.frame(lapply(names(comb_funcs), \(col) {
    apply(sapply(l, \(df) df[[col]]), 1, comb_funcs[[col]]) 
  })) |> setNames(names(comb_funcs))

#   n_students  age
# 1        250 14.5
# 2        190 13.5
# 3        185 12.5
