Combine columns of a list of dataframes with a custom function

Question

Suppose I have a list of dataframes l. All dataframes are guaranteed to have the same shape and contain the same columns.

I would like to combine the columns of those dataframes with a column-specific element-wise operation, defined in a list of functions comb_funcs, and generate a new dataframe.

For the sake of simplicity, let's assume the list has only 2 dataframes with only 2 columns:

library(tibble)

df1 <- tribble(
  ~n_students, ~age,
  100, 16,
  130, 15,
  110, 14
)

df2 <- tribble(
  ~n_students, ~age,
  150, 13,
  60, 12,
  75, 11
)

l <- list(df1, df2)

comb_funcs <- list(
  n_students = sum,
  age = median
)

In this example, the expected output is a new dataframe that contains 2 columns: n_students as the element-wise sum of the n_students columns, and age as the element-wise median of the age columns.

Here is what I tried:

comb_dfs <- function(l, comb_funcs) {
  fin_df <- l[[1]]
  # fold each remaining data frame into the running result, column by column
  for (df in l[-1]) {
    for (var in names(comb_funcs)) {
      fin_df[var] <- mapply(
        function(x, y) comb_funcs[[var]](c(x, y)),
        fin_df[[var]],
        df[[var]]
      )
    }
  }
  # return only after all data frames have been processed
  fin_df
}

In the example above, this returns the expected output:

> comb_dfs(l, comb_funcs)
# A tibble: 3 × 2
  n_students   age
       <dbl> <dbl>
1        250  14.5
2        190  13.5
3        185  12.5

But my function seems cumbersome. Is there a cleaner way to do this with the tidyverse?

Please note that the MWE is written with 2 dataframes and 2 columns, but in real life there might be many dataframes with many columns. Therefore, the input of the algorithm must be l (the list of dataframes) and comb_funcs (a named list where names are the columns to process and values are the functions to use).

All dataframes in the list are guaranteed to have the same shape and the same columns. names(comb_funcs) is guaranteed to be a subset of those columns.

I believe the "tidy data" approach would be to have all data in one data.frame (with a row-id column and a df-id column). Then this is a simple group-by operation.
– Roland, Commented Oct 16 at 10:48
If age and n_students are elements of all input data frames, and you are interested in sum(n_students) and median(age), why do you believe suggested solutions fail? In other words, why do so many people suggest answers which do not fit the requirements given in your question (in your opinion)?
– Friede, Commented Oct 16 at 11:42
For the vtc (not me), every day I see what appear to be stochastic or unrelated-subjective (i.e., "I'm in a bad mood") votes to close. I defend the need for anonymous voting at the same time I encourage voters to back up their vote with suggestions or commentary.
– r2evans, Commented Oct 16 at 12:05
(more than two columns) with arbitrary names? How do you decide which to choose? This gets confusing!
– Friede, Commented Oct 16 at 13:51

r2evans · Accepted Answer · 2025-10-16 15:16:50Z

7

Here's one approach:

# library(tibble) # as_tibble
# library(purrr)
imap(comb_funcs, function(fun, idx) {
  map(l, `[[`, idx) |>
    transpose() |>
    map(unlist) |>
    sapply(fun)
}) |>
  as_tibble()
# # A tibble: 3 × 2
#   n_students   age
#        <dbl> <dbl>
# 1        250  14.5
# 2        190  13.5
# 3        185  12.5

Note: I use sapply() here because I don't want to assume that map_dbl() is always the right match; you might need map_int() or perhaps even map_chr() instead. If you want to be safer about it, the internal logic can be extended to check class(..) and call the appropriate map_*.
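The type check mentioned in the note could look roughly like the sketch below. typed_map() is a hypothetical helper, not a purrr function. It assumes the input is non-empty and that the function returns a length-1 value of the same type for every element, and it probes only the first element to decide which map_*() to call.

```r
library(purrr)

# Hypothetical helper (not part of purrr): choose a typed map_*() variant
# from the class of the first result. Assumes `x` is non-empty and that `fun`
# returns a length-1 value of the same type for every element.
typed_map <- function(x, fun) {
  probe <- fun(x[[1]])
  mapper <- switch(
    class(probe)[1],
    integer   = map_int,
    numeric   = map_dbl,
    character = map_chr,
    logical   = map_lgl,
    map  # fallback: keep the results in a list
  )
  mapper(x, fun)
}

typed_map(list(c(100, 150), c(130, 60)), sum)  # double vector
typed_map(list(1:2, 3:4), sum)                 # integer vector
```

In the first approach this would take the place of sapply(fun), at the cost of calling fun one extra time on the first element.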

Here's another, in some ways more direct:

# library(dplyr)
bind_rows(l, .id = ".id") |>
  mutate(.by = .id, .rn = row_number()) |>
  reframe(.by = .rn, across(any_of(names(comb_funcs)), ~ comb_funcs[[cur_column()]](.x))) |>
  select(-.rn)
# # A tibble: 3 × 2
#   n_students   age
#        <dbl> <dbl>
# 1        250  14.5
# 2        190  13.5
# 3        185  12.5

edited Oct 16 at 15:16

answered Oct 16 at 11:55

r2evans

3 Comments

robertspierre Oct 16 at 14:00

Your first solution returns just one row for me 160 14

robertspierre Oct 16 at 14:01

The second solution works as expected

r2evans Oct 16 at 15:18

Okay, sorry about that. A late edit to the code inadvertently used imap internally where it should have been map; I should know better than to test, copy, paste, and then edit the code. It works now. The second chunk is my preferred one, though; IMHO it is a clean approach.

G. Grothendieck · Accepted Answer · 2025-10-18 16:29:42Z

This one-liner combines the corresponding columns and functions.

library(purrr)

map2_dfc(transpose(l), comb_funcs, ~ pmap(.x, \(...) .y(c(...))) %>% unlist)

## # A tibble: 3 × 2
##   n_students   age
##        <dbl> <dbl>
## 1        250  14.5
## 2        190  13.5
## 3        185  12.5

Notes

There are some issues regarding the actual definition of the problem. We have assumed that the tibbles in l and comb_funcs all have exactly the same names in the same order, and that each function in comb_funcs takes a single vector argument, which is why the values for each row are wrapped in c(...).

Passing the values as separate arguments, with ~ pmap(.x, .y) %>% unlist as the third argument to map2_dfc, is shorter and works for sum, since sum(1:2) and sum(1, 2) agree. It does not work for median: the second parameter of median is na.rm, so median(16, 13) returns 16 rather than 14.5.

Also, the above works on the inputs shown in the question. If the inputs have extraneous columns, or the columns are not in the same order as comb_funcs, first run the line below. It puts the columns of each data frame in l in the same order as comb_funcs and keeps only the columns needed. With the inputs in the question it leaves l unchanged, so it is not needed in that case.

   l <- map(l, ~ .x[names(comb_funcs)])

Update

I have completely revised the solution.

Stephen · Accepted Answer · 2025-10-16 11:11:30Z

4

I know you asked for tidyverse but here's a data.table option

library(data.table)

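df = rbindlist(l, idcol = TRUE)   # idcol = TRUE adds a ".id" column
df[, element := 1:.N, by=.(.id)]  # row position within each source table
# apply each column's function within each row position
df = df[, lapply(names(comb_funcs), \(x) comb_funcs[[x]](get(x))),
        by = .(element)][, element := NULL]
setnames(df, new = names(comb_funcs))
df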

answered Oct 16 at 11:11

Stephen

1 Comment

Gusbourne Oct 18 at 20:26

df[, element := 1:.N, by=.(.id)] can be simplified to df[, element := rowid(.id)]
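Putting the rowid() suggestion together with the answer, a variant could look like the sketch below. It is not the answerer's code: it swaps get() for Map() over .SD, and it redefines the question's inputs as data.tables so that the block runs on its own.

```r
library(data.table)

# Inputs from the question, rebuilt as data.tables (assumption for a
# self-contained example).
l <- list(
  data.table(n_students = c(100, 130, 110), age = c(16, 15, 14)),
  data.table(n_students = c(150, 60, 75), age = c(13, 12, 11))
)
comb_funcs <- list(n_students = sum, age = median)

dt <- rbindlist(l, idcol = TRUE)   # idcol = TRUE names the id column ".id"
dt[, element := rowid(.id)]        # row position within each source table

# Pair each function with its own column inside every row position.
# .SDcols follows the order of comb_funcs, so the pairs line up.
res <- dt[, Map(\(f, col) f(col), comb_funcs, .SD),
          by = element, .SDcols = names(comb_funcs)]
res[, element := NULL]
res
```

Because Map() takes its names from comb_funcs, the result columns are already named and no setnames() call is needed.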

Friede · Accepted Answer · 2025-10-16 13:41:19Z

{collapse} written with ~SQL syntax

# in simple terms 
l |>
  collapse::dapply(\(i) collapse::ftransform(i, id=collapse::seq_row(i))) |>
  collapse::rowbind() |>
  collapse::fgroup_by(id) |>
  collapse::fsummarise(n_students = collapse::fsum(n_students),
                       age = collapse::fmedian(age))

# A tibble: 3 × 3
     id n_students   age
  <int>      <dbl> <dbl>
1     1        250  14.5
2     2        190  13.5
3     3        185  12.5

The base R equivalent would be along the lines of lapply, transform, do.call + rbind, Map, aggregate + reformulate + match.fun, Reduce, and merge magic.

Edit to add: also works for:

# <...>

df3 <- tribble(
  ~n_students, ~age, ~blah,
  150, 13, 3,
  60, 12, 4,
  75, 11, 8
)

l2 <- list(df1, df2, df3)

l2 |>
  collapse::dapply(\(i) collapse::ftransform(i, id=collapse::seq_row(i))) |>
  collapse::rowbind(fill=TRUE) |>
  collapse::fgroup_by(id) |> 
  collapse::fsummarise(n_students = collapse::fsum(n_students),
                       age = collapse::fmedian(age))
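The base R route mentioned above could be sketched roughly as follows. This is one possible reading of that list of functions, not the answerer's own code. It takes the functions from comb_funcs rather than hard-coding them, and it rebuilds the question's inputs as plain data frames so that the block runs on its own.

```r
# Inputs from the question, rebuilt as base data frames (assumption for a
# self-contained example).
l <- list(
  data.frame(n_students = c(100, 130, 110), age = c(16, 15, 14)),
  data.frame(n_students = c(150, 60, 75), age = c(13, 12, 11))
)
comb_funcs <- list(n_students = sum, age = median)

# Tag each row with its position, then stack all data frames.
stacked <- do.call(rbind, lapply(l, \(d) transform(d, id = seq_len(nrow(d)))))

# One aggregate() per column, each with its own function; reformulate()
# builds the formula `<column> ~ id` from the column name.
parts <- Map(
  \(col, fun) aggregate(reformulate("id", response = col), data = stacked, FUN = fun),
  names(comb_funcs), comb_funcs
)

# Merge the per-column summaries on id and keep only the requested columns.
res <- Reduce(\(a, b) merge(a, b, by = "id"), parts)[names(comb_funcs)]
res
```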

deschen · Accepted Answer · 2025-10-16 11:42:58Z

1

I would combine the data frames and then do a group-wise summary:

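library(dplyr)
library(purrr)
library(tibble)  # rowid_to_column

funcs <- list(sum, median)
col_names <- list("n_students", "age")

# add a row id to each data frame, then stack them
df <- l |>
  map(rowid_to_column) |>
  bind_rows()

# summarize each column with its own function, then join the results on rowid
map2(.x = col_names,
     .y = funcs,
     .f = ~df |>
       summarize(across(all_of(.x), .y), .by = rowid)) |>
  reduce(left_join, by = "rowid")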

which gives:

# A tibble: 3 × 3
  rowid n_students   age
  <int>      <dbl> <dbl>
1     1        250  14.5
2     2        190  13.5
3     3        185  12.5

edited Oct 16 at 11:42

answered Oct 16 at 11:14

deschen

4 Comments

robertspierre Oct 16 at 11:15

Again, you can't hard code column names or operations. They are arbitrary.

deschen Oct 16 at 11:24

It is not clear to me at which point you know which functions you want to apply to which columns. Can you please clarify?

robertspierre Oct 16 at 11:36

Your inputs must be l (the list of dataframes) and comb_funcs (the functions to combine the columns)

deschen Oct 16 at 11:43

see my updated answer
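To meet the requirement raised in these comments, the same stack-and-summarize idea could be driven directly by comb_funcs instead of separate funcs and col_names lists. The sketch below is an adaptation, not the answerer's code. It redefines the question's inputs so that the block runs on its own, and it assumes dplyr 1.1.0 or later for the .by argument.

```r
library(dplyr)
library(purrr)
library(tibble)

# Inputs from the question, rebuilt here for a self-contained example.
l <- list(
  tibble(n_students = c(100, 130, 110), age = c(16, 15, 14)),
  tibble(n_students = c(150, 60, 75), age = c(13, 12, 11))
)
comb_funcs <- list(n_students = sum, age = median)

# Add a row id to each data frame, then stack them.
stacked <- l |>
  map(rowid_to_column) |>
  bind_rows()

# imap() hands over each function together with its name, which is the
# column it applies to; the per-column summaries are then joined on rowid.
res <- imap(comb_funcs, \(fun, col) {
  summarize(stacked, across(all_of(col), fun), .by = rowid)
}) |>
  reduce(left_join, by = "rowid") |>
  select(-rowid)
res
```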

margusl · Accepted Answer · 2025-10-16 12:34:56Z

You could first purrr::list_transpose() the frame list to turn it "inside out":

l <- list(
  tibble::tibble(n_students = c(100, 130, 110), age = c(16, 15, 14)),
  tibble::tibble(n_students = c(150, 60, 75), age = c(13, 12, 11))
)

str(l)
#> List of 2
#>  $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ n_students: num [1:3] 100 130 110
#>   ..$ age       : num [1:3] 16 15 14
#>  $ : tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
#>   ..$ n_students: num [1:3] 150 60 75
#>   ..$ age       : num [1:3] 13 12 11

purrr::list_transpose(l) |> str()
#> List of 2
#>  $ n_students:List of 2
#>   ..$ : num [1:3] 100 130 110
#>   ..$ : num [1:3] 150 60 75
#>  $ age       :List of 2
#>   ..$ : num [1:3] 16 15 14
#>   ..$ : num [1:3] 13 12 11

From there you can build matrices and apply() a function over their rows. imap() gives you access to the list names, and filtering the list with keep_at() protects against missing or extra entries in comb_funcs.

library(dplyr)
library(purrr)

comb_funcs <- list(
  n_students = sum,
  age = median
)

comb_dfs <- function(l, comb_funcs){
  list_transpose(l) |> 
    # keep only items that have corresponding function in `comb_funcs`
    keep_at(names(comb_funcs)) |>
    imap(\(cols, col_name) apply(do.call(cbind, cols), 1, comb_funcs[[col_name]])) |> 
    as_tibble()
}
comb_dfs(l, comb_funcs)
#> # A tibble: 3 × 2
#>   n_students   age
#>        <dbl> <dbl>
#> 1        250  14.5
#> 2        190  13.5
#> 3        185  12.5

# comb_funcs without `age` and with an extra function
comb_dfs(l, list(n_students = sum, foo = exp))
#> # A tibble: 3 × 1
#>   n_students
#>        <dbl>
#> 1        250
#> 2        190
#> 3        185

Created on 2025-10-16 with reprex v2.1.1

lailaps · Accepted Answer · 2025-10-16 14:20:14Z

1

Here is a base R approach

data.frame(lapply(names(comb_funcs), \(col) {
    apply(sapply(l, \(df) df[[col]]), 1, comb_funcs[[col]]) 
  })) |> setNames(names(comb_funcs))

#   n_students  age
# 1        250 14.5
# 2        190 13.5
# 3        185 12.5

answered Oct 16 at 14:20

lailaps
