I need to iteratively perform joins between two data.tables where the column names are variables passed in from a function. I've been performing the joins using data.table's 'on' argument, and am running into issues because the variable column names don't seem to be recognised.

For example, say we have two tables, Table_1 and Table_2, as follows:

require(data.table)
n <- 20
Table_1 <- data.table(A = seq_len(n) + 1,
               B = seq_len(n) + 3,
               C = seq_len(n) + 5)

m <- 15
Table_2 <- data.table(D = seq_len(m) + 7,
               E = seq_len(m) + 9,
               F = seq_len(m) + 12)

I can easily perform joins where I define the columns explicitly. e.g.

Table_2[Table_1, on = .(F = C), sum(D, na.rm = TRUE)]

However, what I need to do is to perform multiple matches on various columns such as this:

require(purrr)    
pmap(.l = CJ(x = c("D","F"),y = c("A","B")),
     .f = function(x,y) Table_2[Table_1,on = .(x = y),sum(C,na.rm = T)])

I receive the following error:

Error in colnamesInt(x, names(on), check_dups = FALSE) : 
  argument specifying columns specify non existing column(s): cols[1]='x' 

I've tried various things, such as:

  1. Enclosing x and y with "eval()" or "noquote"
  2. Putting the pmap function within the data.table, rather than outside as shown above.

Neither approach works. Any assistance would be greatly appreciated, as it would obviously be extremely inefficient to write out a separate join statement for every combination!

Thanks, Phil

EDIT:

It was suggested below that I should consider using the "merge" function. In theory, this would work for the above example. However, I didn't mention that I actually need to use non-equi joins, which means that, as far as I'm aware, I can't use "merge". In my real-world case, there will be combinations of equi and non-equi joins that I need to map column names to via a function.

I've provided a follow-up example with target output. The example only has two join conditions, but I'd need the solution to be flexible enough to handle more:

I want the following expression:

pmap(.l = list(x1 = "D",x2 = "A",x3 = "E",x4 = "B"),
    .f = function(x1,x2,x3,x4) (Table_2[Table_1,on = .(x1 = x2,
                             x3 > x4),sum(C,na.rm = T)]))

To give the same output as this:

Table_2[Table_1,on = .(D = A,
                       E > B),sum(C,na.rm = T)]

i.e. 310 in this example.
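As a quick sanity check of that figure (a sketch only, using the toy tables above): with this data, D == A implies E == B, so no row of Table_2 satisfies both conditions. The default join therefore keeps every row of Table_1 unmatched, and the target is just the total of column C. The nomatch = NULL call is included only to show that the inner join is empty.

```r
library(data.table)

# Rebuild the toy tables from the example above.
n <- 20
Table_1 <- data.table(A = seq_len(n) + 1,
                      B = seq_len(n) + 3,
                      C = seq_len(n) + 5)
m <- 15
Table_2 <- data.table(D = seq_len(m) + 7,
                      E = seq_len(m) + 9,
                      F = seq_len(m) + 12)

# Inner join: rows of Table_1 without a match in Table_2 are dropped.
matched <- Table_2[Table_1, on = .(D = A, E > B), nomatch = NULL]
nrow(matched)

# Default join: unmatched rows of Table_1 are kept, so C is summed over all of them.
target <- Table_2[Table_1, on = .(D = A, E > B), sum(C, na.rm = TRUE)]
target == sum(Table_1$C)
```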

Thanks again, Phil

  • Can you show the expected output given the input? Commented Aug 27, 2020 at 13:40
  • Have you tried the merge function? The by.x and by.y parameters ask for strings as column names, which can also be in variables. Commented Aug 27, 2020 at 13:51
  • @sindri_baldur - please see my revised question above with target output. Thanks Commented Aug 27, 2020 at 23:40
  • @cdalitz - thanks for this. It would work in the example above, but I didn't mention the full complexity of my real-life scenario where non-equi joins are required. Please see my amended question with target output. Commented Aug 27, 2020 at 23:41

1 Answer

I just figured out how to do this through trial and error:

pmap(.l = list(x1 = "D", x2 = "A", x3 = "E", x4 = "B"),
     .f = function(x1, x2, x3, x4) {
       Table_2[Table_1,
               on = c(paste0(x1, "==", x2), paste0(x3, ">", x4)),
               sum(C, na.rm = TRUE)]
     })
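
To handle an arbitrary number of conditions, the same idea can be wrapped up so the condition strings are built from character vectors. The following is a minimal sketch, not a tested general solution: build_on() and the specs list are hypothetical names I made up for illustration, and it relies only on 'on' accepting a character vector of strings such as "D==A" or "E>B". It uses base lapply() instead of pmap(), but either should work.

```r
library(data.table)

# Same toy tables as in the question.
n <- 20
Table_1 <- data.table(A = seq_len(n) + 1,
                      B = seq_len(n) + 3,
                      C = seq_len(n) + 5)
m <- 15
Table_2 <- data.table(D = seq_len(m) + 7,
                      E = seq_len(m) + 9,
                      F = seq_len(m) + 12)

# Hypothetical helper (not part of data.table): pastes left column,
# operator and right column into strings like "D==A" or "E>B".
build_on <- function(lhs, ops, rhs) {
  stopifnot(length(lhs) == length(ops), length(lhs) == length(rhs))
  paste0(lhs, ops, rhs)
}

# The two-condition join from the question, driven by strings.
on_spec <- build_on(lhs = c("D", "E"), ops = c("==", ">"), rhs = c("A", "B"))
result <- Table_2[Table_1, on = on_spec, sum(C, na.rm = TRUE)]

# Several join specifications processed in one loop.
specs <- list(
  list(lhs = c("D", "E"), ops = c("==", ">"), rhs = c("A", "B")),
  list(lhs = c("F", "D"), ops = c("==", "<"), rhs = c("A", "B"))
)
results <- lapply(specs, function(s) {
  Table_2[Table_1, on = build_on(s$lhs, s$ops, s$rhs), sum(C, na.rm = TRUE)]
})
```

Here result should match the 310 target from the question, and each element of specs can mix equi and non-equi operators freely.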