Optimize node-matching objective with unique node using data.table

Question

I have 2 sets of nodes represented by id1 & id2. I have a data.table containing penalties for pairs of nodes -- key (id1, id2), value is the penalty.

How do I effectively range the data.table to pairs of nodes with minimal penalty such that each node (id1 and id2) appears once?

Simple Example:

Input data.table:

dtIn <- data.table(
    id1 = rep(letters[1:3], each=3)
  , id2 = rep(1:3, 3)
  , penalty = 1:9
)
setkey(dtIn, id1, id2)

print(dtIn)
   id1 id2 penalty
1:   a   1       1
2:   a   2       2
3:   a   3       3
4:   b   1       4
5:   b   2       5
6:   b   3       6
7:   c   1       7
8:   c   2       8
9:   c   3       9

Desired output data.table:

   id1 id2 penalty
1:   a   1       1
2:   b   2       5
3:   c   3       9

I know how to implement the algorithm writing a loop: sort by penalty, loop through records and pick each pair in order if none of the nodes was previously matched. See code below.

But of course such a loop runs unbearably slow with my real-size data.

Manual loop function that is logically correct but performs way too poorly:

manualIter <- function(dtIn) {
  setkey(dtIn, penalty, id1, id2) # Enusred ordered by penalty.
  id1Match <- NULL; id2Match <- NULL; pen <- NULL;
  for (i in seq_len(nrow(dtIn))) {
    if (!(dtIn$id1[i] %in% id1Match) && !(dtIn$id2[i] %in% id2Match)) {
      id1Match <- c(id1Match, dtIn$id1[i])
      id2Match <- c(id2Match, dtIn$id2[i])
      pen <- c(pen, dtIn$penalty[i])
    }
  }
  # Build the return data.table for the matching ids.
  dtf <- data.table(id1 = id1Match, id2 = id2Match, penalty = pen)
  setkey(dtf, id1, id2)
  return(dtf)
}

So the question is how to efficiently vectorize this algorithm?

danas.zuokas · Accepted Answer · 2015-12-18 07:15:27Z

0

Updated the answer. I am not sure you can vectorize this. I think it is essentially a recursive problem. My answer is straightforward (given data sort by penalty):

dtOut <- list()
dtOut[[1]] <- dtIn[1]
i <- 2
while(dtIn[, .N] > 0) {
  dtIn <- dtIn[!(id1 == dtOut[[i - 1]][, id1] | id2 == dtOut[[i - 1]][, id2])]
  if(dtIn[, .N] < 1) break
  dtOut[[i]] <- dtIn[1]
  i <- i + 1
}
dtOut <- rbindlist(dtOut)

edited Dec 18, 2015 at 7:15

answered Dec 17, 2015 at 12:38

danas.zuokas

4,6634 gold badges33 silver badges40 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Patrick Over a year ago

Thanks @danas.zuokas. Your code is slightly shorter than mine but on my very small test set, it performs about 1 order of magnitude slower.

danas.zuokas Over a year ago

It can be rbind that slows down things. I will update the answer.

Patrick Over a year ago

Thanks @danas.zuokas. This version is slightly faster than the previous by still far behind my proposed loop. It looks like there is no obvious way to get this logic effectively implemented in R! Resorting to Julia that handles fast explicit loops!

Collectives™ on Stack Overflow

Optimize node-matching objective with unique node using data.table

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related