3

I have 2 sets of nodes represented by id1 & id2. I have a data.table containing penalties for pairs of nodes -- key (id1, id2), value is the penalty.

How do I effectively range the data.table to pairs of nodes with minimal penalty such that each node (id1 and id2) appears once?

Simple Example:

Input data.table:

dtIn <- data.table(
    id1 = rep(letters[1:3], each=3)
  , id2 = rep(1:3, 3)
  , penalty = 1:9
)
setkey(dtIn, id1, id2)

print(dtIn)
   id1 id2 penalty
1:   a   1       1
2:   a   2       2
3:   a   3       3
4:   b   1       4
5:   b   2       5
6:   b   3       6
7:   c   1       7
8:   c   2       8
9:   c   3       9

Desired output data.table:

   id1 id2 penalty
1:   a   1       1
2:   b   2       5
3:   c   3       9

I know how to implement the algorithm writing a loop: sort by penalty, loop through records and pick each pair in order if none of the nodes was previously matched. See code below.

But of course such a loop runs unbearably slow with my real-size data.

Manual loop function that is logically correct but performs way too poorly:

manualIter <- function(dtIn) {
  setkey(dtIn, penalty, id1, id2) # Enusred ordered by penalty.
  id1Match <- NULL; id2Match <- NULL; pen <- NULL;
  for (i in seq_len(nrow(dtIn))) {
    if (!(dtIn$id1[i] %in% id1Match) && !(dtIn$id2[i] %in% id2Match)) {
      id1Match <- c(id1Match, dtIn$id1[i])
      id2Match <- c(id2Match, dtIn$id2[i])
      pen <- c(pen, dtIn$penalty[i])
    }
  }
  # Build the return data.table for the matching ids.
  dtf <- data.table(id1 = id1Match, id2 = id2Match, penalty = pen)
  setkey(dtf, id1, id2)
  return(dtf)
}

So the question is how to efficiently vectorize this algorithm?

1 Answer 1

0

Updated the answer. I am not sure you can vectorize this. I think it is essentially a recursive problem. My answer is straightforward (given data sort by penalty):

dtOut <- list()
dtOut[[1]] <- dtIn[1]
i <- 2
while(dtIn[, .N] > 0) {
  dtIn <- dtIn[!(id1 == dtOut[[i - 1]][, id1] | id2 == dtOut[[i - 1]][, id2])]
  if(dtIn[, .N] < 1) break
  dtOut[[i]] <- dtIn[1]
  i <- i + 1
}
dtOut <- rbindlist(dtOut)
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks @danas.zuokas. Your code is slightly shorter than mine but on my very small test set, it performs about 1 order of magnitude slower.
It can be rbind that slows down things. I will update the answer.
Thanks @danas.zuokas. This version is slightly faster than the previous by still far behind my proposed loop. It looks like there is no obvious way to get this logic effectively implemented in R! Resorting to Julia that handles fast explicit loops!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.