I have 2 sets of nodes represented by id1 & id2. I have a data.table containing penalties for pairs of nodes -- key (id1, id2), value is the penalty.
How do I effectively range the data.table to pairs of nodes with minimal penalty such that each node (id1 and id2) appears once?
Simple Example:
Input data.table:
dtIn <- data.table(
id1 = rep(letters[1:3], each=3)
, id2 = rep(1:3, 3)
, penalty = 1:9
)
setkey(dtIn, id1, id2)
print(dtIn)
id1 id2 penalty
1: a 1 1
2: a 2 2
3: a 3 3
4: b 1 4
5: b 2 5
6: b 3 6
7: c 1 7
8: c 2 8
9: c 3 9
Desired output data.table:
id1 id2 penalty
1: a 1 1
2: b 2 5
3: c 3 9
I know how to implement the algorithm writing a loop: sort by penalty, loop through records and pick each pair in order if none of the nodes was previously matched. See code below.
But of course such a loop runs unbearably slow with my real-size data.
Manual loop function that is logically correct but performs way too poorly:
manualIter <- function(dtIn) {
setkey(dtIn, penalty, id1, id2) # Enusred ordered by penalty.
id1Match <- NULL; id2Match <- NULL; pen <- NULL;
for (i in seq_len(nrow(dtIn))) {
if (!(dtIn$id1[i] %in% id1Match) && !(dtIn$id2[i] %in% id2Match)) {
id1Match <- c(id1Match, dtIn$id1[i])
id2Match <- c(id2Match, dtIn$id2[i])
pen <- c(pen, dtIn$penalty[i])
}
}
# Build the return data.table for the matching ids.
dtf <- data.table(id1 = id1Match, id2 = id2Match, penalty = pen)
setkey(dtf, id1, id2)
return(dtf)
}
So the question is how to efficiently vectorize this algorithm?