Replacing nested loop in R

Question

I am very new to R and searched about this on forums but couldn't get a close enough solution for this. I am trying to do a mapping between the ip address & corresponding geo locations. I have 2 data sets.

Set-a (1,60,000 rows):
ip(int) | ID(int)

Set-b (16,00,000 rows):
Ip1(int) | Ip2(int) | Code(str) | Country(str) | Area1(str) | Area2(str)

I am trying to do the following: if ip lies between Ip1 & Ip2 then add Country & Region to Set-a.

I am doing the following (obviously not a very good way to do this):

ip1<-as.numeric(b$Ip1)
ip2<-as.numeric(b$Ip2)
country<-b$Country
area1<-b$Area1
area2<-b$Area2

for(i in 1:160000){
  for(j in 1:1674303){
    if(a[i]>ip1[j] & a[i]<ip2[j]) {
                                   a$country[i]<-country[j]
                                   a$area1[i]<-area1[j]
                                   a$area2[i]<-area2[j]}
   }
}

Can someone please tell me an efficient way to do this. This is taking a lot of time. (for i=1 to 100 took some 10 mins to run).

The sample data set-b is:

Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"

It is in continuos increasing order.

The dput(head(a)) & dput(head(b)) respectively are: (refer sample data above)

structure(IP_Addr = c("38825563", "38921619", "42470287", "42471923","42473368","42473428"), 
 Desc_value = c("0", "1.2", "4.97", "1", "5.9", "22.06")), .Names = c("IP_Addr", "Desc_value"), row.names = c(NA, 6L), class = "data.frame")

structure(list(Ip1 = c("0", "16777216", "16777472", "16778240", 
"16778496", "16778752"), Ip2 = c("16777215", "16777471", "16778239", 
"16778495", "16778751", "16779263"), Code = c("-", "AU", "CN", 
"AU", "AU", "AU"), Country = c("-", "AUSTRALIA", "CHINA", "AUSTRALIA", 
"AUSTRALIA", "AUSTRALIA"), Area1 = c("-", "QUEENSLAND", "FUJIAN", 
"VICTORIA", "NEW SOUTH WALES", "-"), Area2 = c("-", "SOUTH BRISBANE", 
"FUZHOU", "MELBOURNE", "SYDNEY", "-")), .Names = c("Ip1", "Ip2", 
"Code", "Country", "Area1", "Area2"), row.names = c(NA, 6L), class = "data.frame")

This will be easier if the Ip1 and Ip2 ranges are different for every country/region in Set-b, so can you tell us if that's the case? And even if not, I bet if you first sort your sets so the IP values are in order, it'll lead to a simpler "screening" algorithm. — Carl Witthoft
– Carl Witthoft, Commented May 20, 2013 at 13:20
@Frank, Carl : Yes, the ranges are different & continuos increasing order. Sample data set is Ip1, Ip2, Code, Country, Area1, Area2 "16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE" "16777472","16778239","CN","CHINA","FUJIAN","FUZHOU" "16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE" — Gaurav Nemade
– Gaurav Nemade, Commented May 20, 2013 at 14:35
Please, a sample data set that we can read in. :) Also, it would be better to edit your question to add the information. I forgot to mention: you can use dput(a) and dput(b)... — Frank
– Frank, Commented May 20, 2013 at 14:42
Welcome to Stack Overflow! As you can see, we're very confused about the state of your data. Please create a reproducible example. As Frank mentioned, pasting the output of dput(head(a)) and dput(head(b)) would be very useful. — Blue Magister
– Blue Magister, Commented May 20, 2013 at 14:51

eddi · Accepted Answer · 2013-05-21 14:47:30Z

4

Here's a data.table solution:

# Let's take Blue Magister's example set:
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)

# include library, convert to data.table
library(data.table)

a = data.table(a)
b = data.table(b, key = "Ip1")

# and now the actual computation
a = b[a, roll = Inf][, Ip2 := NULL] # yep, amazingly, it's *that* simple in data.table
setnames(a, "Ip1", "ip")            # you can also include, exclude whatever columns you want
a
#          ip Code   Country      Area1          Area2
# 1: 16777995   CN     CHINA     FUJIAN         FUZHOU
# 2: 16777687   CN     CHINA     FUJIAN         FUZHOU
# 3: 16777871   CN     CHINA     FUJIAN         FUZHOU
# 4: 16778280   AU AUSTRALIA   VICTORIA      MELBOURNE
# 5: 16777346   AU AUSTRALIA QUEENSLAND SOUTH BRISBANE
# 6: 16777562   CN     CHINA     FUJIAN         FUZHOU
# 7: 16777637   CN     CHINA     FUJIAN         FUZHOU
# 8: 16777634   CN     CHINA     FUJIAN         FUZHOU
# 9: 16778161   CN     CHINA     FUJIAN         FUZHOU
#10: 16777875   CN     CHINA     FUJIAN         FUZHOU

Had Ip1 been an exhaustive list of numbers that ip could match, then above would simply be a merge (of Ip1 in b with first column of a, i.e. ip), but data.table also provides an option of what to do when there is no exact match. You can tell it to e.g. roll the previous observation forward (which is what I did above), or roll it back or roll to the nearest observation - see ?data.table for a little more information.

edited May 21, 2013 at 14:47

answered May 20, 2013 at 15:16

eddi

49.5k6 gold badges109 silver badges157 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Gaurav Nemade Over a year ago

Wow...Thanks a tonne. This does not take even 2 secs :) . I tried searching but still not very clear on how this is working.

eddi Over a year ago

@GauravNemade I added a short explanation, hope that helps

harkmug · Accepted Answer · 2013-05-20 13:10:36Z

3

Couldn't you remove the 2nd loop using,

j = intersect(which(ip1 < x[i]), which(ip2 > x[i]))
if  (length(j)==1){
         a$country[i]<-country[j]
         a$area1[i]<-area1[j]
         a$area2[i]<-area2[j]
}else{
         cat("Multiple matches found!\n")  
}

answered May 20, 2013 at 13:10

harkmug

2,79523 silver badges26 bronze badges

2 Comments

Gaurav Nemade Over a year ago

Thanks for the help. Didn't know about intersect & which. This makes the process faster but it is still taking a long time (around 120 iterations/min - 22 hours for entire process). Is there some a faster way than this?

harkmug Over a year ago

@GauravNemade - was data.table of findInterval faster? "Accept" the answer that worked best for you.

Blue Magister · Accepted Answer · 2013-05-20 15:04:08Z

3

I would try findInterval:

#create example
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)
indices <- findInterval(a$ip,b$Ip1,rightmost.closed=FALSE,all.inside=FALSE)
a <- data.frame(a,b[indices,c("Country","Area1","Area2")])

edited May 20, 2013 at 15:04

answered May 20, 2013 at 14:57

Blue Magister

13.4k5 gold badges40 silver badges58 bronze badges

Collectives™ on Stack Overflow

Replacing nested loop in R

3 Answers 3

2 Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related