4

I am very new to R and searched about this on forums but couldn't get a close enough solution for this. I am trying to do a mapping between the ip address & corresponding geo locations. I have 2 data sets.

Set-a (1,60,000 rows):
ip(int) | ID(int)

Set-b (16,00,000 rows):
Ip1(int) | Ip2(int) | Code(str) | Country(str) | Area1(str) | Area2(str)

I am trying to do the following: if ip lies between Ip1 & Ip2 then add Country & Region to Set-a.

I am doing the following (obviously not a very good way to do this):

ip1<-as.numeric(b$Ip1)
ip2<-as.numeric(b$Ip2)
country<-b$Country
area1<-b$Area1
area2<-b$Area2

for(i in 1:160000){
  for(j in 1:1674303){
    if(a[i]>ip1[j] & a[i]<ip2[j]) {
                                   a$country[i]<-country[j]
                                   a$area1[i]<-area1[j]
                                   a$area2[i]<-area2[j]}
   }
}

Can someone please tell me an efficient way to do this. This is taking a lot of time. (for i=1 to 100 took some 10 mins to run).

The sample data set-b is:

Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"

It is in continuos increasing order.

The dput(head(a)) & dput(head(b)) respectively are: (refer sample data above)

structure(IP_Addr = c("38825563", "38921619", "42470287", "42471923","42473368","42473428"), 
 Desc_value = c("0", "1.2", "4.97", "1", "5.9", "22.06")), .Names = c("IP_Addr", "Desc_value"), row.names = c(NA, 6L), class = "data.frame")

structure(list(Ip1 = c("0", "16777216", "16777472", "16778240", 
"16778496", "16778752"), Ip2 = c("16777215", "16777471", "16778239", 
"16778495", "16778751", "16779263"), Code = c("-", "AU", "CN", 
"AU", "AU", "AU"), Country = c("-", "AUSTRALIA", "CHINA", "AUSTRALIA", 
"AUSTRALIA", "AUSTRALIA"), Area1 = c("-", "QUEENSLAND", "FUJIAN", 
"VICTORIA", "NEW SOUTH WALES", "-"), Area2 = c("-", "SOUTH BRISBANE", 
"FUZHOU", "MELBOURNE", "SYDNEY", "-")), .Names = c("Ip1", "Ip2", 
"Code", "Country", "Area1", "Area2"), row.names = c(NA, 6L), class = "data.frame")
7
  • Maybe you could provide a small example data set? Commented May 20, 2013 at 13:08
  • 2
    This will be easier if the Ip1 and Ip2 ranges are different for every country/region in Set-b, so can you tell us if that's the case? And even if not, I bet if you first sort your sets so the IP values are in order, it'll lead to a simpler "screening" algorithm. Commented May 20, 2013 at 13:20
  • @Frank, Carl : Yes, the ranges are different & continuos increasing order. Sample data set is Ip1, Ip2, Code, Country, Area1, Area2 "16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE" "16777472","16778239","CN","CHINA","FUJIAN","FUZHOU" "16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE" Commented May 20, 2013 at 14:35
  • Please, a sample data set that we can read in. :) Also, it would be better to edit your question to add the information. I forgot to mention: you can use dput(a) and dput(b)... Commented May 20, 2013 at 14:42
  • 2
    Welcome to Stack Overflow! As you can see, we're very confused about the state of your data. Please create a reproducible example. As Frank mentioned, pasting the output of dput(head(a)) and dput(head(b)) would be very useful. Commented May 20, 2013 at 14:51

3 Answers 3

4

Here's a data.table solution:

# Let's take Blue Magister's example set:
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)

# include library, convert to data.table
library(data.table)

a = data.table(a)
b = data.table(b, key = "Ip1")

# and now the actual computation
a = b[a, roll = Inf][, Ip2 := NULL] # yep, amazingly, it's *that* simple in data.table
setnames(a, "Ip1", "ip")            # you can also include, exclude whatever columns you want
a
#          ip Code   Country      Area1          Area2
# 1: 16777995   CN     CHINA     FUJIAN         FUZHOU
# 2: 16777687   CN     CHINA     FUJIAN         FUZHOU
# 3: 16777871   CN     CHINA     FUJIAN         FUZHOU
# 4: 16778280   AU AUSTRALIA   VICTORIA      MELBOURNE
# 5: 16777346   AU AUSTRALIA QUEENSLAND SOUTH BRISBANE
# 6: 16777562   CN     CHINA     FUJIAN         FUZHOU
# 7: 16777637   CN     CHINA     FUJIAN         FUZHOU
# 8: 16777634   CN     CHINA     FUJIAN         FUZHOU
# 9: 16778161   CN     CHINA     FUJIAN         FUZHOU
#10: 16777875   CN     CHINA     FUJIAN         FUZHOU

Had Ip1 been an exhaustive list of numbers that ip could match, then above would simply be a merge (of Ip1 in b with first column of a, i.e. ip), but data.table also provides an option of what to do when there is no exact match. You can tell it to e.g. roll the previous observation forward (which is what I did above), or roll it back or roll to the nearest observation - see ?data.table for a little more information.

Sign up to request clarification or add additional context in comments.

2 Comments

Wow...Thanks a tonne. This does not take even 2 secs :) . I tried searching but still not very clear on how this is working.
@GauravNemade I added a short explanation, hope that helps
3

Couldn't you remove the 2nd loop using,

j = intersect(which(ip1 < x[i]), which(ip2 > x[i]))
if  (length(j)==1){
         a$country[i]<-country[j]
         a$area1[i]<-area1[j]
         a$area2[i]<-area2[j]
}else{
         cat("Multiple matches found!\n")  
}

2 Comments

Thanks for the help. Didn't know about intersect & which. This makes the process faster but it is still taking a long time (around 120 iterations/min - 22 hours for entire process). Is there some a faster way than this?
@GauravNemade - was data.table of findInterval faster? "Accept" the answer that worked best for you.
3

I would try findInterval:

#create example
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)
indices <- findInterval(a$ip,b$Ip1,rightmost.closed=FALSE,all.inside=FALSE)
a <- data.frame(a,b[indices,c("Country","Area1","Area2")])

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.