I have a pandas DataFrame (1 billion records) and need to look up location info from another DataFrame. The method below works, but I am wondering if there is a better way to perform this operation.

First, create the geo dataframe.

import pandas as pd
import shapefile
from matplotlib import path

#downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]

Second, create a dataframe that contains my data. It actually has 1 billion records.

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])

Third, look up geo info.

Is there a more efficient or cleaner way to write this? I feel like there might be a pandas way to avoid iterrows().

def get_location(row):    
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]

df.join(df.apply(get_location, axis=1))

  • Might be able to do boolean indexing instead of the for loop in get_location() to get your match. What does this yield? geo[geo['Path'].contains_point((-73.973943, 40.760632))] Also, take a look at geopandas. I think it could be handy here Commented Apr 12, 2017 at 3:31
  • geo[geo['Path'].contains_point((-73.973943, 40.760632))] yields boolean values and based on them the function returns the current neighborhood info, g. OK will check out geopandas. Commented Apr 12, 2017 at 3:37
  • Ok, yeah, but that line should allow you to omit the for loop, correct? Commented Apr 12, 2017 at 3:40
  • How can I avoid iterrows or for loop? Commented Apr 12, 2017 at 3:53
  • What did you end up doing? Commented Apr 14, 2017 at 20:12

2 Answers

The OP, E.K., found a nifty geopandas function called sjoin, which performs a spatial join.

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

Reading in the shape file

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')

Converting our pandas dataframe to a geopandas GeoDataFrame. Note: we assign the same coordinate reference system (CRS) as the shapefile, which is necessary in order to join the two frames.

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']] 
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)
print (geo.crs, gdf.crs)
>> {'init': 'epsg:4269'} {'init': 'epsg:4269'}
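
Passing crs=geo.crs only labels the points; it assumes the coordinates are already expressed in the shapefile's CRS. If your points were recorded in a different CRS, they would need reprojecting first. A minimal sketch of that check, assuming a recent geopandas that accepts "EPSG:..." strings; the toy polygon, the names geo_demo and pts_demo, and the choice of EPSG:4326 for the points are all made up for illustration:

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical stand-in for the neighborhood polygons, declared in EPSG:4269.
geo_demo = gpd.GeoDataFrame(
    {"Name": ["demo area"]},
    geometry=[box(-74.1, 40.6, -73.9, 40.8)],
    crs="EPSG:4269",
)

# Hypothetical points declared in a different CRS (EPSG:4326).
pts_demo = gpd.GeoDataFrame(
    {"h1": ["some data 1"]},
    geometry=[Point(-73.973943, 40.760632)],
    crs="EPSG:4326",
)

# Reproject the points only when the two CRSs differ.
if pts_demo.crs != geo_demo.crs:
    pts_demo = pts_demo.to_crs(geo_demo.crs)

print(pts_demo.crs == geo_demo.crs)
```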

And now the join using 'within', i.e. which points in gdf fall within the polygons of geo:

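# Note: since geopandas 0.10 the `op` keyword is deprecated in favor of `predicate`
# (i.e. predicate='within').
gpd.tools.sjoin(gdf, geo, how='left', op='within')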
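
To see what the left join returns without downloading the shapefile, here is a small self-contained sketch. The two square "neighborhoods", the lon/lat columns, and the names zones and pts_gdf are invented for illustration; the try/except is only there because the keyword is `predicate` in newer geopandas and `op` in older releases.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Two hypothetical neighborhoods: adjacent unit squares.
zones = gpd.GeoDataFrame(
    {"City": ["X", "X"], "Name": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Three points: one in each square, and one outside both.
pts = pd.DataFrame({"h1": ["a", "b", "c"],
                    "lon": [0.5, 1.5, 5.0],
                    "lat": [0.5, 0.5, 5.0]})
pts_gdf = gpd.GeoDataFrame(
    pts,
    geometry=gpd.points_from_xy(pts["lon"], pts["lat"]),
    crs=zones.crs,
)

try:
    joined = gpd.sjoin(pts_gdf, zones, how="left", predicate="within")
except TypeError:  # older geopandas releases expect op= instead
    joined = gpd.sjoin(pts_gdf, zones, how="left", op="within")

# With how="left", every point is kept; unmatched points get missing values.
print(joined[["h1", "City", "Name"]])
```

Point "c" lies outside both squares, so its City and Name come back as missing values rather than raising an error.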

Some timing notes:

The OP's solution

import pandas as pd
import shapefile
from matplotlib import path

#downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])

def get_location(row):    
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]

%timeit df.join(df.apply(get_location, axis=1))
>> 10 loops, best of 3: 91.1 ms per loop

My 1st answer using geopandas, apply() and boolean indexing

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')  

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
df['geometry'] = [Point(xy) for xy in df['latlon']] 


def get_location(row):  
    return pd.Series(geo[geo.contains(row['geometry'])][['City', 'Name']].values[0])

%timeit df.join(df.apply(get_location, axis=1))
>> 100 loops, best of 3: 15.3 ms per loop

Using sjoin

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']] 
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)

%timeit gpd.tools.sjoin(gdf, geo, how='left', op='within')
>> 10 loops, best of 3: 53.3 ms per loop

Although sjoin isn't the fastest on this two-row sample, it might be the best option: it handles points with no match and offers more functionality in join types and operations.
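
For something the size of the OP's data, one option is to run sjoin over the points in chunks, so only one chunk's geometries are in memory at a time. This is only a sketch under assumptions, not something I have timed: the function name locate_in_chunks, the lon/lat column names, the chunk size, and the toy polygons are all hypothetical, and it assumes at least one input row and non-overlapping polygons.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box


def locate_in_chunks(points, polygons, chunk_size=1_000_000):
    """Left-join points (hypothetical 'lon'/'lat' columns) to polygons, chunk by chunk."""
    parts = []
    for start in range(0, len(points), chunk_size):
        chunk = points.iloc[start:start + chunk_size].copy()
        chunk_gdf = gpd.GeoDataFrame(
            chunk,
            geometry=gpd.points_from_xy(chunk["lon"], chunk["lat"]),
            crs=polygons.crs,
        )
        try:
            joined = gpd.sjoin(chunk_gdf, polygons, how="left", predicate="within")
        except TypeError:  # older geopandas releases expect op= instead
            joined = gpd.sjoin(chunk_gdf, polygons, how="left", op="within")
        # Keep only the attribute columns so each finished chunk stays small.
        parts.append(pd.DataFrame(joined[list(points.columns) + ["City", "Name"]]))
    return pd.concat(parts)


# Toy data standing in for the shapefile and the real records.
zones = gpd.GeoDataFrame(
    {"City": ["X", "X"], "Name": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)
sample = pd.DataFrame({"h1": ["a", "b", "c"],
                       "lon": [0.5, 1.5, 5.0],
                       "lat": [0.5, 0.5, 5.0]})

located = locate_in_chunks(sample, zones, chunk_size=2)
print(located)
```

If the full table does not fit in memory to begin with, the same loop body could be fed from a chunked reader such as pd.read_csv(..., chunksize=...) instead of slicing an in-memory frame.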

1 Comment

This is great! Thank you so much Bob!

This answer avoids the iterrows approach (and is thus faster), but it's still using apply(axis=1), which is not great, especially given your estimate of a billion rows. Also, I'm using geopandas and shapely here.

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geopandas has read_file() which is great for shape files

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')

Using Shapely to handle the points

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
df['point'] = [Point(xy) for xy in df['latlon']] 

Using geopandas contains() and some boolean indexing. Note: you might have to put in some logic to handle 'no match' situations, because .values[0] raises an IndexError when no polygon contains the point.

def get_location(row):  
    return pd.Series(geo[geo.contains(row['point'])][['City', 'Name']].values[0])

df.join(df.apply(get_location, axis=1))
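
One possible way to handle the 'no match' case is to check whether the boolean mask selected any rows before indexing. A minimal sketch with made-up square polygons; get_location_safe, zones, and pts are hypothetical names, and it assumes that returning missing values is acceptable for unmatched points:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, box

# Hypothetical polygons standing in for the neighborhood shapefile.
zones = gpd.GeoDataFrame(
    {"City": ["X", "X"], "Name": ["West", "East"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)


def get_location_safe(row, polygons):
    """Return City/Name of the first polygon containing the point, or missing values."""
    hits = polygons.loc[polygons.contains(row["point"]), ["City", "Name"]]
    if hits.empty:
        # No polygon contains this point: return placeholders instead of raising.
        return pd.Series({"City": None, "Name": None})
    return hits.iloc[0]


pts = pd.DataFrame({"h1": ["inside", "outside"]})
pts["point"] = [Point(0.5, 0.5), Point(5, 5)]

result = pts.join(pts.apply(get_location_safe, axis=1, polygons=zones))
print(result[["h1", "City", "Name"]])
```

This still calls apply(axis=1), so it shares the per-row overhead mentioned above.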
