Pandas dataframe apply function that has iterrows

Question

I have a pandas dataframe (1 billion records) and need to look up location info from another dataframe. My method works, but I am wondering whether there is a better way to perform this operation.

First, create the geo dataframe.

import pandas as pd
import shapefile
from matplotlib import path

#downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]

Second, create a dataframe that contains my data. It actually has 1 billion records.

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])

Third, look up geo info.

Is there a more efficient or cleaner way to write this? I feel like there might be a pandas way to avoid iterrows().

def get_location(row):    
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]

df.join(df.apply(get_location, axis=1))

Might be able to do boolean indexing instead of the for loop in get_location() to get your match. What does this yield? geo[geo['Path'].contains_point((-73.973943, 40.760632))] Also, take a look at geopandas. I think it could be handy here.
– Bob Haffner, Commented Apr 12, 2017 at 3:31
geo[geo['Path'].contains_point((-73.973943, 40.760632))] yields boolean values and based on them the function returns the current neighborhood info, g. OK, will check out geopandas.
– E.K., Commented Apr 12, 2017 at 3:37
Ok, yeah, but that line should allow you to omit the for loop, correct?
– Bob Haffner, Commented Apr 12, 2017 at 3:40

Bob Haffner · Accepted Answer · 2017-04-15 18:14:57Z

The OP, E.K., found a nifty geopandas function called sjoin (spatial join).

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

Reading in the shape file

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')

Converting our pandas dataframe to a geopandas dataframe. Note: we're using the same Coordinate Reference System (CRS) as the shape file. This is necessary in order to join the two frames together.

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']] 
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)
print (geo.crs, gdf.crs)
>> {'init': 'epsg:4269'} {'init': 'epsg:4269'}

And now the join using 'within', i.e. which points in gdf fall within the polygons of geo. (Recent GeoPandas releases name this keyword predicate rather than op.)

gpd.tools.sjoin(gdf, geo, how='left', op='within')
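
To see what the left join returns without downloading the shape file, here is a minimal sketch. The two square "neighborhoods", the 'Town'/'West'/'East' labels and the third out-of-range point are made up for illustration, and it assumes a GeoPandas version that accepts the predicate keyword.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two made-up square "neighborhoods" standing in for the Zillow shapes.
geo = gpd.GeoDataFrame(
    {'City': ['Town', 'Town'], 'Name': ['West', 'East']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
    crs='EPSG:4269',
)

df = pd.DataFrame([('some data 1', (0.5, 0.5)),
                   ('some data 2', (1.5, 0.5)),
                   ('some data 3', (5.0, 5.0))],  # lies outside both squares
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']]
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)

# Assumption: a GeoPandas release that uses `predicate`; older ones use `op`.
joined = gpd.sjoin(gdf, geo, how='left', predicate='within')
print(joined[['h1', 'City', 'Name']])
```

With how='left' every row of gdf is kept, so the point that falls in no polygon comes back with missing values in City and Name instead of raising an error.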

Some timing notes:

The OP's solution

import pandas as pd
import shapefile
from matplotlib import path

#downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])

def get_location(row):    
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]

%timeit df.join(df.apply(get_location, axis=1))
>> 10 loops, best of 3: 91.1 ms per loop

My 1st answer using geopandas, apply() and boolean indexing

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')  

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
df['geometry'] = [Point(xy) for xy in df['latlon']] 


def get_location(row):  
    return pd.Series(geo[geo.contains(row['geometry'])][['City', 'Name']].values[0])

%timeit df.join(df.apply(get_location, axis=1))
>> 100 loops, best of 3: 15.3 ms per loop

Using sjoin

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']] 
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)

%timeit gpd.tools.sjoin(gdf, geo, how='left', op='within')
>> 10 loops, best of 3: 53.3 ms per loop

Although sjoin isn't the fastest on this two-row sample, it might be the best option: it handles points with no match and offers more join types and spatial operations.
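
If you would rather stay with matplotlib's Path objects from the question, another option is to turn the loops inside out: iterate over the (few) polygons and test all points at once with Path.contains_points. This is only a sketch with made-up square polygons and labels in place of the shape file, and it has not been timed against the approaches above.

```python
import numpy as np
import pandas as pd
from matplotlib import path

# Hypothetical stand-in for the `geo` frame built from the shape file.
geo = pd.DataFrame({'City': ['Town', 'Town'], 'Name': ['West', 'East']})
geo['Path'] = [path.Path([(0, 0), (1, 0), (1, 1), (0, 1)]),
               path.Path([(1, 0), (2, 0), (2, 1), (1, 1)])]

df = pd.DataFrame([('some data 1', (0.5, 0.5)),
                   ('some data 2', (1.5, 0.5)),
                   ('some data 3', (5.0, 5.0))],  # lies outside both squares
                  columns=['h1', 'latlon'])

points = np.array(df['latlon'].tolist())      # shape (n_points, 2)
city = np.full(len(df), None, dtype=object)   # None marks "no match yet"
name = np.full(len(df), None, dtype=object)

# One vectorized contains_points call per polygon instead of one
# contains_point call per (point, polygon) pair.
for g in geo.itertuples(index=False):
    inside = g.Path.contains_points(points)
    first_hit = inside & pd.isna(name)        # keep the first match only
    city[first_hit] = g.City
    name[first_hit] = g.Name

result = df.assign(City=city, Name=name)
print(result)
```

Points that fall in no polygon keep a missing value in City and Name. For a very large frame the same loop could be run on one chunk of points at a time.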

Bob Haffner · Accepted Answer · 2017-04-12 17:36:36Z

This answer avoids the iterrows approach (and is thus faster), but it's still using apply(axis=1), which is not great, especially given your estimate of a billion rows. Also, I'm using geopandas and shapely here.

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

geopandas has read_file() which is great for shape files

geo = gpd.read_file('ZillowNeighborhoods-NY.shp')

Using Shapely to handle the points

df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))], 
                  columns=['h1', 'latlon'])
df['point'] = [Point(xy) for xy in df['latlon']]

Using geopandas contains() and some boolean indexing. Note: you might have to put in some logic to handle 'no match' situations, because .values[0] raises an IndexError when no polygon contains the point.

def get_location(row):  
    return pd.Series(geo[geo.contains(row['point'])][['City', 'Name']].values[0])

df.join(df.apply(get_location, axis=1))
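
One possible way to add that 'no match' logic is sketched below. The square polygons, the labels and the out-of-range second point are made up so the example runs without the shape file, and returning None for both columns is just one choice of placeholder.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Made-up polygons standing in for the Zillow neighborhoods.
geo = gpd.GeoDataFrame(
    {'City': ['Town', 'Town'], 'Name': ['West', 'East']},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
)

df = pd.DataFrame([('some data 1', (0.5, 0.5)),
                   ('some data 2', (5.0, 5.0))],  # lies outside both squares
                  columns=['h1', 'latlon'])
df['point'] = [Point(xy) for xy in df['latlon']]

def get_location(row):
    # Rows of geo whose polygon contains this point (possibly none).
    hits = geo.loc[geo.contains(row['point']), ['City', 'Name']]
    if hits.empty:
        return pd.Series({'City': None, 'Name': None})
    return hits.iloc[0]  # first match, like .values[0] above

located = df.join(df.apply(get_location, axis=1))
print(located[['h1', 'City', 'Name']])
```

This is still apply(axis=1), so it only fixes the missing-match case, not the speed.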
