The OP, E.K., found a nifty geopandas function called sjoin (spatial join).
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
Reading in the shapefile:
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
Converting our pandas dataframe to a geopandas dataframe. Note: we're using the same Coordinate Reference System (CRS) as the shapefile. This is necessary for the two frames to be joined.
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']]
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)
print(geo.crs, gdf.crs)
>> {'init': 'epsg:4269'} {'init': 'epsg:4269'}
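If the two frames did not already share a CRS, one option is to reproject one of them before joining. The following is only a minimal sketch: `geo_toy` is a hypothetical one-polygon stand-in for the shapefile, and the assumption that the points were recorded in EPSG:4326 is made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical stand-in for the shapefile: one rough box around Manhattan, in NAD83.
geo_toy = gpd.GeoDataFrame(
    {"Name": ["box"]},
    geometry=[Polygon([(-74.05, 40.68), (-73.90, 40.68),
                       (-73.90, 40.88), (-74.05, 40.88)])],
    crs="EPSG:4269",
)

# Assumption for this sketch: the points were recorded in WGS84 (EPSG:4326).
pts = gpd.GeoDataFrame(
    {"h1": ["some data 1", "some data 2"]},
    geometry=[Point(-73.973943, 40.760632), Point(-74.010087, 40.709546)],
    crs="EPSG:4326",
)

# Reproject only when the CRS differ, so both frames agree before the join.
if pts.crs != geo_toy.crs:
    pts = pts.to_crs(geo_toy.crs)

print(pts.crs == geo_toy.crs)
```

Passing `crs=` to the `GeoDataFrame` constructor only labels the coordinates, whereas `to_crs` actually transforms them, so which one is appropriate depends on what system the raw lat/lon values are really in.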
And now the join using 'within', i.e. which points in gdf fall within the polygons of geo:
gpd.tools.sjoin(gdf, geo, how='left', op='within')
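To see what the join returns without downloading the shapefile, here is a small sketch on made-up data. The two square polygons, their names, and the helper `join_within` are hypothetical. The helper tries the `predicate` keyword first, which newer geopandas releases use in place of `op`, and falls back to `op` otherwise.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical polygons standing in for the neighborhood shapes.
areas = gpd.GeoDataFrame(
    {"Name": ["West", "East"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)
pts = gpd.GeoDataFrame(
    {"h1": ["some data 1", "some data 2"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5)],
    crs=areas.crs,  # same CRS as the polygons, as in the answer
)

def join_within(left, right, how="left"):
    """Spatial join with the 'within' test, using whichever keyword is accepted."""
    try:
        return gpd.sjoin(left, right, how=how, predicate="within")
    except TypeError:
        # Older geopandas releases only know the keyword as `op`.
        return gpd.sjoin(left, right, how=how, op="within")

joined = join_within(pts, areas)
print(joined[["h1", "Name"]])
```

Each row of the left frame comes back with the attribute columns of the polygon that contains it, which is the City/Name lookup the question was after.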
Some timing notes:
The OP's solution
import pandas as pd
import shapefile
from matplotlib import path
# downloaded and unzipped https://www.zillowstatic.com/static/shp/ZillowNeighborhoods-NY.zip
sf = shapefile.Reader('ZillowNeighborhoods-NY.shp')
cols = ['State', 'County', 'City', 'Name', 'RegionID']
geo = pd.DataFrame(sf.records(), columns=cols)
geo['Path'] = [path.Path(s.points) for s in sf.iterShapes()]
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
def get_location(row):
    for _, g in geo.iterrows():
        match = g.Path.contains_point(row['latlon'])
        if match:
            return g[['City', 'Name']]
%timeit df.join(df.apply(get_location, axis=1))
>> 10 loops, best of 3: 91.1 ms per loop
My 1st answer using geopandas, apply() and boolean indexing
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
df['geometry'] = [Point(xy) for xy in df['latlon']]
def get_location(row):
    return pd.Series(geo[geo.contains(row['geometry'])][['City', 'Name']].values[0])
%timeit df.join(df.apply(get_location, axis=1))
>> 100 loops, best of 3: 15.3 ms per loop
Using sjoin
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
geo = gpd.read_file('ZillowNeighborhoods-NY.shp')
df = pd.DataFrame([('some data 1', (-73.973943, 40.760632)),
                   ('some data 2', (-74.010087, 40.709546))],
                  columns=['h1', 'latlon'])
geometry = [Point(xy) for xy in df['latlon']]
gdf = gpd.GeoDataFrame(df, crs=geo.crs, geometry=geometry)
%timeit gpd.tools.sjoin(gdf, geo, how='left', op='within')
>> 10 loops, best of 3: 53.3 ms per loop
Although sjoin isn't the fastest in these timings, it might be the best choice: it handles points with no match and offers more join types and spatial operations.
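The "handles no matches" point can be illustrated with a rough sketch on made-up data (one hypothetical unit square and two points, the second lying outside it). It relies on sjoin's default spatial test, intersects, which gives the same matches as 'within' for points strictly inside a polygon.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical data: one unit square and two points, the second outside it.
areas = gpd.GeoDataFrame(
    {"Name": ["Square"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs="EPSG:4326",
)
pts = gpd.GeoDataFrame(
    {"h1": ["inside", "outside"]},
    geometry=[Point(0.5, 0.5), Point(5, 5)],
    crs=areas.crs,
)

# Boolean indexing: an empty selection makes .values[0] raise IndexError.
try:
    areas[areas.contains(pts.geometry.iloc[1])][["Name"]].values[0]
    lookup_failed = False
except IndexError:
    lookup_failed = True

# sjoin with its default test (intersects).
left = gpd.sjoin(pts, areas, how="left")    # keeps unmatched points; Name is NaN
inner = gpd.sjoin(pts, areas, how="inner")  # drops unmatched points

print(lookup_failed, len(left), len(inner))
print(left.set_index("h1")["Name"].isna().to_dict())
```

With `how='left'` the unmatched point survives with missing neighborhood columns, whereas the boolean-indexing lookup would need its own guard for the empty case.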
geo[geo['Path'].contains_point((-73.973943, 40.760632))] Also, take a look at geopandas. I think it could be handy here.
geo[geo['Path'].contains_point((-73.973943, 40.760632))] yields boolean values and based on them the function returns the current neighborhood info, g. OK will check out geopandas.
iterrows or for loop?