How to select rows in Python Dataframe with a condition based on a function using a column

Question

I have a dataframe df that looks like that :

id          id_latlong
1          (46.1988400;5.209562)
2          (46.1988400;5.209562)
3          (46.1988400;5.209562)
4          (46.1988400;5.209562)
5         (46.438805;5.11890299)
6         (46.222993;5.21707600)
7           (46.195183;5.212575)
8           (46.195183;5.212575)
9           (46.195183;5.212575)
10          (48.917459;2.570821)
11          (48.917459;2.570821)

Every row is a location and the data in the column "id_latlong" are coordinates.

I want to select the id of every location that is at less than 800 meters from a defined location :

defined_location_latlong = "(46.1988400;5.209562)"

I have a function that calcule the distance, in meters, between two coordinates:

def distance_btw_coordinates (id_latlong1, id_latlong2) :
    try : 
        R = 6372800  # Earth radius in meters

        lat1 = float(id_latlong1.partition('(')[2].partition(';')[0])
        lon1 = float(id_latlong1.partition(';')[2].partition(')')[0])

        lat2 = float(id_latlong2.partition('(')[2].partition(';')[0])
        lon2 = float(id_latlong2.partition(';')[2].partition(')')[0])

        phi1, phi2 = math.radians(lat1), math.radians(lat2) 
        dphi       = math.radians(lat2 - lat1)
        dlambda    = math.radians(lon2 - lon1)

        a = math.sin(dphi/2)**2 + \
            math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2

        distance = 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))
    except :
        distance = 1000000000

    return distance

In order to select every row that is at less than 800 meters from the defined location, I tried :

df.loc[distance_btw_cohordonates(df['id_latlong'], defined_location_latlong ) < 800]

But it doesn't work :

KeyError: False

It doesn't work because the function takes all the data in the column "id_latlong" at once...

Do you know how I could do this without having to iterate ?

Thank you !

EDIT : I have 500k different defined locations, I would prefer not having to stock the distance between every row in df and every defined location... Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

jezrael · Accepted Answer · 2020-05-02 05:17:36Z

3

I think you need processing function for each value of column separately by Series.apply:

s = df['id_latlong'].apply(lambda x: distance_btw_coordinates(x, defined_location_latlong))
print (s)
0     1000000000
1     1000000000
2     1000000000
3     1000000000
4     1000000000
5     1000000000
6     1000000000
7     1000000000
8     1000000000
9     1000000000
10    1000000000
Name: id_latlong, dtype: int64

df.loc[s < 800]

EDIT:

Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

One idea is use vectorizes function haversine_np, but is necessary change your code for parsing strings to columns and also to numeric:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
print (df)
    id              id_latlong        lat      long
0    1   (46.1988400;5.209562)  46.198840  5.209562
1    2   (46.1988400;5.209562)  46.198840  5.209562
2    3   (46.1988400;5.209562)  46.198840  5.209562
3    4   (46.1988400;5.209562)  46.198840  5.209562
4    5  (46.438805;5.11890299)  46.438805  5.118903
5    6  (46.222993;5.21707600)  46.222993  5.217076
6    7    (46.195183;5.212575)  46.195183  5.212575
7    8    (46.195183;5.212575)  46.195183  5.212575
8    9    (46.195183;5.212575)  46.195183  5.212575
9   10    (48.917459;2.570821)  48.917459  2.570821
10  11    (48.917459;2.570821)  48.917459  2.570821

lat, long = tuple(map(float, defined_location_latlong.strip('()').split(';')))
print (lat, long)
46.19884 5.209562

s = haversine_np(df['long'], df['lat'], lat, long)
print (s)
0     6016.063040
1     6016.063040
2     6016.063040
3     6016.063040
4     6037.462224
5     6017.186477
6     6015.635700
7     6015.635700
8     6015.635700
9     6353.080382
10    6353.080382
dtype: float64

#km output
df.loc[s < 0.8]

EDIT1:

For improve performance of spliting is possible use:

#550000 rows for test
df = pd.concat([df] * 50000, ignore_index=True)

df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)

print (df)

In [38]: %timeit df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
2.49 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [39]: %timeit df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
937 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

edited May 2, 2020 at 5:17

answered May 2, 2020 at 4:26

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

SimbaIsLearning Over a year ago

Thank you. Do you know if it's possible in one step, without having to stock every distance ?

SimbaIsLearning Over a year ago

It calculates very quickly the distances, but I still have the intermediate step with the saved distances. The dataframe has 700k rows and there is 500k defined locations, that would mean saving 350 billions distances.

jezrael Over a year ago

@SimbaIsLearning hmm, so need faster df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float) ?

Matthew Borish Over a year ago

Can you explain the results of: s = haversine_np(df['long'], df['lat'], lat, long)? It seems like there should be more variation in the distances.

jezrael Over a year ago

@MatthewBorish - Agree, I use simple vectorized haversine, but if OP need it not idea.

|

Matthew Borish · Accepted Answer · 2020-05-02 08:25:35Z

pd.set_option('display.float_format', lambda x: '%.6f' % x)    
from scipy.spatial import KDTree
import pandas as pd
df = pd.read_clipboard()
print(df)

    id  id_latlong
0   1   (46.1988400;5.209562)
1   2   (46.1988400;5.209562)
2   3   (46.1988400;5.209562)
3   4   (46.1988400;5.209562)
4   5   (46.438805;5.11890299)
5   6   (46.222993;5.21707600)
6   7   (46.195183;5.212575)
7   8   (46.195183;5.212575)
8   9   (46.195183;5.212575)
9   10  (48.917459;2.570821)
10  11  (48.917459;2.570821)

Condition the df.

df = df['id_latlong'].str.split(";", expand=True)
df['lat'] = df[0].str.replace('(', '')
df['lon'] = df[1].str.replace(')', '')
df['lat'] = pd.to_numeric(df['lat'])
df['lon'] = pd.to_numeric(df['lon'])
print(df.head(3))

      0          1           lat         lon
0   (46.1988400 5.209562)   46.19884    5.209562
1   (46.1988400 5.209562)   46.19884    5.209562
2   (46.1988400 5.209562)   46.19884    5.209562

Convert to UTM 31 N so we can have meters distance instead of lat/long.

dl_df = pd.DataFrame({'lat':[46.1988400], 'lon':5.209562})
dl_gdf = gpd.GeoDataFrame(dl_df, geometry=gpd.points_from_xy(dl_df.lon, dl_df.lat))

dl_gdf.crs = 4326
dl_gdf = dl_gdf.to_crs(32631)

dl_gdf['E'] = dl_gdf['geometry'].x
dl_gdf['N'] = dl_gdf['geometry'].y

print(dl_gdf)
    lat lon geometry    E   N
0   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))

gdf.crs = 4326
gdf = gdf.to_crs(32631)

gdf['E'] = gdf['geometry'].x
gdf['N'] = gdf['geometry'].y

print(gdf)

0   1   lat lon geometry    E   N
0   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
1   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
2   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
3   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
4   (46.438805  5.11890299) 46.438805   5.118903    POINT (662767.928 5144985.070)  662767.928322   5144985.069816
5   (46.222993  5.21707600) 46.222993   5.217076    POINT (670980.609 5121213.169)  670980.608959   5121213.168557
6   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
7   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
8   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
9   (48.917459  2.570821)   48.917459   2.570821    POINT (468556.965 5418368.922)  468556.964829   5418368.922484
10  (48.917459  2.570821)   48.917459   2.570821    POINT (468556.965 5418368.922)  468556.964829   5418368.922484

Find the distances in meters with KDTree. If there was more than one row in dl_gdf the indices in new_df would be the closest point.

join_cols = ['E', 'N']
tree = KDTree(dl_gdf[join_cols])
distance, indices = tree.query(gdf[join_cols])

new_df = pd.DataFrame({'distance':distance, 'indices': indices})

print(new_df)

    distance    indices
0   0.000000    0
1   0.000000    0
2   0.000000    0
3   0.000000    0
4   27571.018688    0
5   2746.525845 0
6   468.301937  0
7   468.301937  0
8   468.301937  0
9   361503.217161   0
10  361503.217161   0

Get the rows with points < 800m.

less_than_800m_df = new_df.loc[new_df['distance'] < 800]
print(less_than_800m_df)

    distance    indices
0   0.000000    0
1   0.000000    0
2   0.000000    0
3   0.000000    0
6   468.301937  0
7   468.301937  0
8   468.301937  0

Here is a validation image from QGIS. The precision isn't great with the manual measurement, but the results in the new_df look correct.

Here is a closeup for new_df idxs 6,7, and 8.

Collectives™ on Stack Overflow

How to select rows in Python Dataframe with a condition based on a function using a column

2 Answers 2

6 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related