3

I have a dataframe df that looks like that :

id          id_latlong
1          (46.1988400;5.209562)
2          (46.1988400;5.209562)
3          (46.1988400;5.209562)
4          (46.1988400;5.209562)
5         (46.438805;5.11890299)
6         (46.222993;5.21707600)
7           (46.195183;5.212575)
8           (46.195183;5.212575)
9           (46.195183;5.212575)
10          (48.917459;2.570821)
11          (48.917459;2.570821)

Every row is a location and the data in the column "id_latlong" are coordinates.

I want to select the id of every location that is at less than 800 meters from a defined location :

defined_location_latlong = "(46.1988400;5.209562)"

I have a function that calcule the distance, in meters, between two coordinates:

def distance_btw_coordinates (id_latlong1, id_latlong2) :
    try : 
        R = 6372800  # Earth radius in meters

        lat1 = float(id_latlong1.partition('(')[2].partition(';')[0])
        lon1 = float(id_latlong1.partition(';')[2].partition(')')[0])

        lat2 = float(id_latlong2.partition('(')[2].partition(';')[0])
        lon2 = float(id_latlong2.partition(';')[2].partition(')')[0])

        phi1, phi2 = math.radians(lat1), math.radians(lat2) 
        dphi       = math.radians(lat2 - lat1)
        dlambda    = math.radians(lon2 - lon1)

        a = math.sin(dphi/2)**2 + \
            math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2

        distance = 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))
    except :
        distance = 1000000000

    return distance

In order to select every row that is at less than 800 meters from the defined location, I tried :

df.loc[distance_btw_cohordonates(df['id_latlong'], defined_location_latlong ) < 800]

But it doesn't work :

KeyError: False

It doesn't work because the function takes all the data in the column "id_latlong" at once...

Do you know how I could do this without having to iterate ?

Thank you !

EDIT : I have 500k different defined locations, I would prefer not having to stock the distance between every row in df and every defined location... Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

2 Answers 2

3

I think you need processing function for each value of column separately by Series.apply:

s = df['id_latlong'].apply(lambda x: distance_btw_coordinates(x, defined_location_latlong))
print (s)
0     1000000000
1     1000000000
2     1000000000
3     1000000000
4     1000000000
5     1000000000
6     1000000000
7     1000000000
8     1000000000
9     1000000000
10    1000000000
Name: id_latlong, dtype: int64

df.loc[s < 800]

EDIT:

Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

One idea is use vectorizes function haversine_np, but is necessary change your code for parsing strings to columns and also to numeric:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
print (df)
    id              id_latlong        lat      long
0    1   (46.1988400;5.209562)  46.198840  5.209562
1    2   (46.1988400;5.209562)  46.198840  5.209562
2    3   (46.1988400;5.209562)  46.198840  5.209562
3    4   (46.1988400;5.209562)  46.198840  5.209562
4    5  (46.438805;5.11890299)  46.438805  5.118903
5    6  (46.222993;5.21707600)  46.222993  5.217076
6    7    (46.195183;5.212575)  46.195183  5.212575
7    8    (46.195183;5.212575)  46.195183  5.212575
8    9    (46.195183;5.212575)  46.195183  5.212575
9   10    (48.917459;2.570821)  48.917459  2.570821
10  11    (48.917459;2.570821)  48.917459  2.570821

lat, long = tuple(map(float, defined_location_latlong.strip('()').split(';')))
print (lat, long)
46.19884 5.209562

s = haversine_np(df['long'], df['lat'], lat, long)
print (s)
0     6016.063040
1     6016.063040
2     6016.063040
3     6016.063040
4     6037.462224
5     6017.186477
6     6015.635700
7     6015.635700
8     6015.635700
9     6353.080382
10    6353.080382
dtype: float64

#km output
df.loc[s < 0.8]

EDIT1:

For improve performance of spliting is possible use:

#550000 rows for test
df = pd.concat([df] * 50000, ignore_index=True)

df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)

print (df)

In [38]: %timeit df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
2.49 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [39]: %timeit df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
937 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Sign up to request clarification or add additional context in comments.

6 Comments

Thank you. Do you know if it's possible in one step, without having to stock every distance ?
It calculates very quickly the distances, but I still have the intermediate step with the saved distances. The dataframe has 700k rows and there is 500k defined locations, that would mean saving 350 billions distances.
@SimbaIsLearning hmm, so need faster df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float) ?
Can you explain the results of: s = haversine_np(df['long'], df['lat'], lat, long)? It seems like there should be more variation in the distances.
@MatthewBorish - Agree, I use simple vectorized haversine, but if OP need it not idea.
|
1
pd.set_option('display.float_format', lambda x: '%.6f' % x)    
from scipy.spatial import KDTree
import pandas as pd
df = pd.read_clipboard()
print(df)

    id  id_latlong
0   1   (46.1988400;5.209562)
1   2   (46.1988400;5.209562)
2   3   (46.1988400;5.209562)
3   4   (46.1988400;5.209562)
4   5   (46.438805;5.11890299)
5   6   (46.222993;5.21707600)
6   7   (46.195183;5.212575)
7   8   (46.195183;5.212575)
8   9   (46.195183;5.212575)
9   10  (48.917459;2.570821)
10  11  (48.917459;2.570821)

Condition the df.

df = df['id_latlong'].str.split(";", expand=True)
df['lat'] = df[0].str.replace('(', '')
df['lon'] = df[1].str.replace(')', '')
df['lat'] = pd.to_numeric(df['lat'])
df['lon'] = pd.to_numeric(df['lon'])
print(df.head(3))

      0          1           lat         lon
0   (46.1988400 5.209562)   46.19884    5.209562
1   (46.1988400 5.209562)   46.19884    5.209562
2   (46.1988400 5.209562)   46.19884    5.209562

Convert to UTM 31 N so we can have meters distance instead of lat/long.

dl_df = pd.DataFrame({'lat':[46.1988400], 'lon':5.209562})
dl_gdf = gpd.GeoDataFrame(dl_df, geometry=gpd.points_from_xy(dl_df.lon, dl_df.lat))

dl_gdf.crs = 4326
dl_gdf = dl_gdf.to_crs(32631)

dl_gdf['E'] = dl_gdf['geometry'].x
dl_gdf['N'] = dl_gdf['geometry'].y

print(dl_gdf)
    lat lon geometry    E   N
0   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))

gdf.crs = 4326
gdf = gdf.to_crs(32631)

gdf['E'] = gdf['geometry'].x
gdf['N'] = gdf['geometry'].y

print(gdf)

0   1   lat lon geometry    E   N
0   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
1   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
2   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
3   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
4   (46.438805  5.11890299) 46.438805   5.118903    POINT (662767.928 5144985.070)  662767.928322   5144985.069816
5   (46.222993  5.21707600) 46.222993   5.217076    POINT (670980.609 5121213.169)  670980.608959   5121213.168557
6   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
7   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
8   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
9   (48.917459  2.570821)   48.917459   2.570821    POINT (468556.965 5418368.922)  468556.964829   5418368.922484
10  (48.917459  2.570821)   48.917459   2.570821    POINT (468556.965 5418368.922)  468556.964829   5418368.922484

Find the distances in meters with KDTree. If there was more than one row in dl_gdf the indices in new_df would be the closest point.

join_cols = ['E', 'N']
tree = KDTree(dl_gdf[join_cols])
distance, indices = tree.query(gdf[join_cols])

new_df = pd.DataFrame({'distance':distance, 'indices': indices})

print(new_df)

    distance    indices
0   0.000000    0
1   0.000000    0
2   0.000000    0
3   0.000000    0
4   27571.018688    0
5   2746.525845 0
6   468.301937  0
7   468.301937  0
8   468.301937  0
9   361503.217161   0
10  361503.217161   0

Get the rows with points < 800m.

less_than_800m_df = new_df.loc[new_df['distance'] < 800]
print(less_than_800m_df)

    distance    indices
0   0.000000    0
1   0.000000    0
2   0.000000    0
3   0.000000    0
6   468.301937  0
7   468.301937  0
8   468.301937  0

Here is a validation image from QGIS. The precision isn't great with the manual measurement, but the results in the new_df look correct.

enter image description here

Here is a closeup for new_df idxs 6,7, and 8.

enter image description here

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.