
First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)

I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.

Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):

start lat  start long    end_lat   end_long
0  38.902760  -77.038630  38.880300 -76.986200
2  38.895914  -77.026064  38.915400 -77.044600
3  38.888251  -77.049426  38.895914 -77.026064
4  38.892300  -77.043600  38.888251 -77.049426

I've set up a function:

def dist_calc(st_lat, st_long, fin_lat, fin_long):
   from geopy.distance import vincenty
   start = (st_lat, st_long)
   end = (fin_lat, fin_long)
   return vincenty(start, end).miles

This one works fine when given manual input.

However, when I try to apply() the function, I run into trouble with the below code:

distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)

I'm fairly new to python, any help will be much appreciated!

Edit: error message:

distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
    ignore_failures=ignore_failures)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
    results[i] = func(v)
  File "<stdin>", line 1, in <lambda>
  File "<stdin>", line 5, in dist_calc2
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
    super(vincenty, self).__init__(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
    kilometers += self.measure(a, b)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
    u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')
  • If I copy your data and run your code, it works. Something must be different in your data and/or code. Also, why don't you share what trouble you're having? And read up on MCVE. It will help us help you. Commented Oct 14, 2017 at 5:57
  • "I run into trouble"? What trouble??? What possesses people to post such vagueness? Commented Oct 14, 2017 at 5:57
  • @Ribzy, what is the error message you get? Commented Oct 14, 2017 at 8:08
  • Sorry, it threw so many that I thought it was an error in the syntax. I've added it now. Commented Oct 14, 2017 at 8:24
  • Hi, I originally had the same error when I copied the data at the top of your post and used df = pd.read_clipboard() to set up the dataframe. The spaces in the column names messed up the dataframe (I guess read_clipboard() thought they were separate column names). Once I manually fixed this, it worked fine. So my guess is there is something wrong with your data. Also, geopy should throw a more user-friendly error when passed something unexpected (a NaN in my case). Commented Oct 14, 2017 at 12:59

1 Answer

The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:

In [23]: df = pd.read_clipboard()

In [24]: df
Out[24]:
   start        lat    start.1       long    end_lat  end_long
0      0  38.902760 -77.038630  38.880300 -76.986200       NaN
1      2  38.895914 -77.026064  38.915400 -77.044600       NaN
2      3  38.888251 -77.049426  38.895914 -77.026064       NaN
3      4  38.892300 -77.043600  38.888251 -77.049426       NaN

In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')

Notice column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
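
As a rough illustration of why the columns shift, here is a stdlib-only sketch that splits the pasted lines on runs of whitespace. It assumes plain str.split() is a fair stand-in for whitespace-delimited parsing; it is not pandas' actual parser.

```python
# Header and first data row, copied from the question.
header = "start lat  start long    end_lat   end_long"
row = "0  38.902760  -77.038630  38.880300 -76.986200"

# Splitting on any run of whitespace breaks "start lat" and
# "start long" into two tokens each.
header_fields = header.split()
row_fields = row.split()

print(len(header_fields), header_fields)
print(len(row_fields), row_fields)
```

The header yields six names, but each row only has five values (the index label plus four coordinates), which is consistent with the shifted columns and the all-NaN last column shown above.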

It's usually better to fix this before the data gets imported as a dataframe. I can think of 2 methods:

  1. Clean the data before importing, for example copy it into an editor and replace the offending spaces with underscores. This is the easiest.
  2. Use a regex to fix it during import. This may be necessary if the dataset is very large, or if it is pulled from a website and has to be refreshed regularly.

Here's an example of case (2):

In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')

In [36]: df = df.rename(columns={'start lat': 'start_lat', 'start long': 'start_long'})

In [37]: df
Out[37]:
   start_lat  start_long    end_lat   end_long
0  38.902760  -77.038630  38.880300 -76.986200
2  38.895914  -77.026064  38.915400 -77.044600
3  38.888251  -77.049426  38.895914 -77.026064
4  38.892300  -77.043600  38.888251 -77.049426

The regex specifies that a separator is either 2+ whitespace characters, or 1 whitespace character followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
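
To see what that pattern does without going through pandas, here is a small sketch using the standard re module on the header and first row from the question. The helper name split_fields is made up for illustration.

```python
import re

# Same separator pattern as passed to read_clipboard(sep=...):
# two or more whitespace characters, or one whitespace character
# that is immediately followed by a minus sign.
SEP = re.compile(r"\s{2,}|\s(?=-)")

header = "start lat  start long    end_lat   end_long"
row = "0  38.902760  -77.038630  38.880300 -76.986200"


def split_fields(line):
    """Split one pasted line into fields using the separator regex."""
    return SEP.split(line.strip())


print(split_fields(header))
print(split_fields(row))
```

The single space inside "start lat" is left alone because it is neither doubled nor followed by a minus sign. Note that the pattern depends on this particular layout: a single space between two positive values would not be treated as a separator.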

From this point your function / apply works fine, but I've changed it a little:

  • PEP8 recommends putting imports at the top of each file, rather than in a function
  • Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.

For example:

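In [50]: from geopy.distance import vincenty  # removed in geopy 2.0; geodesic replaces it there

In [51]: def dist_calc(row):
    ...:    # select coordinates by column name rather than by position
    ...:    start = row[['start_lat','start_long']]
    ...:    end = row[['end_lat', 'end_long']]
    ...:    return vincenty(start, end).miles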
    ...:

In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0    3.223232
2    1.674780
3    1.365851
4    0.420305
dtype: float64
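
The question mentions missing values later in the dataframe, and geopy's error is not very informative when it receives one. Below is a minimal sketch of one way to guard against that, using only the standard library. The names safe_distance, distance_fn and flat_distance are hypothetical and not part of geopy or pandas; flat_distance is only a stand-in so the example runs without geopy.

```python
import math


def safe_distance(distance_fn, start, end):
    """Return distance_fn(start, end), or NaN when any coordinate is missing."""
    coords = tuple(start) + tuple(end)
    if any(c is None or math.isnan(c) for c in coords):
        return float("nan")
    return distance_fn(start, end)


def flat_distance(start, end):
    """Stand-in for demonstration only: straight-line distance in degrees."""
    return math.hypot(end[0] - start[0], end[1] - start[1])


complete = safe_distance(flat_distance, (0.0, 0.0), (3.0, 4.0))
missing = safe_distance(flat_distance, (38.9, float("nan")), (38.8, -76.9))
print(complete, missing)
```

With geopy, distance_fn could be something like lambda a, b: vincenty(a, b).miles. Dropping incomplete rows up front with df.dropna(subset=[...]) is another option if rows without coordinates are not needed.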

1 Comment

Thank you very much. It helped getting me there in the end. It does throw the same error when exposed to NaN. So will need to put in a provision for this.
