I have a DataFrame with a geometry column in which each row holds a list of coordinate strings, like this:

   name  OBJECTID  geometry
0   NaN         1  ['-80.304852,-3.489302,0.0','-80.303087,-3.490214,0.0',...]
1   NaN         2  ['-80.27494,-3.496571,0.0',...]
2   NaN         3  ['-80.267987,-3.500003,0.0',...]

I want to split each string into its values and drop the trailing '0.0', while keeping the pairs inside the per-row lists so that I can later add them under a key in a dictionary. The result should look like this:

   name  OBJECTID  geometry
0   NaN         1  [[-80.304852, -3.489302],[-80.303087, -3.490214],...]
1   NaN         2  [[-80.27494, -3.496571],...]
2   NaN         3  [[-80.267987, -3.500003],...]

Here is my attempt, which did not work. I tried to separate the values in a for loop:

import pandas as pd
import numpy as np

r = pd.read_csv('data.csv') 
rloc = np.asarray(r['geometry'])

r['latitude'] = np.zeros(r.shape[0],dtype= r['geometry'].dtype)
r['longitude'] = np.zeros(r.shape[0],dtype= r['geometry'].dtype)

# Separating the latitude and longitude values from each string.
for i in range(0, len(rloc)):
    for j in range(0, len(rloc[i])):
        coord = rloc[i][j].split(',')
        r['longitude'] = coord[0]
        r['latitude'] = coord[1]

r = r[['OBJECTID', 'latitude', 'longitude', 'name']]

Edit: The result is wrong because every row ends up with the same single latitude and longitude value:

  OBJECTID  latitude    longitude   name
0        1  -3.465566   -80.151633  NaN
1        2  -3.465566   -80.151633  NaN
2        3  -3.465566   -80.151633  NaN

Bonus question: How can I put all of these longitude and latitude values into tuples to use with geopy? Something like this:

r['location'] = (r['latitude'], r['longitude'])

So, instead, the geometry column would look like this:

geometry
[(-80.304852, -3.489302),(-80.303087, -3.490214),...]
[(-80.27494, -3.496571),...]
[(-80.267987, -3.500003),...]

Edit:

At first, the data in each row looked like this:

<LineString><coordinates>-80.304852,-3.489302,0.0 -80.303087,-3.490214,0.0 ...</coordinates></LineString>

I modified it with regex, using this code:

import re
geo = np.asarray(r['geometry'])
# Strip the XML tags, leaving only the coordinate text
geo = [re.sub(r'<.*?>', '', string) for string in geo]

And then I placed it in an array:

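# Split each row's text on whitespace into a list of 'lon,lat,alt' strings
rv = [g.split() for g in geo]
# Assign the list of lists directly; np.asarray can fail on rows of unequal length
r['geometry'] = rv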

When I call r['geometry'], the output is:

0    [-80.304852,-3.489302,0.0, -80.303087,-3.49021...
1    [-80.27494,-3.496571,0.0, -80.271963,-3.49266,...
2    [-80.267987,-3.500003,0.0, -80.267845,-3.49789...
Name: geometry, dtype: object

And r['geometry'][0] is:

 ['-80.304852,-3.489302,0.0',
 '-80.303087,-3.490214,0.0',
 '-80.302131,-3.491878,0.0',
 '-80.300763,-3.49213,0.0']
2 Comments

  • What result did you get? Commented Feb 11, 2018 at 19:14
  • Updated with the result! It doesn't work because the list that's inside is removed... I'm trying to find a way around that. Commented Feb 11, 2018 at 19:27

2 Answers

2

A pandas solution with input from a toy data set:

df = pd.read_csv("test.txt")
   name  OBJECTID                                           geometry
0   NaN         1  ['-80.3,-3.4,0.0','-80.3,-3.9,0.0','-80.3,-3.9...
1   NaN         2  ['80.2,-4.4,0.0','-81.3,2.9,0.0','-80.7,-3.2,0...
2   NaN         3  ['-80.1,-3.2,0.0','-80.8,-2.9,0.0','-80.1,-1.9...

Now the transformation into columns of longitude-latitude pairs:

#regex extraction of longitude latitude pairs
pairs = r"(-?\d+\.\d+,-?\d+\.\d+)"
s = df["geometry"].str.extractall(pairs)
#splitting string into two parts, creating two columns for longitude latitude
s = s[0].str.split(",", expand = True)  
#converting strings into float numbers - is this even necessary?
s[[0, 1]] = s[[0, 1]].apply(pd.to_numeric)
#creating a tuple from longitude/latitude columns
s["lat_long"] = list(zip(s[0], s[1]))
#placing the tuples as columns in original dataframe 
df = pd.concat([df, s["lat_long"].unstack(level = -1)], axis = 1)

Output from the toy data set:

   name  OBJECTID                                           geometry  \
0   NaN         1  ['-80.3,-3.4,0.0','-80.3,-3.9,0.0','-80.3,-3.9...   
1   NaN         2  ['80.2,-4.4,0.0','-81.3,2.9,0.0','-80.7,-3.2,0...   
2   NaN         3  ['-80.1,-3.2,0.0','-80.8,-2.9,0.0','-80.1,-1.9...   

               0              1              2  
0  (-80.3, -3.4)  (-80.3, -3.9)  (-80.3, -3.9)  
1   (80.2, -4.4)   (-81.3, 2.9)  (-80.7, -3.2)  
2  (-80.1, -3.2)  (-80.8, -2.9)  (-80.1, -1.9)  

Alternatively, you can combine the tuples in one column as a list:

s["lat_long"] = list(zip(s[0], s[1]))
#placing the tuples as a list into a column of the original dataframe 
df["lat_long"] = s.groupby(level=[0])["lat_long"].apply(list)

Output now:

   name  OBJECTID                                           geometry  \
0   NaN         1  ['-80.3,-3.4,0.0','-80.3,-3.9,0.0','-80.3,-3.9...   
1   NaN         2  ['80.2,-4.4,0.0','-81.3,2.9,0.0','-80.7,-3.2,0...   
2   NaN         3  ['-80.1,-3.2,0.0','-80.8,-2.9,0.0','-80.1,-1.9...   

                                        lat_long  
0  [(-80.3, -3.4), (-80.3, -3.9), (-80.3, -3.9)]  
1    [(80.2, -4.4), (-81.3, 2.9), (-80.7, -3.2)]  
2  [(-80.1, -3.2), (-80.8, -2.9), (-80.1, -1.9)]  
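
The regex approach above assumes that each geometry cell is a single string. If the cells already hold Python lists of strings, as shown at the end of the question, the `.str` methods have nothing to match against. A minimal sketch for that case follows; the helper name `parse_coords`, the column name `lon_lat` and the two sample rows are made up for illustration, and with a very large number of rows a plain Python `apply` like this may be slow.

```python
import pandas as pd

def parse_coords(items):
    """Turn ['lon,lat,alt', ...] into [(lon, lat), ...], dropping the third value."""
    pairs = []
    for item in items:
        lon, lat = item.split(",")[:2]
        pairs.append((float(lon), float(lat)))
    return pairs

# Hypothetical sample: each cell is a list of 'lon,lat,alt' strings of varying length
df = pd.DataFrame({
    "OBJECTID": [1, 2],
    "geometry": [
        ["-80.304852,-3.489302,0.0", "-80.303087,-3.490214,0.0"],
        ["-80.27494,-3.496571,0.0"],
    ],
})

# One list of (longitude, latitude) tuples per row
df["lon_lat"] = df["geometry"].apply(parse_coords)
```

Note that the tuples keep the order of the source strings, which is longitude first.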

7 Comments

It seems like the list 's' is empty. The line 's = df["geometry"].str.extractall(pairs)' is not doing anything and if I try to print s out, I only get an empty dataframe with 0 as the column name.
You should always provide a Minimal, Complete and Verifiable example. I worked with the information given here, but it seems the geometry field differs from your description. Can you upload a sample file and describe how you create the dataframe, so I can adapt the script?
Updated post with new example.
Is it possible that df['geometry'].str.extractall(pairs) doesn't work because the dtype is 'object'?
I updated the code, it works now with unequal lengths, too. The solution comes from Wen, please give him an upvote for his contribution.
1

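In your code, every pass through the inner loop assigns a single scalar to the entire column, so after the loop each row holds the longitude and latitude from the last iteration. Collect the values in a list per row instead; you may also convert the strings to floats: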

# Separating the latitude and longitude values from each string.
longitudes, latitudes = [], []
for row in rloc:
    coords = [element.split(',') for element in row]
    longitudes.append([float(c[0]) for c in coords])
    latitudes.append([float(c[1]) for c in coords])
# Assign whole columns at once instead of writing through r['col'][i]
r['longitude'] = longitudes
r['latitude'] = latitudes
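
To see why the original loop left the same value in every row, here is a small sketch with made-up data (the column name `label` and its values are hypothetical). Assigning a scalar to a column fills all rows, so only the last assignment survives:

```python
import pandas as pd

df = pd.DataFrame({"OBJECTID": [1, 2, 3]})

for value in ["a", "b", "c"]:
    # A scalar on the right-hand side is broadcast to every row,
    # so each pass overwrites the whole column
    df["label"] = value

last_values = df["label"].tolist()
```

After the loop, every row holds the value from the final iteration, which matches the output shown in the question.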

Going for the bonus :)

# Build the converted lists first, then assign the whole column at once
r['geometry'] = [
    [
        (float(element.split(',')[0]), float(element.split(',')[1]))
        for element in row
    ]
    for row in r['geometry']
]
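
One caveat for the geopy part: the strings store longitude first, while geopy's distance functions take points as (latitude, longitude), so the tuples probably need swapping before use. A minimal sketch, where `to_lat_lon` and the sample row are hypothetical names and data, and the geopy call is left as a comment because it needs the package installed:

```python
def to_lat_lon(lon_lat_pairs):
    """Swap (longitude, latitude) tuples into (latitude, longitude) order."""
    return [(lat, lon) for lon, lat in lon_lat_pairs]

# Hypothetical sample row in the (longitude, latitude) order used above
row = [(-80.304852, -3.489302), (-80.303087, -3.490214)]
points = to_lat_lon(row)

# With geopy installed, two consecutive points could then be compared, e.g.:
# from geopy.distance import geodesic
# km = geodesic(points[0], points[1]).km
```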

1 Comment

Thanks! This works! I chose the other one as the answer because it's much more computationally efficient, and I have over one million entries to process.
