
I have the following dataframe, where mmsi is a unique ID. I would like to combine the lat and lon arrays of each row into a single list:

+---------+--------------------+--------------------+
|     mmsi|                 lat|                 lon|
+---------+--------------------+--------------------+
|255801480|[47.1018366666666...|[-5.3017783333333...|
|304182000|[44.6343033333333...|[-63.564803333333...|
|304682000|[41.1936, 41.1715...|[-8.7716, -8.7514...|
|305930000|[49.5221333333333...|[-3.6310166666666...|
|306216000|[42.8185133333333...|[-29.853155, -29....|
|477514400|[47.17205, 47.165...|[-58.6317, -58.60...|

Specifically, I would like to concatenate the lat and lon arrays along axis = 1, so that each row ends up with a list of [lat, lon] pairs in a separate column, like:

[[47.1018366666666, -5.3017783333333], ... ]

How can this be done with a PySpark dataframe? I have tried concat, but that returns:

[47.1018366666666, 44.6343033333333, ..., -5.3017783333333, -63.564803333333, ...]
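To make the difference concrete, here is a plain-Python sketch (not Spark code) of the two behaviours. The short values are made up for illustration and are not taken from the dataframe above.

```python
# Illustrative values only; the real arrays hold many more points.
lat = [47.10, 47.11, 47.12]
lon = [-5.30, -5.31, -5.32]

# What concat effectively does: one array appended to the other.
flat = lat + lon

# What is wanted: element-wise pairing, i.e. a list of [lat, lon] lists.
pairs = [[la, lo] for la, lo in zip(lat, lon)]

print(flat)
print(pairs)
```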

Any help is much appreciated!

1 Answer

Starting with Spark 2.4, you can use the built-in function arrays_zip.

from pyspark.sql.functions import arrays_zip

# Pair lat[i] with lon[i]; each element of the new column is a struct with fields lat and lon.
df.withColumn('zipped_lat_lon', arrays_zip(df.lat, df.lon)).show()

2 Comments

Thank you @Vamsi for your answer. I just noticed that I am getting back: [{"lat":"47.101836666666664","lon":"-5.301778333333333"},{.... How can I remove the "lat" and "lon"?
The function returns an array of named structs, which is why the field names show up. If you post the code that actually produced that output, I can take a look.
