
I have a column with data like this:

[[[-77.1082606, 38.935738]] ,Point] 

I want it split out like:

  column 1          column 2        column 3
 -77.1082606      38.935738           Point

How can I do this in PySpark, or alternatively Scala (Databricks 3.0)? I know how to explode columns, but not how to split up these structs. Thanks!

EDIT: Here is the schema of the column:

|-- geometry: struct (nullable = true)
 |    |-- coordinates: string (nullable = false)
 |    |-- type: string (nullable = false)
What's the type? array<array<>>? Please post the result of printSchema. Commented Sep 20, 2017 at 20:24

1 Answer


You can use regexp_replace() to get rid of the square brackets, and then split() the resulting string by the comma into separate columns.

from pyspark.sql.functions import regexp_replace, split, col

# Strip the square brackets from the coordinates string, then split on the comma.
df.select(regexp_replace(df.geometry.coordinates, r"[\[\]]", "").alias("coordinates"),
          df.geometry.type.alias("col3")) \
  .withColumn("arr", split(col("coordinates"), ",")) \
  .select(col("arr")[0].alias("col1"),
          col("arr")[1].alias("col2"),
          "col3") \
  .show(truncate=False)
+-----------+----------+-----+
|col1       |col2      |col3 |
+-----------+----------+-----+
|-77.1082606| 38.935738|Point|
+-----------+----------+-----+
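The regex does the heavy lifting here: the character class [\[\]] matches either bracket, so regexp_replace strips all of them before the split. A quick sanity check of the same pattern in plain Python (outside Spark), using a sample string matching the question's data:

```python
import re

# Sample coordinates string in the same shape as the question's data.
raw = "[[-77.1082606, 38.935738]]"

# Remove every '[' and ']' character, mirroring regexp_replace in the answer.
cleaned = re.sub(r"[\[\]]", "", raw)   # "-77.1082606, 38.935738"

# Split on the comma and strip the leading space from the second value.
col1, col2 = [s.strip() for s in cleaned.split(",")]
print(col1, col2)  # -77.1082606 38.935738
```

Note that Spark's split() leaves the leading space on the second element (hence " 38.935738" in the output above); wrap the values in trim() if you need them clean.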

4 Comments

I couldn't recall the syntax - you were faster :D +1 and I suggest @AshleyO to also give +1 and accept :)
I should have been more clear, the data is all in one struct. I've edited to display the information more clearly. I'm testing to see if this concept can help though
so you have ["[[-77.1082606, 38.935738]]" ,"Point"] ?
Correct. All in one column
