
I have a dataframe (with more rows and columns) as shown below.

Sample DF:

from pyspark.sql import Row, SQLContext
from pyspark.sql.functions import explode

# sc is the existing SparkContext (e.g. the one created by the pyspark shell)
sqlc = SQLContext(sc)

df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# |  col1|         col2|  col3|
# +------+-------------+------+
# |    z1| [a1, b2, c3]|   foo|
# +------+-------------+------+

df
# DataFrame[col1: string, col2: string, col3: string]

What I want:

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|  foo|
|   z1|   b2|  foo|
|   z1|   c3|  foo|
+-----+-----+-----+

I tried to replicate the RDD solution provided here: Pyspark: Split multiple array columns into rows

(df
    .rdd
    .flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2])
    .toDF(["col1", "col2", "col3"]))

However, it does not give the required result: because col2 is stored as a plain string, the list comprehension iterates over its individual characters rather than over three elements (see the sketch below).
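A quick check on the sample df (a sketch, using the single-row DataFrame defined above):

row = df.first()
[(row.col1, c, row.col3) for c in row.col2][:4]
# [('z1', '[', 'foo'), ('z1', 'a', 'foo'), ('z1', '1', 'foo'), ('z1', ',', 'foo')]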

Edit: The explode option does not work on its own because col2 is currently stored as a string, and the explode function expects an array (see the error sketch below).
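For illustration, applying explode directly to the string column fails with a type-mismatch error (a sketch; the exact AnalysisException wording varies by Spark version):

df.withColumn("col2", explode("col2")).show()
# AnalysisException: cannot resolve 'explode(`col2`)' due to data type mismatch:
# input to function explode should be array or map type, not string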

  • You're looking for explode after converting the string to an array, which can be done with split and regexp_replace. Commented Jul 17, 2019 at 2:27
  • Possible duplicate of Explode array data into rows in spark Commented Jul 17, 2019 at 2:28
  • Example of how to convert the string to an array here Commented Jul 17, 2019 at 2:32
  • @pault - explode will not work since all my columns are stored as string Commented Jul 17, 2019 at 15:54

2 Answers


You can use explode but first you'll have to convert the string representation of the array into an array.

One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".

from pyspark.sql.functions import col, explode, regexp_replace, split

df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|  z1|  a1| foo|
#|  z1|  b2| foo|
#|  z1|  c3| foo|
#+----+----+----+


Pault's solution should work perfectly fine, although here is another solution which uses regexp_extract instead (you don't really need to replace anything in this case), and it can handle an arbitrary number of spaces:

from pyspark.sql.functions import col, explode, regexp_extract, regexp_replace, split

df.withColumn(
    "col2",
    explode(
        split(
            regexp_extract(
                regexp_replace(col("col2"), r"\s", ""), r"^\[(.*)\]$", 1),
            ","))
).show()

Explanation:

  • Initially, regexp_replace(col("col2"), r"\s", "") replaces all whitespace with the empty string.
  • Next, regexp_extract extracts the content of the column that starts with [ and ends with ], keeping only what the capture group matches between the brackets.
  • Then we split on the commas and finally explode.

3 Comments

you don't really need to replace anything in this case, but your first bullet is replacing spaces. You should be able to get around this and maintain support for an arbitrary number of spaces by splitting on ", +" (a sketch of that variant follows these comments).
Actually I am removing spaces, pault. What I meant is that even if you can achieve the same with regexp_replace, the goal here is to extract a string, hence regexp_extract can be more suitable.
Sure, but what if the array was [a1, an element with spaces, c3]? Only OP can provide the requirements, which isn't to say your method is not useful; it does work for the given test case after all (+1).
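A minimal sketch of the variant pault suggests above, splitting on a comma followed by one or more spaces so the spacing between elements does not matter (it assumes the elements themselves contain no commas):

from pyspark.sql.functions import col, explode, regexp_replace, split

df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", +"))
).show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |  z1|  a1| foo|
# |  z1|  b2| foo|
# |  z1|  c3| foo|
# +----+----+----+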
