
I have a dataframe (with more rows and columns) as shown below.

Sample DF:

from pyspark.sql import Row, SQLContext
from pyspark.sql.functions import explode

# sc is the existing SparkContext (e.g. the one created by the pyspark shell)
sqlc = SQLContext(sc)

df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# |  col1|         col2|  col3|
# +------+-------------+------+
# |    z1| [a1, b2, c3]|   foo|
# +------+-------------+------+

df
# DataFrame[col1: string, col2: string, col3: string]

What I want:

+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
|   z1|   a1|  foo|
|   z1|   b2|  foo|
|   z1|   c3|  foo|
+-----+-----+-----+

I tried to replicate the RDD solution provided here: Pyspark: Split multiple array columns into rows

(df
    .rdd
    .flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2])
    .toDF(["col1", "col2", "col3"]))

However, it does not give the required result: because col2 is stored as a plain string, the list comprehension iterates over its individual characters rather than over three elements (see the sketch below).
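A quick check on the sample df (a sketch, using the single-row DataFrame defined above):

row = df.first()
[(row.col1, c, row.col3) for c in row.col2][:4]
# [('z1', '[', 'foo'), ('z1', 'a', 'foo'), ('z1', '1', 'foo'), ('z1', ',', 'foo')]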

Edit: The explode option does not work on its own because col2 is currently stored as a string, and the explode function expects an array (see the error sketch below).
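For illustration, applying explode directly to the string column fails with a type-mismatch error (a sketch; the exact AnalysisException wording varies by Spark version):

df.withColumn("col2", explode("col2")).show()
# AnalysisException: cannot resolve 'explode(`col2`)' due to data type mismatch:
# input to function explode should be array or map type, not string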

  • You're looking for explode after converting the string to an array, which can be done with split and regexp_replace. Commented Jul 17, 2019 at 2:27
  • Possible duplicate of Explode array data into rows in spark Commented Jul 17, 2019 at 2:28
  • Example of how to convert the string to an array here Commented Jul 17, 2019 at 2:32
  • @pault - explode will not work since all my columns are stored as string Commented Jul 17, 2019 at 15:54

2 Answers


You can use explode but first you'll have to convert the string representation of the array into an array.

One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".

from pyspark.sql.functions import col, explode, regexp_replace, split

df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|  z1|  a1| foo|
#|  z1|  b2| foo|
#|  z1|  c3| foo|
#+----+----+----+


Pault's solution should work perfectly fine, although here is another solution which uses regexp_extract instead (you don't really need to replace anything in this case), and it can handle an arbitrary number of spaces:

from pyspark.sql.functions import col, explode, regexp_extract, regexp_replace, split

df.withColumn(
    "col2",
    explode(
        split(
            regexp_extract(
                regexp_replace(col("col2"), r"\s", ""), r"^\[(.*)\]$", 1),
            ","))
).show()

Explanation:

  • Initially, regexp_replace(col("col2"), r"\s", "") replaces all whitespace with the empty string.
  • Next, regexp_extract extracts the content of the column that starts with [ and ends with ], keeping only what the capture group matches between the brackets.
  • Then we split on the commas and finally explode.

3 Comments

you don't really need to replace anything in this case, but your first bullet is replacing spaces. You should be able to get around this and maintain support for an arbitrary number of spaces by splitting on ", +" (a sketch of that variant follows these comments).
Actually I am removing spaces, pault. What I meant is that even if you can achieve the same with regexp_replace, the goal here is to extract a string, hence regexp_extract can be more suitable.
Sure, but what if the array was [a1, an element with spaces, c3]? Only OP can provide the requirements, which isn't to say your method is not useful; it does work for the given test case after all (+1).
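A minimal sketch of the variant pault suggests above, splitting on a comma followed by one or more spaces so the spacing between elements does not matter (it assumes the elements themselves contain no commas):

from pyspark.sql.functions import col, explode, regexp_replace, split

df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", +"))
).show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |  z1|  a1| foo|
# |  z1|  b2| foo|
# |  z1|  c3| foo|
# +----+----+----+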
