I have a DataFrame (the real one has more rows and columns) as shown below.
Sample DF:
from pyspark.sql import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# | col1| col2| col3|
# +------+-------------+------+
# | z1| [a1, b2, c3]| foo|
# +------+-------------+------+
df
# DataFrame[col1: string, col2: string, col3: string]
What I want:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| z1| a1| foo|
| z1| b2| foo|
| z1| c3| foo|
+-----+-----+-----+
I tried to replicate the RDD solution provided here: Pyspark: Split multiple array columns into rows
(df
.rdd
.flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2])
.toDF(["col1", "col2", "col3"]))
However, it does not give the required result: col2 is a string, so the flatMap iterates over it character by character rather than over array elements.
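For reference, the RDD route can work if the string is parsed first. A minimal sketch, assuming every col2 value is formatted exactly like '[a1, b2, c3]':
(df
 .rdd
 .flatMap(lambda row: [(row.col1, v.strip(), row.col3)             # one output row per element
                       for v in row.col2.strip('[]').split(',')])  # drop brackets, split on commas
 .toDF(["col1", "col2", "col3"])
 .show())
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |  z1|  a1| foo|
# |  z1|  b2| foo|
# |  z1|  c3| foo|
# +----+----+----+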
Edit: The explode option does not work on its own because col2 is currently stored as a string and explode expects an array. What I need is explode after converting the string to an array, which can be done with split and regexp_replace.
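A minimal sketch of that DataFrame route, again assuming the '[a1, b2, c3]' format (regexp_replace strips the brackets, split turns the string into an array column, and explode produces one row per element):
from pyspark.sql.functions import explode, regexp_replace, split

(df
 .withColumn('col2', explode(split(regexp_replace('col2', r'[\[\]]', ''), ', ')))
 .show())
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# |  z1|  a1| foo|
# |  z1|  b2| foo|
# |  z1|  c3| foo|
# +----+----+----+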