
In PySpark, how do I split the strings in all columns into lists of strings?

a = [('a|q|e', 'd|r|y'), ('j|l|f', 'm|g|j')]
df = spark.createDataFrame(a, ['col1', 'col2'])  # createDataFrame lives on the SparkSession, not the SparkContext
df.show()

+-----+-----+
| col1| col2|
+-----+-----+
|a|q|e|d|r|y|
|j|l|f|m|g|j|
+-----+-----+

Output expected:

+---------+---------+
|     col1|     col2|
+---------+---------+
|[a, q, e]|[d, r, y]|
|[j, l, f]|[m, g, j]|
+---------+---------+

I can do it one column at a time using withColumn, but that is not an appealing solution with a dynamic number of columns.

from pyspark.sql.functions import col, split
outDF = df.withColumn("col1", split(col("col1"), "\\|"))

1 Answer

One option is to build a list of column expressions first, then pass them to the select method with varargs syntax:

from pyspark.sql.functions import col, split 
cols = ['col1', 'col2']                                               # columns to split
col_exprs = [split(col(x), "\\|").alias(x) for x in cols]
df.select(*col_exprs).show()
+---------+---------+
|     col1|     col2|
+---------+---------+
|[a, q, e]|[d, r, y]|
|[j, l, f]|[m, g, j]|
+---------+---------+
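If the set of columns is not known in advance, the same expression list can be built from the DataFrame itself (e.g. from df.columns). A minimal sketch of assembling the equivalent SQL expression strings for selectExpr; the `df` object is assumed from the question, and the string-building part runs without Spark:

```python
# Build "split(col, '\\|') AS col" expression strings for every column.
cols = ['col1', 'col2']          # or, dynamically: cols = df.columns
exprs = [f"split({c}, '\\\\|') AS {c}" for c in cols]
print(exprs)
# df.selectExpr(*exprs).show()   # same result as the select(*col_exprs) version
```

Note the extra level of escaping: the Spark SQL parser also treats backslash as an escape character inside string literals, so the expression text needs '\\|' to reach the regex engine as \|.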

Another option is to use reduce from functools with withColumn to create the new columns dynamically:

from functools import reduce
reduce(
    lambda df, colname: df.withColumn(colname, split(col(colname), "\\|").alias(colname)), 
    cols, 
    df
).show()
+---------+---------+
|     col1|     col2|
+---------+---------+
|[a, q, e]|[d, r, y]|
|[j, l, f]|[m, g, j]|
+---------+---------+

reduce(lambda df, colname: df.withColumn(colname, split(col(colname), "\\|").alias(colname)), cols, df).explain()
# == Physical Plan ==
# *Project [split(col1#0, \|) AS col1#76, split(col2#1, \|) AS col2#81]
# +- Scan ExistingRDD[col1#0,col2#1]
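The functools.reduce call simply threads the DataFrame through withColumn once per column name, i.e. it evaluates f(f(df, 'col1'), 'col2'). A small pure-Python sketch of the same fold, using a plain dict as a hypothetical stand-in for a row, just to illustrate the pattern:

```python
from functools import reduce

# Stand-in "row": each value is a pipe-delimited string.
row = {'col1': 'a|q|e', 'col2': 'd|r|y'}
cols = ['col1', 'col2']

# Same fold shape as reduce(lambda df, c: df.withColumn(...), cols, df):
# the accumulator (here a dict, there a DataFrame) is rebuilt once per column.
result = reduce(lambda acc, c: {**acc, c: acc[c].split('|')}, cols, row)
print(result)  # {'col1': ['a', 'q', 'e'], 'col2': ['d', 'r', 'y']}
```

As the physical plan above shows, Catalyst collapses the chained withColumn calls into a single Project, so both approaches compile to the same work; the select variant is simply tidier when the column list is long.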

2 Comments

Can someone explain why the double backslashes in the split function? Thanks!
split takes a Java regex, and | is a special character (alternation) there, so it must be escaped as \|; the doubled backslash is just the Python string-literal escape for a single backslash. @Luke
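To see the escaping at work without Spark, Python's re module follows the same regex convention: an unescaped | is alternation (here, between two empty patterns), while \| matches a literal pipe character:

```python
import re

# Escaped: \| matches a literal pipe.
print(re.split('\\|', 'a|q|e'))   # ['a', 'q', 'e']
# A raw string spells the same pattern more readably.
print(re.split(r'\|', 'a|q|e'))   # ['a', 'q', 'e']
# Unescaped: '|' is alternation between two empty patterns, so it
# matches the empty string at every position (Python >= 3.7).
print(re.split('|', 'a|q|e'))     # ['', 'a', '|', 'q', '|', 'e', '']
```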
