Split Spark Dataframe name column into three columns

Question

I have a DataFrame in Spark with a name column, which is a space-delimited string. The tricky part is that some names have a middle name and others don't. How can I split the column into firstname, middlename and lastname? I am using F.split, but I don't know how to tell the middle name from the last name. As far as I understand, I cannot use a negative index in Spark. Here is my sample df:

from pyspark.sql import functions as F
cols = ['id', 'name']
vals = [('l03', 'Bob K Barry'), ('S20', 'Cindy Winston'), ('l10', 'Jerry Kyle Moore'), ('j31', 'Dora Larson')]
df = spark.createDataFrame(vals, cols)
df.show()
+---+----------------+
| id|            name|
+---+----------------+
|l03|     Bob K Barry|
|S20|   Cindy Winston|
|l10|Jerry Kyle Moore|
|j31|     Dora Larson|
+---+----------------+


split_col = F.split(df['name'], ' ')
df = df.withColumn('firstname', split_col.getItem(0))
df.show()
+---+----------------+---------+
| id|            name|firstname|
+---+----------------+---------+
|l03|     Bob K Barry|      Bob|
|S20|   Cindy Winston|    Cindy|
|l10|Jerry Kyle Moore|    Jerry|
|j31|     Dora Larson|     Dora|
+---+----------------+---------+

How do I continue to split? Appreciated.

Surya · Accepted Answer · 2020-10-27 05:26:34Z

Always take the first element of the split array as the first name and the last element (located via size) as the last name. If there can be at most one middle name, you can do:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, when  # explicit names instead of a wildcard import

(df.withColumn("split_list", F.split(F.col("name"), " "))
   .withColumn("fn", col("split_list")[0])
   # the last element sits at index size - 1
   .withColumn("ln", col("split_list")[F.size("split_list") - 1])
   # exactly two parts means there is no middle name
   .withColumn("mn", when(F.size("split_list") == 2, None)
       .otherwise(col("split_list")[1]))
   .drop("split_list").show())
+---+----------------+-----+-------+----+
| id|            name|   fn|     ln|  mn|
+---+----------------+-----+-------+----+
|l03|     Bob K Barry|  Bob|  Barry|   K|
|S20|   Cindy Winston|Cindy|Winston|null|
|l10|Jerry Kyle Moore|Jerry|  Moore|Kyle|
|j31|     Dora Larson| Dora| Larson|null|
+---+----------------+-----+-------+----+

If there can be more than one middle name, take a substring of name for the middle-name column instead:

(df.withColumn("split_list", F.split(F.col("name"), " "))
   .withColumn("fn", col("split_list")[0])
   .withColumn("ln", col("split_list")[F.size("split_list") - 1])
   # mn starts one space after fn; its length excludes fn, ln and two spaces
   .withColumn("mn", when(F.size("split_list") == 2, None)
       .otherwise(col("name").substr(
           F.length("fn") + 2,
           F.length("name") - F.length("fn") - F.length("ln") - 2)))
   .drop("split_list").show())
+---+----------------+-----+-------+-----+
| id|            name|   fn|     ln|   mn|
+---+----------------+-----+-------+-----+
|l03|     Bob K Barry|  Bob|  Barry|    K|
|S20|   Cindy Winston|Cindy|Winston| null|
|l10|Jerry Kyle Moore|Jerry|  Moore| Kyle|
|j31|     Dora Larson| Dora| Larson| null|
|A12|     Fn A B C Ln|   Fn|     Ln|A B C|
+---+----------------+-----+-------+-----+

I'm assuming that the first name (FN) is the first element, the last name (LN) is the last element, and anything in between is the middle name (MN). This is not always true, since people can have multi-word first or last names.
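
The first/last-token assumption can also be written down as a small plain-Python function, which may help when checking expected results on a handful of names before running the Spark job. This is only a sketch: `split_name` is a hypothetical helper, not part of PySpark, and it makes two choices that differ from the Spark code above (it splits on any run of whitespace, and it returns no last name for a single-token name).

```python
from typing import Optional, Tuple


def split_name(name: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (first, middle, last), treating the first token as the first
    name, the last token as the last name, and everything between as middle."""
    # str.split() with no argument splits on any whitespace run,
    # unlike F.split(name, " "), which splits on each single space
    tokens = name.split()
    if not tokens:
        return (None, None, None)
    first = tokens[0]
    # A single-token name has no separate last name (a choice made here)
    last = tokens[-1] if len(tokens) > 1 else None
    # Join all tokens strictly between first and last; empty means no middle name
    middle = " ".join(tokens[1:-1]) or None
    return (first, middle, last)


for sample in ["Bob K Barry", "Cindy Winston", "Fn A B C Ln"]:
    print(sample, "->", split_name(sample))
```

The same function could be wrapped in a UDF, but the built-in column functions shown in the answer are generally the better fit inside Spark.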

I checked my dataset: when a middle name is present, there is only one. I tried your method and it works like a charm, thank you.
