I have a Spark DataFrame with a column called name. Each value is a space-delimited string. The tricky part is that some names include a middle name and others don't. How can I split the column into firstname, middlename and lastname? I am using F.split, but I don't know how to tell the middle name from the last name, and as far as I know I can't use a negative index in Spark. Here is a sample df:

from pyspark.sql import functions as F
cols = ['id', 'name']
vals = [('l03', 'Bob K Barry'), ('S20', 'Cindy Winston'), ('l10', 'Jerry Kyle Moore'), ('j31', 'Dora Larson')]
df = spark.createDataFrame(vals, cols)
df.show()
+---+----------------+
| id|            name|
+---+----------------+
|l03|     Bob K Barry|
|S20|   Cindy Winston|
|l10|Jerry Kyle Moore|
|j31|     Dora Larson|
+---+----------------+


split_col = F.split(df['name'], ' ')
df = df.withColumn('firstname', split_col.getItem(0))
df.show()
+---+----------------+---------+
| id|            name|firstname|
+---+----------------+---------+
|l03|     Bob K Barry|      Bob|
|S20|   Cindy Winston|    Cindy|
|l10|Jerry Kyle Moore|    Jerry|
|j31|     Dora Larson|     Dora|
+---+----------------+---------+

How do I go on to extract the middle name and the last name? Any help is appreciated.

1 Answer

Always treat the first element of the split array as the first name and the last element as the last name (use size to find the last index). If there can be at most one middle name, you can do:

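from pyspark.sql import functions as F
from pyspark.sql.functions import col, when  # explicit names instead of a wildcard import

# Split once, then take the first token, the last token (index size - 1),
# and the second token only when the name has more than two tokens.
(df.withColumn("split_list", F.split(F.col("name"), " "))
   .withColumn("fn", col("split_list")[0])
   .withColumn("ln", col("split_list")[F.size("split_list") - 1])
   .withColumn("mn", when(F.size("split_list") > 2, col("split_list")[1]))
   .drop("split_list")
   .show())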
+---+----------------+-----+-------+----+
| id|            name|   fn|     ln|  mn|
+---+----------------+-----+-------+----+
|l03|     Bob K Barry|  Bob|  Barry|   K|
|S20|   Cindy Winston|Cindy|Winston|null|
|l10|Jerry Kyle Moore|Jerry|  Moore|Kyle|
|j31|     Dora Larson| Dora| Larson|null|
+---+----------------+-----+-------+----+

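If there can be more than one middle name, take a substring of name for the middle name column instead (the output below includes an extra row, A12, that is not in the sample df, to show several middle names):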

# The middle part starts at 1-based position len(fn) + 2 (just past the first space);
# its length is the full length minus fn, ln and the two separating spaces.
(df.withColumn("split_list", F.split(F.col("name"), " "))
   .withColumn("fn", col("split_list")[0])
   .withColumn("ln", col("split_list")[F.size("split_list") - 1])
   .withColumn("mn", when(F.size("split_list") > 2,
       col("name").substr(F.length("fn") + 2,
                          F.length("name") - F.length("fn") - F.length("ln") - 2)))
   .drop("split_list")
   .show())
+---+----------------+-----+-------+-----+
| id|            name|   fn|     ln|   mn|
+---+----------------+-----+-------+-----+
|l03|     Bob K Barry|  Bob|  Barry|    K|
|S20|   Cindy Winston|Cindy|Winston| null|
|l10|Jerry Kyle Moore|Jerry|  Moore| Kyle|
|j31|     Dora Larson| Dora| Larson| null|
|A12|     Fn A B C Ln|   Fn|     Ln|A B C|
+---+----------------+-----+-------+-----+

This assumes that the first name is the first element, the last name is the last element, and everything in between is the middle name. That is not always true, since people can have multi-word first or last names.
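
To see where that assumption breaks, the rule can be written as a plain Python function and tried on a few names before running it over the whole column. This is only a sketch: split_name is a hypothetical helper, and the last two names are made-up examples of a two-word first name and a two-word surname.

```python
from typing import Optional, Tuple


def split_name(name: str) -> Tuple[str, Optional[str], str]:
    """Apply the rule: first token, everything in between (or None), last token."""
    parts = name.split(" ")
    middle = " ".join(parts[1:-1]) or None  # empty middle part becomes None
    return parts[0], middle, parts[-1]


print(split_name("Jerry Kyle Moore"))   # ('Jerry', 'Kyle', 'Moore')
print(split_name("Dora Larson"))        # ('Dora', None, 'Larson')

# If "Mary Ann" is the first name, "Ann" is wrongly reported as a middle name.
print(split_name("Mary Ann Smith"))     # ('Mary', 'Ann', 'Smith')

# If "Garcia Lopez" is the surname, "Garcia" is wrongly reported as a middle name.
print(split_name("Juan Garcia Lopez"))  # ('Juan', 'Garcia', 'Lopez')
```

Whitespace alone cannot distinguish these cases, so data like this needs extra information (for example separate source fields) rather than a smarter split.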

1 Comment

I checked my dataset: whenever there is a middle name, there is only one. I tried your method and it works like a charm, thank you.
