How to apply string operation on Spark DataFrame in Java

Question

I have a Spark DataFrame, which looks like this:

+--------------------+------+----------------+-----+--------+
|         Name       |   Sex|        Ticket  |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Braund, Mr. Owen ...|  male|       A/5 21171| null|       S|
|Cumings, Mrs. Joh...|female|        PC 17599|  C85|       C|
|Heikkinen, Miss. ...|female|STON/O2. 3101282| null|       S|
|Futrelle, Mrs. Ja...|female|          113803| C123|       S|
|Palsson, Master. ...|  male|          349909| null|       S|
+--------------------+------+----------------+-----+--------+

Now I need to reduce the 'Name' column so that it contains only the title, i.e. Mr., Mrs., Miss., or Master. The resulting DataFrame would be:

+--------------------+------+----------------+-----+--------+
|         Name       |   Sex|        Ticket  |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Mr.                 |  male|       A/5 21171| null|       S|
|Mrs.                |female|        PC 17599|  C85|       C|
|Miss.               |female|STON/O2. 3101282| null|       S|
|Mrs.                |female|          113803| C123|       S|
|Master.             |  male|          349909| null|       S|
+--------------------+------+----------------+-----+--------+

I tried to apply a substring operation:

List<String> list = Arrays.asList("Mr.", "Mrs.", "Miss.", "Master.");
Dataset<Row> categoricalDF2 = categoricalDF.filter(col("Name").isin(list.stream().toArray(String[]::new)));

but it seems it's not that easy in Java. How can I do this in Java? Please note that I'm using Spark 2.2.0.

Martin · Accepted Answer · 2018-03-24 13:39:21Z

I finally managed to solve it, so here is the answer to my own question. I extended Mohit's answer by wrapping the logic in a UDF:

import org.apache.spark.sql.api.java.UDF1;
import scala.Option;
import scala.Some;

// UDF that maps a full name to the title it contains, or "Untitled" if none matches.
private static final UDF1<String, Option<String>> getTitle = (String name) -> {
    if (name.contains("Mr.")) { // If it has Mr.
        return Some.apply("Mr.");
    } else if (name.contains("Mrs.")) { // Or if it has Mrs.
        return Some.apply("Mrs.");
    } else if (name.contains("Miss.")) { // Or if it has Miss.
        return Some.apply("Miss.");
    } else if (name.contains("Master.")) { // Or if it has Master.
        return Some.apply("Master.");
    } else { // None of the above.
        return Some.apply("Untitled");
    }
};

Then I had to register the preceding UDF as follows:

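import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().master("local[*]")
                    .config("spark.sql.warehouse.dir", "/home/martin/")
                    .appName("Titanic")
                    .getOrCreate();
Dataset<Row> df = ....
spark.sqlContext().udf().register("getTitle", getTitle, DataTypes.StringType);
Dataset<Row> categoricalDF = df.select(callUDF("getTitle", col("Name")).alias("Name"), col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
categoricalDF.show();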

The preceding code produces the following output:

+-----+------+----------------+-----+--------+
| Name|   Sex|          Ticket|Cabin|Embarked|
+-----+------+----------------+-----+--------+
|  Mr.|  male|       A/5 21171| null|       S|
| Mrs.|female|        PC 17599|  C85|       C|
|Miss.|female|STON/O2. 3101282| null|       S|
| Mrs.|female|          113803| C123|       S|
|  Mr.|  male|          373450| null|       S|
+-----+------+----------------+-----+--------+
only showing top 5 rows
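
As a side note, the title lookup itself can also be written with a regular expression and tested without Spark. The following is only a minimal sketch: the class name TitleExtractor and the method name extractTitle are made up for illustration and are not part of the original code. Unlike the UDF above, which would throw a NullPointerException if it were handed a null name, this version treats null as "Untitled".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the original answer): plain-Java title extraction.
final class TitleExtractor {
    // One of the four titles from the question, followed by a literal dot.
    private static final Pattern TITLE = Pattern.compile("\\b(Mr|Mrs|Miss|Master)\\.");

    // Returns the first title found in the name, or "Untitled" if there is none.
    static String extractTitle(String name) {
        if (name == null) {
            return "Untitled"; // assumption: a missing name is treated like "no title"
        }
        Matcher m = TITLE.matcher(name);
        return m.find() ? m.group() : "Untitled";
    }

    public static void main(String[] args) {
        // Names exactly as they are displayed (truncated) in the question.
        System.out.println(extractTitle("Braund, Mr. Owen ..."));   // Mr.
        System.out.println(extractTitle("Cumings, Mrs. Joh..."));   // Mrs.
        System.out.println(extractTitle("Palsson, Master. ..."));   // Master.
        System.out.println(extractTitle(null));                     // Untitled
    }
}
```

A method like this could be called from inside the UDF lambda in place of the if/else chain. It may also be possible to avoid a UDF entirely by passing a similar pattern to Spark's built-in regexp_extract function, but note that regexp_extract returns an empty string rather than "Untitled" when nothing matches, so I have not verified that it reproduces the output above.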

Mohit · Answer · 2018-03-24 12:42:05Z

I think the following code would be sufficient for this task.

public class SomeClass {
...

    /**
     * Return the title of the name.
     */
    public String getTitle(String name) {
        if (name.contains("Mr.")) { // If it has Mr.
            return "Mr.";
        } else if (name.contains("Mrs.")) { // Or if has Mrs.
            return "Mrs.";
        } else if (name.contains("Miss.")) { // Or if has Miss.
            return "Miss.";
        } else if (name.contains("Master.")) { // Or if has Master.
            return "Master.";
        } else { // Not any.
            return "Untitled";
        }
    }
}

1 Comment

Martin Over a year ago

Hey, thanks for your answer. However, I need to apply a similar operation to a DataFrame column, not to a plain string! Would a UDF help with this?
