I have a Spark DataFrame, which looks like this:

+--------------------+------+----------------+-----+--------+
|         Name       |   Sex|        Ticket  |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Braund, Mr. Owen ...|  male|       A/5 21171| null|       S|
|Cumings, Mrs. Joh...|female|        PC 17599|  C85|       C|
|Heikkinen, Miss. ...|female|STON/O2. 3101282| null|       S|
|Futrelle, Mrs. Ja...|female|          113803| C123|       S|
|Palsson, Master. ...|  male|          349909| null|       S|
+--------------------+------+----------------+-----+--------+

Now I need to transform the 'Name' column so that it contains only the title, i.e. Mr., Mrs., Miss. or Master. The resulting DataFrame would be:

+--------------------+------+----------------+-----+--------+
|         Name       |   Sex|        Ticket  |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Mr.                 |  male|       A/5 21171| null|       S|
|Mrs.                |female|        PC 17599|  C85|       C|
|Miss.               |female|STON/O2. 3101282| null|       S|
|Mrs.                |female|          113803| C123|       S|
|Master.             |  male|          349909| null|       S|
+--------------------+------+----------------+-----+--------+

I tried to apply a substring operation:

List<String> list = Arrays.asList("Mr.", "Mrs.", "Miss.", "Master.");
Dataset<Row> categoricalDF2 = categoricalDF.filter(col("Name").isin(list.stream().toArray(String[]::new)));

but it seems it's not that easy in Java. How can I do this in Java? Please note that I'm using Spark 2.2.0.

2 Answers

1

I finally managed to solve this and can answer my own question. I extended Mohit's answer by wrapping the logic in a UDF:

// Imports needed at the top of the file:
import org.apache.spark.sql.api.java.UDF1;
import scala.Option;
import scala.Some;

// Field inside the class: maps a full name to its title, falling back to "Untitled".
private static final UDF1<String, Option<String>> getTitle = (String name) -> {
    if (name.contains("Mr.")) { // If it has Mr.
        return Some.apply("Mr.");
    } else if (name.contains("Mrs.")) { // Or if it has Mrs.
        return Some.apply("Mrs.");
    } else if (name.contains("Miss.")) { // Or if it has Miss.
        return Some.apply("Miss.");
    } else if (name.contains("Master.")) { // Or if it has Master.
        return Some.apply("Master.");
    } else { // None of the above.
        return Some.apply("Untitled");
    }
};

Then I had to register the preceding UDF as follows:

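import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().master("local[*]")
                    .config("spark.sql.warehouse.dir", "/home/martin/")
                    .appName("Titanic")
                    .getOrCreate();
Dataset<Row> df = ....
// Register the UDF under a name, declaring that it returns a string.
spark.sqlContext().udf().register("getTitle", getTitle, DataTypes.StringType);
Dataset<Row> categoricalDF = df.select(callUDF("getTitle", col("Name")).alias("Name"), col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
categoricalDF.show();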

The preceding code produces the following output:

+-----+------+----------------+-----+--------+
| Name|   Sex|          Ticket|Cabin|Embarked|
+-----+------+----------------+-----+--------+
|  Mr.|  male|       A/5 21171| null|       S|
| Mrs.|female|        PC 17599|  C85|       C|
|Miss.|female|STON/O2. 3101282| null|       S|
| Mrs.|female|          113803| C123|       S|
|  Mr.|  male|          373450| null|       S|
+-----+------+----------------+-----+--------+
only showing top 5 rows
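
As a possible alternative to the if/else chain, the title can be pulled out with a regular expression. The sketch below is plain Java with no Spark dependency, so it only shows the matching logic. The class name `TitleRegexSketch` and the method `extractTitle` are made up for illustration, and the pattern assumes names follow the "Surname, Title. Given names" layout shown in the question. Anchoring on the comma means a second title later in the name (for example inside parentheses) should not be picked up by mistake.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleRegexSketch {
    // Assumption: the title comes right after the comma that follows the surname.
    private static final Pattern TITLE =
            Pattern.compile(",\\s*(Mr|Mrs|Miss|Master)\\.");

    // Returns the matched title including its trailing dot, or "Untitled".
    static String extractTitle(String name) {
        if (name == null) {
            return "Untitled";
        }
        Matcher m = TITLE.matcher(name);
        return m.find() ? m.group(1) + "." : "Untitled";
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("Braund, Mr. Owen"));    // Mr.
        System.out.println(extractTitle("Cumings, Mrs. John"));  // Mrs.
        System.out.println(extractTitle("No title here"));       // Untitled
    }
}
```

The body of `extractTitle` could replace the if/else chain inside the UDF above. Spark's built-in `regexp_extract(Column, String, int)` from `org.apache.spark.sql.functions` may be able to apply a similar pattern without any UDF, but that is untested here on 2.2.0, and unmatched rows would come back as an empty string rather than "Untitled".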
0

I think the following code would be sufficient for this task.

public class SomeClass {
...

    /**
     * Return the title of the name.
     */
    public String getTitle(String name) {
        if (name.contains("Mr.")) { // If it has Mr.
            return "Mr.";
        } else if (name.contains("Mrs.")) { // Or if has Mrs.
            return "Mrs.";
        } else if (name.contains("Miss.")) { // Or if has Miss.
            return "Miss.";
        } else if (name.contains("Master.")) { // Or if has Master.
            return "Master.";
        } else { // Not any.
            return "Untitled";
        }
    }
}
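
One thing to watch with this method: `name.contains(...)` throws a `NullPointerException` when the name is null, and the sample data does contain nulls (in the Cabin column at least). Below is a minimal null-safe sketch, assuming a missing name should also map to "Untitled". The class name `TitleLookupSketch` and the `TITLES` list are hypothetical, and the list is checked in the same order as the if/else chain.

```java
import java.util.Arrays;
import java.util.List;

final class TitleLookupSketch {
    // Titles to look for, in the same order as the if/else chain.
    private static final List<String> TITLES =
            Arrays.asList("Mr.", "Mrs.", "Miss.", "Master.");

    // Returns the first title contained in the name, or "Untitled".
    static String getTitle(String name) {
        if (name == null) { // Assumption: treat a missing name as untitled.
            return "Untitled";
        }
        for (String title : TITLES) {
            if (name.contains(title)) {
                return title;
            }
        }
        return "Untitled";
    }

    public static void main(String[] args) {
        System.out.println(getTitle("Heikkinen, Miss. Laina")); // Miss.
        System.out.println(getTitle(null));                     // Untitled
    }
}
```

Keeping the titles in a list also makes it easier to add further ones later without touching the control flow.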

1 Comment

Hey, thanks for your answer. However, I need to apply this operation to a DataFrame column, not to a plain string. Would a UDF help here?
