
I have a working Spark application that executes Hive queries.

Due to new requirements, I need to remove all whitespace from a selected key.

According to the Apache Hive documentation, regexp_replace is suitable for my case:

regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
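The same caveat already applies one level earlier in plain Java, since a Java string literal consumes one backslash itself. A minimal java.util.regex sketch of the difference, independent of Hive (the sample strings are illustrative only):

```java
public class WhitespaceRegexDemo {
    public static void main(String[] args) {
        // "\\s+" in Java source code is the two-character regex \s+
        // (one or more whitespace characters)
        System.out.println("ABCD 100000".replaceAll("\\s+", "")); // ABCD100000

        // "s+" (no backslash) matches runs of the letter s, not whitespace
        System.out.println("tests".replaceAll("s+", ""));         // tet
    }
}
```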

Running this:

public class SparkSql {

    private SparkSession session = SparkSession.builder()
            .appName("hive-sql")
            .config("spark.config.option", "configuration")
            .enableHiveSupport()
            .getOrCreate();

    // Omitted code here ...

    public void execute() {
        Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\s+', '') as key from master_table");
        JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);

        for (Row row : rdd.collect())
            System.out.println(row.getString(row.fieldIndex("key")));
    }
}

Output:

ABCD 100000

Expected:

ABCD100000

For some reason, regexp_replace was not applied. What could be the reason?

1 Answer

The first step in finding the reason was to check whether the query runs in other environments.

The Hive shell returned the expected result for select regexp_replace(master_key, '\\s+', '').

\ is an escape character: if the Hive shell requires one escape character, then writing the same expression as a Java String requires one more level of escaping to pass \ through to SparkSession's SQL parser.
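This layering can be checked without a cluster. The sketch below simulates the two unescaping steps with plain Java string operations (String.replace stands in here for the SQL parser's unescaping; it is an illustration of the layering, not Spark's actual code path):

```java
public class EscapeLayersDemo {
    public static void main(String[] args) {
        // Four backslashes in Java source ...
        String sqlLiteral = "\\\\s+";
        // ... are two backslashes at runtime: the SQL parser receives \\s+
        System.out.println(sqlLiteral); // \\s+

        // The SQL parser strips one escaping level, leaving the regex \s+
        String regex = sqlLiteral.replace("\\\\", "\\");
        System.out.println(regex); // \s+

        // ... which is the pattern that actually matches whitespace
        System.out.println("ABCD 100000".replaceAll(regex, "")); // ABCD100000
    }
}
```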

So Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\s+', '') as key from master_table"); will actually pass \s+ to the SQL parser:

public void execute() {
    Dataset<Row> dataset = session.sql("select regexp_replace('test', '\\s+', '') as key from master_table");
    JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);

    for (Row row : rdd.collect())
        System.out.println(row.getString(row.fieldIndex("key")));
}

Output:

test

To pass \\s+ through to SparkSession's SQL parser, we need to add one more escape character \ per \:

public void execute() {
    Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\\\s+', '') as key from master_table");
    JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);

    for (Row row : rdd.collect())
        System.out.println(row.getString(row.fieldIndex("key")));
}

Output:

ABCD100000