
I have a working Spark application that executes Hive queries.

Due to new requirements, I need to remove all whitespace from a selected key.

According to the Apache Hive documentation, regexp_replace is suitable for my case:

regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
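The same caveat already applies one level earlier in plain Java, since a Java string literal consumes one backslash itself. A minimal java.util.regex sketch of the difference, independent of Hive (the sample strings are illustrative only):

```java
public class WhitespaceRegexDemo {
    public static void main(String[] args) {
        // "\\s+" in Java source code is the two-character regex \s+
        // (one or more whitespace characters)
        System.out.println("ABCD 100000".replaceAll("\\s+", "")); // ABCD100000

        // "s+" (no backslash) matches runs of the letter s, not whitespace
        System.out.println("tests".replaceAll("s+", ""));         // tet
    }
}
```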

Running this:

public class SparkSql {

    private SparkSession session = SparkSession.builder()
            .appName("hive-sql")
            .config("spark.config.option", "configuration")
            .enableHiveSupport()
            .getOrCreate();

    // Omitted code here ...

    public void execute() {
        Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\s+', '') as key from master_table");
        JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);

        for (Row row : rdd.collect())
            System.out.println(row.getString(row.fieldIndex("key")));
    }
}

Output:

ABCD 100000

Expected:

ABCD100000

For some reason, regexp_replace was not applied. What could be the reason?

1 Answer

The first step in finding the reason was to check whether the query runs in other environments.

The Hive shell returned the expected result for select regexp_replace(master_key, '\\s+', '').

\ is an escape character: if the Hive shell requires one escape character, then writing the same expression as a Java String requires one more level of escaping to pass \ through to SparkSession's SQL parser.
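This layering can be checked without a cluster. The sketch below simulates the two unescaping steps with plain Java string operations (String.replace stands in here for the SQL parser's unescaping; it is an illustration of the layering, not Spark's actual code path):

```java
public class EscapeLayersDemo {
    public static void main(String[] args) {
        // Four backslashes in Java source ...
        String sqlLiteral = "\\\\s+";
        // ... are two backslashes at runtime: the SQL parser receives \\s+
        System.out.println(sqlLiteral); // \\s+

        // The SQL parser strips one escaping level, leaving the regex \s+
        String regex = sqlLiteral.replace("\\\\", "\\");
        System.out.println(regex); // \s+

        // ... which is the pattern that actually matches whitespace
        System.out.println("ABCD 100000".replaceAll(regex, "")); // ABCD100000
    }
}
```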

So Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\s+', '') as key from master_table"); will actually pass \s+ to the SQL parser:

public void execute() {
    Dataset<Row> dataset = session.sql("select regexp_replace('test', '\\s+', '') as key from master_table");
    JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);

    for (Row row : rdd.collect())
        System.out.println(row.getString(row.fieldIndex("key")));
}

Output:

test

To pass \\s+ through to SparkSession's SQL parser, we need to add one more escape character \ per \:

public void execute() {
    Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\\\s+', '') as key from master_table");
    JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);

    for (Row row : rdd.collect())
        System.out.println(row.getString(row.fieldIndex("key")));
}

Output:

ABCD100000