I have a working Spark application that executes Hive queries.
With new requirements, I need to remove all whitespace from a selected key.
According to the Apache documentation, regexp_replace is suitable for my case:
> regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
>
> Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
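As a sanity check on the escaping (plain Java, no Spark involved): the Java compiler turns the source literal "\\s+" into the runtime string \s+, and that pattern does strip whitespace when handed directly to Java's regex engine:

```java
public class EscapeCheck {
    public static void main(String[] args) {
        // The Java source literal "\\s+" is the 3-character runtime string \s+
        String pattern = "\\s+";
        // Handed straight to java.util.regex, it matches whitespace as expected
        System.out.println("ABCD 100000".replaceAll(pattern, "")); // prints ABCD100000
    }
}
```

Presumably, when the same pattern travels through session.sql(...), the SQL parser applies its own unescaping on top of the Java one, so the pattern that finally reaches the regex engine may differ from what appears in the Java source.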
Running this:
public class SparkSql {
    private SparkSession session = SparkSession.builder()
            .appName("hive-sql")
            .config("spark.config.option", "configuration")
            .enableHiveSupport()
            .getOrCreate();

    // Omitted code here ...

    public void execute() {
        Dataset<Row> dataset = session.sql("select regexp_replace(master_key, '\\s+', '') as key from master_table");
        // context is a JavaSparkContext initialized in the omitted code
        JavaRDD<Row> rdd = context.parallelize(dataset.collectAsList(), factor);
        for (Row row : rdd.collect())
            System.out.println(row.getString(row.fieldIndex("key")));
    }
}
Output:
ABCD 100000
Expected:
ABCD100000
For some reason, regexp_replace does not seem to be applied at all.
What could be the reason for this?