3

Have a dataframe, which has a query as a value in one of the column, I am trying to extract the value between one/two parentheses in the first group using regex.

My Input the regex is:

select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'

Output should be:

field1

Spark Code what I used to extract the value is:

target_query = "select nvl(sum(field1),0), field2, field3 from tableName1 where partition_date='2018-03-13'"

val df1 = df.withColumn("Extract_field", regexp_extract(df("target_query"),"(?<=\\().*?(?=\\))",0))

But I'm getting

sum(field1

Please let me know how to extract only the field1. I may get sum(field1) instead of nvl(sum(field1),0) as well. Is it possible to use same regex to handle both the cases?

2
  • Will there always only be one sum in the query? Commented Apr 3, 2018 at 6:56
  • yes, the query will have only one sum. Commented Apr 3, 2018 at 7:14

1 Answer 1

3

Since the field1 value always will be inclosed with a sum you can use the following regexp:

sum\\((.*?)\\)

This will match everything enclosed by a sum().

To get the correct match it is necessary that the field you want (field1) is the first sum in the query (since you need to specify the groupid of the match in regexp_extract). In truth, as long as it is in the same position (first, second, etc.) you can correctly match it. For example:

val df1 = df.withColumn("Extract_field", regexp_extract(df("target_query"), "sum\\((.*?)\\)", 1))

The 1 at the end means that everything in the first capture group is extracted.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.