I would like to extract the 3 words that come before the phrase selay dervice, but the query returns an empty column :(

with a as (
        select * from tablename1 b 
        where lower(ptranscript) rlike 'selay dervice'
        )
        select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\s+){3}selay dervice',0)  from a

Update 1:

As Raid pointed out earlier, in Hive we cannot use \s and have to write \\s instead. I updated the regex above accordingly and it works:

with a as (
            select * from tablename1 b 
            where lower(ptranscript) rlike 'selay dervice'
            )
            select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\\s+){3}selay dervice',0)  from a
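The escaping issue can be reproduced outside Hive. Below is a minimal Python sketch that uses Python's re module as a stand-in for the Java regex engine Hive relies on. It assumes, as the Hive documentation for regexp_extract notes, that a string literal written with '\s' reaches the regex engine as a plain letter s, while '\\s' reaches it as the whitespace class. The extract helper and the sample text are hypothetical, and returning an empty string when nothing matches is an assumption of this sketch.

```python
import re

text = "this is a selay dervice"

# What the regex engine receives when the Hive literal is written with '\s':
# the string parser consumes the backslash, leaving a literal "s".
pattern_single = "([a-zA-Z0-9]+s+){3}selay dervice"

# What it receives when the literal is written with '\\s': a real \s class.
pattern_double = r"([a-zA-Z0-9]+\s+){3}selay dervice"


def extract(pattern, s):
    """Imitate regexp_extract(s, pattern, 0): whole match, or '' if none."""
    m = re.search(pattern, s)
    return m.group(0) if m else ""


single_result = extract(pattern_single, text)
double_result = extract(pattern_double, text)

print(repr(single_result))  # no match: every word would have to end in "s"
print(repr(double_result))  # the three words plus the phrase
```

With the single backslash, each of the three "words" must be followed by one or more letters s rather than by whitespace, which is why the column comes back empty.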
  • Do you mean you get no output? Or a wrong piece of text? Commented Dec 16, 2022 at 19:58
  • updated the question and explained the output Commented Dec 16, 2022 at 23:05
  • Can you show a sample of the inputs you get? Without that it's impossible to help you. Your regex works as long as those words only contain the characters you included. BTW, note that you call lower() but then look for A-Z as well. Commented Dec 21, 2022 at 19:17
  • Is this an Azure Databricks DB? From Microsoft's help, the last parameter 0 means returning the whole matched string, not the 3 words you want. To get the 3 at once you may need to add extra parentheses: (([a-zA-Z0-9]+\s+){3}), as otherwise the groups are the individual words. Testing the regex here works fine. It matches This is a selay dervice, and with the extra parentheses you get This is a. Commented Dec 21, 2022 at 20:04
  • Another comment: rlike is for regular expressions, so it might be faster to use like '%selay dervice%' instead. Following from my comment above, I think you need to use this: regexp_extract(a.ptranscript,'(([a-zA-Z0-9]+\s+){3})selay dervice',1). Commented Dec 21, 2022 at 20:22

1 Answer

Try the following:

with a as (
        select * from tablename1 b 
        where lower(ptranscript) rlike 'selay dervice'
        )
        select *,regexp_extract(lower(a.ptranscript),'(?:[a-zA-Z0-9]+ ){3}selay dervice',0)  from a

Note that if there are fewer than 3 words before selay dervice, you will get an empty result.

I tested a similar query in the latest Apache Hive and got something like this:

+----------------------------------+-----------------------------+
|               key                |          regex_ext          |
+----------------------------------+-----------------------------+
| rlk1 selay dervice               |                             |
| selay dervice k4                 |                             |
| k5 selay dervice ew              |                             |
| thre word b4 selay dervice       | thre word b4 selay dervice  |
| four word be four selay dervice  | word be four selay dervice  |
+----------------------------------+-----------------------------+
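The table can be reproduced with a short Python sketch, again using Python's re module as a stand-in for Hive's regex engine. The regexp_extract function below is a hypothetical imitation written for this example, not Hive's UDF, and it assumes an empty string is returned when nothing matches. The sketch also tries the variant suggested in the comments: wrapping the repeated group in an extra pair of parentheses and asking for group 1, which returns only the three preceding words.

```python
import re


def regexp_extract(s, pattern, idx):
    """Hypothetical imitation of regexp_extract: group idx, or '' if no match."""
    m = re.search(pattern, s)
    return m.group(idx) if m else ""


rows = [
    "rlk1 selay dervice",
    "selay dervice k4",
    "k5 selay dervice ew",
    "thre word b4 selay dervice",
    "four word be four selay dervice",
]

# Pattern from the answer: group 0 is the whole match, phrase included.
whole = r"(?:[a-zA-Z0-9]+ ){3}selay dervice"

# Variant from the comments: the outer parentheses form group 1,
# which holds only the three words (with a trailing space).
words_only = r"((?:[a-z0-9]+\s+){3})selay dervice"

for key in rows:
    full = regexp_extract(key, whole, 0)
    three = regexp_extract(key, words_only, 1).strip()
    print(f"{key!r:36} | {full!r:30} | {three!r}")
```

Rows with fewer than three words before the phrase come back empty in both columns, matching the table above.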

Edit 1: The result does not vary with or without ?:. All 3 versions below give the same result.

  1. '(?:[a-zA-Z0-9]+ )'
  2. '([a-zA-Z0-9]+ )'
  3. '([a-zA-Z0-9]+\\s)'

As per the docs, \s matches any whitespace character, not just a space.

12 Comments

What did you change? What does ?: mean in a regex? It just returns a column with null in it.
@Raid, the user wants to get those 3 words, so adding ?: to make them non-capturing groups does no good IMHO. @user2543622, can you please reply to my comments above?
@Andrew tried but not getting any output :( - just a blank column
@user2543622, can you please carefully read each one of my comments above and reply? Otherwise I can't help you.
No, \s does not work in hive. \\s works