3

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.

drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
from  vp_nlp_protext_males
;

But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.

drop table vp_hiphop;
create table vp_hiphop as
select  userid, ntext,
        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
from  vp_nlp_protext_males
;

1 Answer 1

3

You can strip all occurrences of a substring from a string using the TRANSLATE function to replace the substring with the empty string. For your query it would become this:

drop table vp_hiphop;
create table vp_hiphop as
select  userid, ntext,
        translate(ntext, ' ', '') as ntext1
from  vp_nlp_protext_males
;
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.