Hive select rows where string in column A contains string from column B

Question

I have a dataset that contains some rows with invalid entries in one column.

I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.

Pseudo code might look something like this:

SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)

I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.

The python code would look like this:

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

Which would result in something like this:

df

a                         b        c
/bobs/burgers/1234        bobs     idx 
/bobs/burgers/2234        fred     idx
/cindys/sandwhiches/3234  cindy    idx

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

a                         b        c
/bobs/burgers/1234        bobs     idx
/cindys/sandwhiches/3234  cindy    idx

Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?

leftjoin · Accepted Answer · 2018-08-24 05:37:12Z

6

In addition to like and rlike, you can use these functions:

Using instr:

select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;

Using locate:

select distinct b,c from some.log where date = current_date and locate(b, a) > 0;

See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

answered Aug 24, 2018 at 5:37

leftjoin

38.5k8 gold badges64 silver badges126 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

nobody · Accepted Answer · 2018-08-23 21:19:35Z

2

LIKE or RLIKE

select distinct b,c from some.log where date = 'today' and a like concat('%',b,'%');

Or

select distinct b,c from some.log where date = 'today' and a rlike concat('*',b,'*');

edited Aug 23, 2018 at 21:19

answered Aug 23, 2018 at 21:12

nobody

11.1k8 gold badges51 silver badges72 bronze badges

1 Comment

onofricamila Over a year ago

'like' worked for me. Note: if you are looking for an specific substring in a column, just type where column like '%substring%', and it'll do the job

Collectives™ on Stack Overflow

Hive select rows where string in column A contains string from column B

2 Answers 2

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related