2

I have a dataset that contains some rows with invalid entries in one column.

I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.

Pseudo code might look something like this:

SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)

I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.

The python code would look like this:

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

Which would result in something like this:

df

a                         b        c
/bobs/burgers/1234        bobs     idx 
/bobs/burgers/2234        fred     idx
/cindys/sandwhiches/3234  cindy    idx

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

a                         b        c
/bobs/burgers/1234        bobs     idx
/cindys/sandwhiches/3234  cindy    idx

Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?

2 Answers 2

6

In addition to like and rlike, you can use these functions:

Using instr:

select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;

Using locate:

select distinct b,c from some.log where date = current_date and locate(b, a) > 0;

See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Sign up to request clarification or add additional context in comments.

Comments

2

LIKE or RLIKE

select distinct b,c from some.log where date = 'today' and a like concat('%',b,'%');

Or

select distinct b,c from some.log where date = 'today' and a rlike concat('*',b,'*');

1 Comment

'like' worked for me. Note: if you are looking for an specific substring in a column, just type where column like '%substring%', and it'll do the job

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.