I have a dataset that contains some rows with invalid entries in one column.
I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.
Pseudo code might look something like this:
SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)
I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.
The python code would look like this:
df = df[df.apply(lambda x: x.name in x.url, axis=1)]
Which would result in something like this:
df
a b c
/bobs/burgers/1234 bobs idx
/bobs/burgers/2234 fred idx
/cindys/sandwhiches/3234 cindy idx
df = df[df.apply(lambda x: x.name in x.url, axis=1)]
a b c
/bobs/burgers/1234 bobs idx
/cindys/sandwhiches/3234 cindy idx
Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?