5

I have a big data problem in Hive (SQL).

SELECT genre, COUNT(*) AS unique_count
FROM table_name
GROUP BY genre

which gives result like:

genre           |   unique_count
----------------------------------
Romance,Crime,Drama,Law | 1560
Crime,Drama,Law,Romance | 895
Law,Romance,Crime,Drama | 942
Adventure,Action        | 3250
Action,Adventure        | 910

What I want is to sort the elements in genre ASC|DESC and get results like

genre           |   unique_count
----------------------------------
Crime,Drama,Law,Romance | 3397
Action,Adventure        | 4160

I could do this in Python but I have over 200 Million rows of data. I'm not aware of any reasonable way I can move that data. So how can I achieve this?

2
  • Fix your data structure. Storing lists of things in comma-delimited strings just causes problems. If you didn't know this already, you have now learned why this is a bad idea. Commented Feb 10, 2017 at 20:17
  • 4
    Gordon, I actually do know that but when join a company with existing data, there's little you can do than to massage the messy data. right? Commented Feb 10, 2017 at 20:24

1 Answer 1

11
select      concat_ws(',',sort_array(split(genre,','))) as genre
           ,count(*)                                    as unique_count

from        table_name

group by    concat_ws(',',sort_array(split(genre,',')))
Sign up to request clarification or add additional context in comments.

1 Comment

Prefect. Thank you!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.