2

As the number of JOINs in a Hive query increases, the query runs in multiple stages and takes a long time to execute. How can I improve the query's performance? Are there any parameters I should set?

3 Answers

4

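First of all, the largest table should be placed last in the join order, because Hive streams the last table through the reducers and buffers the others:

SELECT small.*, large.* FROM small JOIN large ON small.joinkey=large.joinkey;

Alternatively, you can use a hint to tell the optimizer which table is the largest and should be streamed, regardless of its position in the join: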

SELECT/*+ STREAMTABLE(large) */ small.*, large.* FROM large
JOIN small ON small.joinkey=large.joinkey;

Second, small tables can be cached in memory and joined on the map side (a map-side join):

set hive.auto.convert.join = true;
SELECT a.*, b.* FROM a
JOIN b ON a.joinkey=b.joinkey;

The maximum size (in bytes) of a table that is treated as small enough for a map join is set by:

set hive.mapjoin.smalltable.filesize = 1000000;
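As a rough sketch of how these two settings fit together, the session below enables automatic conversion, sets the threshold, and uses EXPLAIN to check the resulting plan. The tables a and b are the placeholders from the query above, and the 25000000-byte threshold is only an illustrative value; whether the join is actually converted depends on the table sizes Hive sees and on the Hive version.

```sql
-- Let Hive convert a common (reduce-side) join into a map-side join
-- when one side is small enough.
set hive.auto.convert.join = true;

-- Threshold in bytes: tables below this size are candidates for the
-- in-memory side of the join (illustrative value, roughly 25 MB).
set hive.mapjoin.smalltable.filesize = 25000000;

-- Inspect the plan before running the query. If the conversion applies,
-- the plan should contain a map join operator rather than a reduce-side join.
EXPLAIN
SELECT a.*, b.* FROM a
JOIN b ON a.joinkey = b.joinkey;
```

If the plan still shows a reduce-side join, the smaller table is probably above the threshold.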

I hope it helps a bit. GL!


0

In addition to the above, when the query's SELECT and WHERE clauses do not reference the right-hand table, it is a good idea to use LEFT SEMI JOIN.

The reason semi-joins are more efficient than the more general inner join is as follows. For a given record in the lefthand table, Hive can stop looking for matching records in the righthand table as soon as any match is found. At that point, the selected columns from the lefthand table record can be projected.
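A minimal sketch of such a query, reusing the placeholder tables a and b and the joinkey column from the answer above (these names are assumptions, not a real schema). Note that with LEFT SEMI JOIN the right-hand table may only be referenced in the ON clause.

```sql
-- Return the rows of a that have at least one matching row in b.
-- Only columns from the left-hand table (a) can appear in SELECT and WHERE;
-- b is referenced solely in the ON condition.
SELECT a.*
FROM a
LEFT SEMI JOIN b
  ON a.joinkey = b.joinkey;
```

This plays the role of an IN or EXISTS subquery on b, and each row of a appears at most once in the result even when b contains several matches.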


0
set hive.exec.parallel = true;

This is a general setting that lets independent stages of a query run in parallel. Beyond that, which set commands are appropriate for optimizing the query depends largely on your cluster configuration.
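A hedged sketch of how this might be applied, assuming the classic MapReduce execution engine and the placeholder tables a and b with a joinkey column (names chosen for illustration only). The thread count of 8 is an illustrative value, not a recommendation.

```sql
-- Allow stages that do not depend on each other to run at the same time.
set hive.exec.parallel = true;
-- Upper bound on how many stages may run concurrently (illustrative value).
set hive.exec.parallel.thread.number = 8;

-- The two aggregating subqueries do not depend on each other, so their
-- stages are candidates for parallel execution. EXPLAIN lists the stages
-- and their dependencies, which shows what can overlap.
EXPLAIN
SELECT x.joinkey, x.cnt AS a_cnt, y.cnt AS b_cnt
FROM (SELECT joinkey, count(*) AS cnt FROM a GROUP BY joinkey) x
JOIN (SELECT joinkey, count(*) AS cnt FROM b GROUP BY joinkey) y
  ON x.joinkey = y.joinkey;
```

Parallel stages compete for the same cluster resources, so the gain depends on how much spare capacity the cluster has.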
