As the number of JOINs in a Hive query increases, the query runs in multiple stages and takes a long time to execute. How can I improve the query performance? Are there any parameters that should be set?
3 Answers
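First of all, the largest table should be placed last in the join order, because Hive buffers the earlier tables and streams the last one through the reducers:
SELECT small.*, large.* FROM small
JOIN large ON small.joinkey=large.joinkey;
Alternatively, you can use a hint to tell the optimizer which table is the biggest and should be streamed: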
SELECT /*+ STREAMTABLE(large) */ small.*, large.* FROM large
JOIN small ON small.joinkey=large.joinkey;
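Since the question is about a growing number of joins, here is a minimal sketch of the same ordering rule with three tables. The table names small_a, small_b and large are hypothetical, and the sketch assumes all three share the same join key.

```sql
-- small_a and small_b come first, so their rows are buffered;
-- large is last in the join order, so it is the table that is streamed.
SELECT small_a.*, small_b.*, large.*
FROM small_a
JOIN small_b ON small_a.joinkey = small_b.joinkey
JOIN large   ON small_b.joinkey = large.joinkey;
```

Because both joins here use the same key, Hive can typically evaluate them in a single MapReduce job instead of one job per join, which also helps with the number of stages.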
Second, small tables can be cached in memory and joined on the map side (a map-side join):
set hive.auto.convert.join = true;
SELECT a.*, b.* FROM a
JOIN b ON a.joinkey=b.joinkey;
The size threshold (in bytes) below which a table is treated as small enough for a map join is set by:
set hive.mapjoin.smalltable.filesize = 1000000;
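One way to check that the conversion actually happens is to look at the query plan. The sketch below reuses the tables a and b from above; the threshold value is only illustrative, and the exact plan text varies with the Hive version and execution engine, so treat the operator name as something to look for rather than guaranteed output.

```sql
-- Let Hive convert common joins to map-side joins automatically.
set hive.auto.convert.join = true;
-- Tables whose files are smaller than this many bytes count as "small".
-- 25000000 is an illustrative value; tune it to the memory available.
set hive.mapjoin.smalltable.filesize = 25000000;

-- Inspect the plan: a converted join is expected to show up as a
-- "Map Join Operator" instead of a reduce-side "Join Operator".
EXPLAIN
SELECT a.*, b.* FROM a
JOIN b ON a.joinkey = b.joinkey;
```

If the plan still shows a reduce-side join, the smaller table is probably above the configured threshold.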
I hope it helps a bit. GL!
Comments
In addition to the above, when neither the SELECT nor the WHERE clause of the query references the right-hand table, it is a good idea to use a LEFT SEMI JOIN.
The reason semi-joins are more efficient than the more general inner join is as follows. For a given record in the left-hand table, Hive can stop looking for matching records in the right-hand table as soon as any match is found. At that point, the selected columns from the left-hand table record can be projected.
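A minimal sketch of such a query, reusing the tables a and b from the answer; the column a.val is hypothetical and only stands in for whatever columns you need from the left-hand table.

```sql
-- Return the rows of a that have at least one matching row in b.
-- b may appear only in the ON clause, never in SELECT or WHERE.
SELECT a.joinkey, a.val
FROM a
LEFT SEMI JOIN b ON a.joinkey = b.joinkey;
```

This gives the same rows as filtering a with an IN or EXISTS subquery on b, and unlike an inner join it does not duplicate a row of a when b contains several matches for it.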