
I have a Hive table with the following structure:

CREATE TABLE gcganamrswp_work.historical_trend_result(
  column_name string,
  metric_name string,
  current_percentage string,
  lower_threshold double,
  upper_threshold double,
  calc_status string,
  final_status string,
  support_override string,
  dataset_name string,
  insert_timestamp string,
  appid string,
  currentdate string,
  indicator map<string,string>)
PARTITIONED BY (
  appname string,
  year_month int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

I have a Spark DataFrame with this schema:

root
 |-- metric_name: string (nullable = true)
 |-- column_name: string (nullable = true)
 |-- Lower_Threshold: double (nullable = true)
 |-- Upper_Threshold: double (nullable = true)
 |-- Current_Percentage: double (nullable = true)
 |-- Calc_Status: string (nullable = false)
 |-- Final_Status: string (nullable = false)
 |-- support_override: string (nullable = false)
 |-- Dataset_Name: string (nullable = false)
 |-- insert_timestamp: string (nullable = false)
 |-- appId: string (nullable = false)
 |-- currentDate: string (nullable = false)
 |-- indicator: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
 |-- appname: string (nullable = false)
 |-- year_month: string (nullable = false)

When I try to insert into the Hive table using the code below, it fails:

    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    data_df.repartition(1)
      .write.mode("append")
      .format("hive")
      .insertInto(Outputhive_table)

Spark version: 2.4.0

Error:

ERROR Hive:1987 - Exception when loading partition with parameters partPath=hdfs://gcgprod/data/work/hive/historical_trend_result/.hive-staging_hive_2021-09-01_04-34-04_254_8783620706620422928-1/-ext-10000/_temporary/0, table=historical_trend_result, partSpec={appname=, year_month=}, replace=false, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
    at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
    at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
    at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
    at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)

I have specified the partition columns as the last columns of the DataFrame, so I expect it to treat the last two columns as partition columns. I want to use the same routine for inserting into different tables, so I don't want to specify the partition columns explicitly.
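One way to keep the routine table-agnostic is to read the destination table's column order from the catalog and select the DataFrame's columns in that order before calling `insertInto`, since `insertInto` matches columns by position rather than by name. A minimal sketch, assuming the table already exists, dynamic partitioning is enabled, and `insertAligned` is a hypothetical helper name:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Sketch: insertInto matches columns by POSITION, not by name, so align
// the DataFrame's columns to the destination table's column order first.
// In a Hive table's catalog schema, partition columns come last.
def insertAligned(spark: SparkSession, df: DataFrame, table: String): Unit = {
  val targetCols = spark.table(table).columns         // includes partition columns, last
  val aligned    = df.select(targetCols.map(col): _*) // reorder to match the table
  aligned.write.mode("append").insertInto(table)
}
```

Because the column order is looked up at runtime, the same helper works for any table whose columns (by name) exist in the DataFrame.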

  • Maybe it's because your schema says year_month int, but the DataFrame says year_month string? Commented Sep 1, 2021 at 19:38

2 Answers


Just to recap: you are using Spark to write data to a Hive table with dynamic partitions. My answer below is based on that understanding; if it's incorrect, please feel free to correct me in a comment.

Although the table is dynamically partitioned (by appname and year_month), the Spark job doesn't know the partitioning fields of the destination, so you still have to tell it about the destination table's partition columns.

Something like this should work:

import org.apache.spark.sql.SaveMode

data_df.repartition(1)
      .write
      .partitionBy("appname", "year_month")
      .mode(SaveMode.Append)
      .saveAsTable(Outputhive_table)

Make sure that you enable support for dynamic partitions by executing something like:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

Check out this post by Itai Yaffe, this may be handy https://medium.com/nmc-techblog/spark-dynamic-partition-inserts-part-1-5b66a145974f


9 Comments

I tried this code. I am getting the error: partitionBy() can't be used together with DataFrameWriter.insertInto()
What version of Spark are you running?
@Arvinth, I have modified my answer for older versions of Spark; this should work.
I am getting an error after trying the updated code: org.apache.spark.sql.AnalysisException: partition column appname,year_month is not defined in table gcganamrswp_work.historical_trend_result, defined table columns are: column_name, metric_name, current_percentage, lower_threshold, upper_threshold, calc_status, final_status, support_override, dataset_name, insert_timestamp, appid, currentdate, indicator, appname, year_month;
@arvinth you are writing to the wrong table. From your question above, the variable Outputhive_table should be set to gcgcrsnapsd_equifax_t_db.historical_trend_result, while from the error message it seems to be set to gcganamrswp_work.historical_trend_result.

I think the problem is that some records have empty strings for appname and year_month. At least this is suggested by:

Partition spec is incorrect. {appname=, year_month=}

Make sure the partition columns are never empty or null! Also note that the type of year_month is not consistent between the DataFrame and your table schema (string vs. int).
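As a quick check, you could cast year_month to match the table's DDL and fail fast if any partition value is null or empty before writing. A sketch, assuming your DataFrame is the `data_df` from the question:

```scala
import org.apache.spark.sql.functions.{col, trim}

// Cast year_month to int to match the table definition (year_month int).
val fixed = data_df.withColumn("year_month", col("year_month").cast("int"))

// Count rows whose partition values are null or empty; these would produce
// an invalid partition spec like {appname=, year_month=}.
val badRows = fixed.filter(
  col("appname").isNull || trim(col("appname")) === "" || col("year_month").isNull
).count()

require(badRows == 0, s"$badRows rows have null/empty partition values")
```

Note that casting a non-numeric string to int yields null in Spark, so the check above also catches year_month values that failed the cast.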

1 Comment

I checked: the partition columns are not empty or null. Also, I changed the datatype of year_month to int, but I am still getting the same error. Please help.
