
I am trying to insert data into a Hive table like this:

val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
val partSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("salary", IntegerType, true),
  StructField("dept", StringType, true),
  StructField("location", StringType, true)))
val partRDD = partdata.map(p => Row(p(0).toInt,p(1),p(2).toInt,p(3),p(4)))
val partDF = sqlContext.createDataFrame(partRDD, partSchema)

Packages I imported:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
import org.apache.spark.sql.types._

This is how I tried to insert the dataframe into Hive partition:

partDF.write.mode(SaveMode.Append).partitionBy("location").insertInto("parttab")

I'm getting the error below even though the Hive table exists:

org.apache.spark.sql.AnalysisException: Table not found: parttab;

Could anyone tell me what mistake I am making here and how I can correct it?


1 Answer


To write data to the Hive warehouse, you need to initialize a HiveContext instance.

Once you do that, it will pick up the configuration from hive-site.xml (on the classpath) and connect to the underlying Hive warehouse.

HiveContext is an extension of SQLContext that adds support for connecting to Hive.

To do so, try this:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

Then perform your append query on this instance.

partDF.registerTempTable("temp")

hc.sql(".... <normal sql query to pick data from table `temp`; and insert in to Hive table > ....")

Please make sure that the table parttab is under the default database.

If the table is under another database, the table name should be specified as: <db-name>.parttab

If you need to save the DataFrame directly into Hive, use this:

df.saveAsTable("<db-name>.parttab")
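Note that saveAsTable creates a new managed table by default and fails if the table already exists. If parttab already exists and you want to append to it, a minimal sketch using the DataFrameWriter API (Spark 1.4+) would be:

import org.apache.spark.sql.SaveMode

// Append rows to the existing table instead of creating a new one.
df.write.mode(SaveMode.Append).saveAsTable("<db-name>.parttab")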

3 Comments

Could you tell me where you specify the dataframe here?
I tried it like this: scala> hc.sql("insert into parttab partition(location = 'India') select id,name,salary,dept,location from ptab"), and I'm getting this error: Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@25ac587b, org.apache.derby.iapi.error.StandardException.newException(Unknown Source) at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source) Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/cloudera/metastore_db.
As you're running in the spark shell, you shouldn't instantiate a HiveContext with instance name hc; there is one created automatically called sqlContext (the name is misleading: if you compiled Spark with Hive, it is actually a HiveContext). See the similar discussion here: https://issues.apache.org/jira/browse/SPARK-9776.
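If that's the case, a minimal sketch of reusing the built-in context in spark-shell (assuming a Hive-enabled Spark 1.x build) would be:

// On a Hive-enabled build this evaluates to true, so sqlContext can be used directly.
sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
sqlContext.sql("SHOW TABLES").show() // parttab should be listed here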
