I'm automating a data pipeline with a bash script that moves CSVs to HDFS and builds external Hive tables on top of them. Currently this only works when the table schema is predefined in an .hql file, but I want to read the headers from each CSV and pass them to Hive as arguments. At the moment I do this inside a loop over the files:
# bash
hive -S -hiveconf VAR1="$target_db" -hiveconf VAR2="$filename" -hiveconf VAR3="$target_folder/$filename" -f create_tables.hql
Those variables are passed to this script:
-- hive
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
individual_pkey INT,
response CHAR(1)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
I want the Hive script to look more like this:
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
${hiveconf:ROW1} ${hiveconf:TYPE1},
... ...
${hiveconf:ROW_N} ${hiveconf:TYPE_N}
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
Is it possible to pass Hive some kind of array that it would expand into the column definitions? Is this feasible, or even advisable?
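For context, the closest thing I can think of is to skip the array entirely, build the whole column list in bash, and pass it as a single variable. Below is a rough sketch, not something I have run against Hive. It assumes that Hive substitutes `${hiveconf:...}` as plain text before parsing the statement, that the header is comma-delimited with no quoted commas, and that every header name is already a valid Hive identifier. `build_columns`, `COLUMNS` and `csv_path` are hypothetical names of my own, and every column is typed STRING because the header carries no type information.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn the header line of a CSV into a Hive column list.
# Assumption: comma-delimited header, no quoted commas, names are valid identifiers.
build_columns() {
    local csv_file=$1
    local header name cols=""
    local -a names

    # Read only the first line; strip a trailing carriage return if present.
    IFS= read -r header < "$csv_file"
    header=${header%$'\r'}

    # Split the header on commas into an array of column names.
    IFS=',' read -r -a names <<< "$header"

    for name in "${names[@]}"; do
        # Prepend ", " to every entry except the first.
        cols+="${cols:+, }${name} STRING"
    done

    printf '%s\n' "$cols"
}

# Intended use inside the existing loop (csv_path is a hypothetical variable
# holding the local path of the CSV):
#   columns=$(build_columns "$csv_path")
#   hive -S -hiveconf VAR1="$target_db" -hiveconf VAR2="$filename" \
#        -hiveconf VAR3="$target_folder/$filename" \
#        -hiveconf COLUMNS="$columns" -f create_tables.hql
```

The .hql file would then contain a single `${hiveconf:COLUMNS}` between the parentheses instead of one variable per column. If the header row stays in the file, it would presumably also be read as a data row unless it is skipped somehow.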