2

I have a file which contains the data as follows

[{"col1":"col1","col2":1}
,{"col1":"col11","col2":11}
,{"col1":"col111","col2":2}
]

I am trying to load the table in Hive.

I am using following Hive serde

CREATE EXTERNAL TABLE my_table (
      my_array ARRAY<struct<col1:string,col2:int>>
)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";

I am getting error when I try to run select * after running the create command -

['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:499', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:307', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:878', 'sun.reflect.GeneratedMethodAccessor29:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy35:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:559', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:751', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1717', 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1702', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected:29:4', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:521', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:428', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:147', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2207', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:494', '*org.apache.hadoop.hive.serde2.SerDeException:java.io.IOException: Start token not found where expected:30:1', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:184', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:502', '*java.io.IOException:Start token not found where expected:30:0', 'org.apache.hive.hcatalog.data.JsonSerDe:deserialize:JsonSerDe.java:170'], statusCode=3), results=None, hasMoreRows=None)

I tried several things, none of which worked as expected. I can't change the input data format as it is someone else who is providing the data.

3 Answers 3

1

This is a malformed JSON issue. A JSON file will always have "curly braces" at the beginning and the end. So change your JSON file to look something like below.

{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}

Create your table in the exact same way as you are doing it already.

CREATE EXTERNAL TABLE my_table 
(
      my_array ARRAY<struct<col1:string,col2:int>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( "ignore.malformed.json" = "true")
LOCATION "MY_LOCATION";

Now fire a select * on your newly created table to see following results.

[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]

Use select my_array.col1 from my_table; to see the values for col1 from your array.

["col1","col11","col111"]


PS - Not the most efficient way to store the data. Consider transforming the data and storing it as ORC/Parquet.


Hope that helps!

Sign up to request clarification or add additional context in comments.

2 Comments

The last query should be select my_array.col1 from my_table;
Thanks for pointing out, Manoj. I have updated the answer to include the array column name.
0

Looks like the issue is with your json data. Can you try with below example?

Create employee json with below content and place it in hdfs.

[root@quickstart spark]# hadoop fs -cat /user/cloudera/spark/employeejson/*
{"Name":"Vinayak","age":35}
{"Name":"Nilesh","age":37}
{"Name":"Raju","age":30}
{"Name":"Karthik","age":28}
{"Name":"Shreshta","age":1}
{"Name":"Siddhish","age":2}

Add below jar(execute only if you get any error. )

hive> ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

hive> 
CREATE TABLE employeefromjson(name string, age int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/hive/employeefromjson'
;

hive> LOAD DATA INPATH '/user/cloudera/spark/employeejson' OVERWRITE INTO TABLE employeefromjson;

hive> select * from employeefromjson;
OK
Vinayak 35
Nilesh  37
Raju    30
Karthik 28
Shreshta    1
Siddhish    2
Time taken: 0.174 seconds, Fetched: 6 row(s)

1 Comment

The example you are referring to is json row one per line. The data I have shared is a json array. Both the datasets look correct to me. I am able to read json like yours, but failing in reading json array structure like I posted.
0

A JSON should always start with '{' and not with '['. That is the problem. As you know, JSON has a structure of {'key':'value'}. What you have given in your file is a value which does not have any key. So, change your JSON to the below formmat

{"my_array":[{"col1":"col1","col2":1},{"col1":"col11","col2":11},{"col1":"col111","col2":2}]}

Your Create table statement should work fine.

If you want to get the data for each column for all the rows, use the below query.

select my_array.col1, my_array.col2 from my_table;

The above command will give you the below result.

 OK
["col1","col11","col111"]       [1,11,2]

If you want to get the result column wise for each row seperately, use the below query.

select a.* from my_table m lateral view outer inline (m.my_array) a;

The above command will give you the below result.

OK
col1    1
col11   11
col111  2

Hope you this helps!

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.