Java read from json file using Apache Spark specifying the Schema

Question

I have a JSON file in the following format:

{"_t":1480647647,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1480647676,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1483161958,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1483162393,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1483499947,"_p":"[email protected]","_n":"app_loaded","device_type":"desktop"}
{"_t":1505361824,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505362047,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505362372,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505362854,"_p":"[email protected]","_n":"added_to_team","account":"1234"}
{"_t":1505366071,"_p":"[email protected]","_n":"added_to_team","account":"1234"}

I'm using Apache Spark in my Java application to read this JSON file and save it in Parquet format.

Without a schema definition, the file parses without problems. Here is my code:

Dataset<Row> dataset = spark.read().json(pathToFile);
dataset.show(100);

And here is the console output:

+-------------+------------------+----------+-------+-------+-----------+
|           _n|                _p|        _t|account|channel|device_type|
+-------------+------------------+----------+-------+-------+-----------+
|   app_loaded| [email protected]|1480647647|   null|   null|    desktop|
|   app_loaded| [email protected]|1480647676|   null|   null|    desktop|
|   app_loaded| [email protected]|1483161958|   null|   null|    desktop|
|   app_loaded| [email protected]|1483162393|   null|   null|    desktop|
|   app_loaded| [email protected]|1483499947|   null|   null|    desktop|
|added_to_team|   [email protected]|1505361824|   1234|   null|       null|
|added_to_team|    [email protected]|1505362047|   1234|   null|       null|
...

When I use a schema definition like this:

StructType schema = new StructType();
schema.add("_n", StringType, true);
schema.add("_p", StringType, true);
schema.add("_t", TimestampType, true);
schema.add("account", StringType, true);
schema.add("channel", StringType, true);
schema.add("device_type", StringType, true);
// Read data from file
Dataset<Row> dataset = spark.read().schema(schema).json(pathToFile);
dataset.show(100);

I get this console output:

++
||
++
||
||
||
||
...

What's wrong with the schema definition?

Alper t. Turker · Accepted Answer · 2018-02-09 20:17:45Z


StructType is immutable: each add call returns a new StructType and leaves the original unchanged, so your code discards every addition. If you print it

schema.printTreeString();

you'll see it doesn't contain any field:

root

You should use:

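// Assumes: import org.apache.spark.sql.types.StructType;
//          import static org.apache.spark.sql.types.DataTypes.*;
// Each add() returns a new StructType, so chain the calls and keep the final result.
StructType schema = new StructType()
  .add("_n", StringType, true)
  .add("_p", StringType, true)
  .add("_t", TimestampType, true)
  .add("account", StringType, true)
  .add("channel", StringType, true)
  .add("device_type", StringType, true);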
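
The difference between the two forms can be reproduced without Spark. The following is only a sketch: Schema here is a hypothetical stand-in class written for illustration, not Spark's StructType, and the names ImmutableAddDemo, discarded and chained are likewise made up. It assumes nothing beyond the behaviour described above, namely that add returns a new object instead of modifying the receiver.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ImmutableAddDemo {

    // Hypothetical immutable schema: add() never modifies this instance.
    static final class Schema {
        private final List<String> fields;

        Schema() {
            this(Collections.emptyList());
        }

        private Schema(List<String> fields) {
            this.fields = fields;
        }

        // Returns a NEW Schema containing the extra field.
        Schema add(String name) {
            List<String> copy = new ArrayList<>(fields);
            copy.add(name);
            return new Schema(Collections.unmodifiableList(copy));
        }

        int size() {
            return fields.size();
        }
    }

    // Mirrors the question: the return values of add() are thrown away.
    static int discarded() {
        Schema schema = new Schema();
        schema.add("_n");
        schema.add("_p");
        return schema.size();
    }

    // Mirrors the answer: the calls are chained and the final result is kept.
    static int chained() {
        Schema schema = new Schema().add("_n").add("_p");
        return schema.size();
    }

    public static void main(String[] args) {
        System.out.println("discarded: " + discarded()); // prints "discarded: 0"
        System.out.println("chained: " + chained());     // prints "chained: 2"
    }
}
```

Reassigning on every line (schema = schema.add(...)) would work equally well; chaining is just the shorter way to keep each returned value.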
