
I have a gzipped JSON file that contains an array of JSON objects, something like this:

[{"Product":{"id"1,"image":"/img.jpg"},"Color":"black"},{"Product":{"id"2,"image":"/img1.jpg"},"Color":"green"}.....]

I know this is not the ideal data format to read into Scala, but there is no other alternative: the feed has to be processed in this form.

I have tried:

spark.read.json("file-path") 

which seems to take a long time (it processes very quickly when the data is in the MBs, but takes very long for GBs worth of data), probably because Spark is not able to split the file and distribute it across the other executors.

I wanted to see if there is any way to preprocess this data and load it into the Spark context as a DataFrame.

The functionality I want seems to be similar to: Create pandas dataframe from json objects. But I wanted to see if there is a Scala alternative that could do something similar and convert the data to a Spark RDD / DataFrame.

  • So if you already know that gzip is the issue, what kind of answer do you expect, other than don't use gzip or unpack the files first? There is really no magic that will turn gzip into a Hadoop / Spark friendly format. Commented Apr 24, 2018 at 18:19
  • Agreed with @user6910411: do the splitting outside Spark. Storing raw JSON in HDFS and reading it into Spark isn't ideal either. Consider Parquet with Snappy compression (a quick sketch follows these comments). Commented Apr 25, 2018 at 21:12
  • Also possible duplicate of: stackoverflow.com/questions/40492967/… Commented Apr 25, 2018 at 21:17
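
A minimal sketch of the Parquet suggestion above, assuming the JSON has already been loaded into a DataFrame once; the paths and partition count are placeholders, and Snappy is set explicitly even though it is the default Parquet codec in recent Spark versions:

    // One-time conversion: rewrite the feed as Parquet so later jobs read a
    // splittable, columnar, compressed format instead of gzipped JSON.
    val df = spark.read.json("file-path")        // slow, single-partition read, done only once

    df.repartition(64)                           // choose a partition count that suits the cluster
      .write
      .option("compression", "snappy")
      .parquet("/data/feed.parquet")

    // Subsequent jobs read the Parquet copy in parallel:
    val feed = spark.read.parquet("/data/feed.parquet")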

2 Answers


You can read the gzip file using spark.read.text("gzip-file-path"). Since Spark's APIs are built on top of the HDFS API, Spark can read the gzip file and decompress it transparently.

https://github.com/mesos/spark/blob/baa30fcd99aec83b1b704d7918be6bb78b45fbb5/core/src/main/scala/spark/SparkContext.scala#L239

However, gzip is non-splittable, so Spark creates an RDD with a single partition. Hence, reading gzip files with Spark does not make much sense.
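
As a quick illustration (the path here is just a placeholder), you can observe the single partition directly:

    val lines = spark.read.text("/data/feed.json.gz")   // gzip is decompressed transparently
    println(lines.rdd.getNumPartitions)                  // prints 1: the entire file lands in one partition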

You may decompress the gzip file and read the decompressed files instead, to get the most out of the distributed processing architecture.
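
A minimal sketch of that approach, assuming the file is reachable on a local path and is decompressed outside the Spark job itself (paths are placeholders):

    import java.io.{FileInputStream, FileOutputStream}
    import java.util.zip.GZIPInputStream

    // Decompress once so the plain copy is no longer limited by gzip's lack of split support.
    val in  = new GZIPInputStream(new FileInputStream("/data/feed.json.gz"))
    val out = new FileOutputStream("/data/feed.json")
    val buf = new Array[Byte](64 * 1024)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
    in.close(); out.close()

    val df = spark.read.json("/data/feed.json")

Note that if the whole feed is still a single JSON array on one line, it remains effectively one record even after decompression, which is why the preprocessing in the other answer (one object per line) is what actually restores parallelism.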




It appeared to be a problem with the data format being given to Spark for processing. I had to pre-process the data into a Spark-friendly format and run the Spark jobs over that. This is the preprocessing I ended up doing: https://github.com/dipayan90/bigjsonprocessor/blob/master/src/main/java/com/kajjoy/bigjsonprocessor/Application.java
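
For reference, a rough Scala sketch of the same idea (the linked Java program is the actual implementation): stream the gzipped array with Jackson and rewrite it as newline-delimited JSON, which spark.read.json can then split across executors. The paths and the feed.jsonl name are illustrative.

    import java.io.{FileInputStream, PrintWriter}
    import java.util.zip.GZIPInputStream
    import com.fasterxml.jackson.core.JsonToken
    import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

    val mapper = new ObjectMapper()
    val parser = mapper.getFactory.createParser(
      new GZIPInputStream(new FileInputStream("/data/feed.json.gz")))
    val out = new PrintWriter("/data/feed.jsonl")

    parser.nextToken()                                   // step onto the opening '[' of the array
    while (parser.nextToken() == JsonToken.START_OBJECT) {
      val node: JsonNode = mapper.readTree(parser)       // read one element of the array
      out.println(node.toString)                         // emit it as one compact line (JSON Lines)
    }
    out.close(); parser.close()

    val df = spark.read.json("/data/feed.jsonl")         // now splittable and read in parallel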

