
I have a Kinesis Firehose delivery stream that puts data to S3. However, in the data file the JSON objects have no separator between them, so it looks something like this:

{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}

In Apache Spark I am reading the data file like this:

df = spark.read.schema(schema).json(path, multiLine=True)

This reads only the first JSON object in the file; the rest are ignored because there is no separator.

How can I resolve this issue in Spark?

  • Fix the upstream process? Anything you'll do in Spark will be at least somewhat inefficient and ugly. Commented Jan 12, 2018 at 1:58
  • Makes sense, but I would like to know the RDD-based approach to solve this, or any better approach of course. Commented Jan 12, 2018 at 1:59
  • Off the top of my head: you can use wholeTextFiles and parse manually, but it is bad performance-wise. You can try to use a Hadoop input format with a custom delimiter if the structure is always delimited by }{, and then fix the records (a rough sketch of that idea follows below), but it is a hack. You can implement your own input format, but not in Python, and it is a lot of code for such a problem. But honestly, if the process is under your control, don't waste time fixing the symptoms; fix the problem :) Commented Jan 12, 2018 at 2:04
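
For reference, here is a minimal sketch of the Hadoop-input-format idea from the comment above. It is only a sketch: it assumes the objects are always glued together as exactly }{, the S3 path is a placeholder, and schema is the one defined in the question.

# Sketch only: TextInputFormat with a custom record delimiter splits the file at every "}{".
rdd = sc.newAPIHadoopFile(
    "s3://your-bucket/path/",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "}{"})

# Splitting on "}{" strips the braces at each boundary, so restore them per record.
def restore_braces(kv):
    s = kv[1].strip()
    if not s.startswith("{"):
        s = "{" + s
    if not s.endswith("}"):
        s = s + "}"
    return s

df = spark.read.schema(schema).json(rdd.map(restore_braces))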

1 Answer


You can use the SparkContext's wholeTextFiles API to read each JSON file as a Tuple2(filename, whole text), transform the whole text into one JSON object per record, and finally use sqlContext to read it as JSON into a DataFrame.

sqlContext\
    .read\
    .json(sc
          .wholeTextFiles("path to your multiline json file")
          .values()
          .flatMap(lambda x: x
                   .replace("\n", "#!#")      # flatten the whole file onto one line
                   .replace("}{", "}#!#{")    # keep a marker between concatenated objects
                   .replace("{#!# ", "{")     # drop the markers that fall inside an object
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   .split("#!#")))\
    .show()

You should get a DataFrame like this:

+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+

You can modify the code according to your needs, though.
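
If the placeholder juggling feels brittle, here is a minimal alternative sketch that splits the whole text directly on the boundary between objects. It assumes top-level objects are only ever glued together as }{ (possibly with whitespace in between) and that }{ never occurs inside a string value:

import re

raw = sc.wholeTextFiles("path to your multiline json file").values()

def to_json_object(piece):
    # splitting on "}{" removes the braces at the boundary, so put them back
    piece = piece.strip()
    if not piece.startswith("{"):
        piece = "{" + piece
    if not piece.endswith("}"):
        piece = piece + "}"
    return piece

records = raw.flatMap(lambda text: re.split(r"\}\s*\{", text)).map(to_json_object)

sqlContext.read.json(records).show()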


1 Comment

Hi, my data is structured as follows, what might you recommend if I wanted restaurant id as values in one column, latitude and longitude in other columns? Thanks! ===> [{"restaurant_id": "1234", "infos": [{"timestamp": "2020-02-03T00:57:26.000Z", "longitude": "-123, "latitude": "456"}{"restaurant_id": "5678", "infos":[{"timestamp": "2....
