
I have a Kinesis Firehose delivery stream that puts data to S3. However, in the data file the JSON objects have no separator between them, so it looks something like this:

{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}

In Apache Spark I am reading the data file like this:

df = spark.read.schema(schema).json(path, multiLine=True)

This reads only the first JSON object in the file; the rest are ignored because there is no separator.

How can I resolve this issue in Spark?

  • Fix the upstream process? Anything you'll do in Spark will be at least somewhat inefficient and ugly. Commented Jan 12, 2018 at 1:58
  • Makes sense, but I would like to know the RDD-based approach to solve this, or any better approach of course. Commented Jan 12, 2018 at 1:59
  • Off the top of my head: you can use wholeTextFiles and parse manually, but it is bad performance-wise. You can try to use a Hadoop input format with a custom delimiter if the structure is always delimited by }{, and then fix the records (a rough sketch of that idea follows below), but it is a hack. You can implement your own input format, but not in Python, and it is a lot of code for such a problem. But honestly, if the process is under your control, don't waste time fixing the symptoms; fix the problem :) Commented Jan 12, 2018 at 2:04
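
For reference, here is a minimal sketch of the Hadoop-input-format idea from the comment above. It is only a sketch: it assumes the objects are always glued together as exactly }{, the S3 path is a placeholder, and schema is the one defined in the question.

# Sketch only: TextInputFormat with a custom record delimiter splits the file at every "}{".
rdd = sc.newAPIHadoopFile(
    "s3://your-bucket/path/",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "}{"})

# Splitting on "}{" strips the braces at each boundary, so restore them per record.
def restore_braces(kv):
    s = kv[1].strip()
    if not s.startswith("{"):
        s = "{" + s
    if not s.endswith("}"):
        s = s + "}"
    return s

df = spark.read.schema(schema).json(rdd.map(restore_braces))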

1 Answer


You can use the SparkContext's wholeTextFiles API to read each JSON file as a Tuple2(filename, whole text), transform the whole text into one JSON object per record, and finally use sqlContext to read it as JSON into a DataFrame.

sqlContext\
    .read\
    .json(sc
          .wholeTextFiles("path to your multiline json file")
          .values()
          .flatMap(lambda x: x
                   .replace("\n", "#!#")      # flatten the whole file onto one line
                   .replace("}{", "}#!#{")    # keep a marker between concatenated objects
                   .replace("{#!# ", "{")     # drop the markers that fall inside an object
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   .split("#!#")))\
    .show()

You should get a DataFrame like this:

+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+

You can modify the code according to your needs, though.
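
If the placeholder juggling feels brittle, here is a minimal alternative sketch that splits the whole text directly on the boundary between objects. It assumes top-level objects are only ever glued together as }{ (possibly with whitespace in between) and that }{ never occurs inside a string value:

import re

raw = sc.wholeTextFiles("path to your multiline json file").values()

def to_json_object(piece):
    # splitting on "}{" removes the braces at the boundary, so put them back
    piece = piece.strip()
    if not piece.startswith("{"):
        piece = "{" + piece
    if not piece.endswith("}"):
        piece = piece + "}"
    return piece

records = raw.flatMap(lambda text: re.split(r"\}\s*\{", text)).map(to_json_object)

sqlContext.read.json(records).show()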


1 Comment

Hi, my data is structured as follows, what might you recommend if I wanted restaurant id as values in one column, latitude and longitude in other columns? Thanks! ===> [{"restaurant_id": "1234", "infos": [{"timestamp": "2020-02-03T00:57:26.000Z", "longitude": "-123, "latitude": "456"}{"restaurant_id": "5678", "infos":[{"timestamp": "2....
