I have the following file, test.json:

{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}

I want to load this file into an RDD. This is what I tried:

rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))

I got an error:

Expecting object: line 1 column 1 (char 0)

I don't completely understand what json.loads does.

How can I resolve this problem?

  • Probably a duplicate of stackoverflow.com/questions/39430868/… Commented Oct 30, 2017 at 10:05
  • The JSON format is not well suited to processing with Spark's textFile, which reads input line by line, whereas your JSON object spans multiple lines. If you can get your data in the JSON Lines format (each JSON object "flattened" onto a single line), that will work. Alternatively, you can keep the data in the format above and use sc.wholeTextFiles. This returns a key/value RDD, where the key is the filename and the value is the file content. You can then wrap the json.loads call above in a function and apply it via mapPartitions. Commented Oct 30, 2017 at 10:07
  • Possible duplicate of how to read json with schema in spark dataframes/spark sql Commented Oct 30, 2017 at 10:25
  • This is actually not a dupe. Commented Oct 31, 2017 at 9:12
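
The JSON Lines suggestion in the comments can be sketched in plain Python, without Spark. This is a minimal illustration, assuming the file is small enough to read into memory at once; the helper name to_json_line is hypothetical and not part of any library.

```python
import json


def to_json_line(pretty_text):
    """Re-serialize one multi-line JSON document as a single line."""
    # json.dumps without an indent argument emits no newlines.
    return json.dumps(json.loads(pretty_text))


# The file content from the question, as a single string.
pretty = '''{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}'''

line = to_json_line(pretty)

# The flattened form parses on its own, which is what a
# line-by-line reader such as sc.textFile needs.
parsed = json.loads(line)
print(line)
```

Writing one such line per object to a file gives input where each line can be passed to json.loads independently.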

1 Answer

textFile reads data line by line. Individual lines of your input are not syntactically valid JSON.
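
The failure can be reproduced without Spark. The sketch below is only an approximation of what map(json.loads) over textFile does: it splits the file content from the question into lines and parses each one separately, then parses the whole content at once. The variable names are illustrative.

```python
import json

# The file content from the question, as a single string.
content = '''{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}'''

lines = content.splitlines()

# Parse each line on its own, as a line-by-line reader would.
line_errors = []
for line in lines:
    try:
        json.loads(line)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        line_errors.append((line, str(exc)))

# Parsing the whole document at once succeeds.
record = json.loads(content)

print(len(line_errors), "of", len(lines), "lines failed to parse")
print(record["name"])
```

None of the individual lines is a complete JSON value, so every per-line call raises, while the full text parses into a single dict.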

Just use the JSON reader:

spark.read.json("test.json", multiLine=True)

or (not recommended) read whole text files:

sc.wholeTextFiles("test.json").values().map(json.loads)

4 Comments

Thanks for your answer. Looks like a fair approach. However, I was using Spark 1.6, which does not have the spark module. What worked for me was: rddj = hiveContext.jsonFile("input file path").
Does spark.read.json load data into an RDD or into a DataFrame? I have a huge JSON file, approximately 1 TB, so it needs to be loaded into an RDD.
@FemnDharamshi It loads into a DataFrame.
I went for the second option, as the first one creates a DataFrame, not an RDD.