Reading a JSON file into an RDD (not a DataFrame) using PySpark

Question

I have the following file, test.json:

{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}

I want to load this file into an RDD. This is what I tried:

import json

rddj = sc.textFile('test.json')
rdd_res = rddj.map(lambda x: json.loads(x))

I got an error:

Expecting object: line 1 column 1 (char 0)

I don't completely understand what json.loads does.

How can I resolve this problem?

Probably a duplicate of stackoverflow.com/questions/39430868/…
– Alexandre Dupriez, Commented Oct 30, 2017 at 10:05
The JSON format is not well suited to processing with Spark's textFile, which reads the input line by line, whereas your JSON object spans multiple lines. If you can get your data in the JSON Lines format (each JSON object "flattened" onto a single line), that will work. Alternatively, you can keep the data in the format above and use sc.wholeTextFiles. This returns a key/value RDD, where the key is the filename and the value is the file content. You can then wrap the json.loads call above in a function and apply it via mapPartitions.
– ags29, Commented Oct 30, 2017 at 10:07
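To illustrate the JSON Lines idea from the comment above, here is a minimal sketch in plain Python (no Spark required). It assumes the record from the question; the variable names are made up for the example.

```python
import json

# The record from the question, as a Python dict.
record = {"id": 1, "name": "A green door", "price": 12.50, "tags": ["home", "green"]}

# json.dumps without an indent argument emits the object on a single line.
line = json.dumps(record)

# A JSON Lines file holds one such line per record (two copies here for illustration).
jsonl_text = "\n".join([line, line])

# Each line now parses on its own, which is what a per-line map needs.
parsed = [json.loads(l) for l in jsonl_text.splitlines()]
```

With a file in this shape, the original sc.textFile('test.json').map(json.loads) attempt should work, because every line handed to json.loads is a complete JSON document.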
Possible duplicate of "how to read json with schema in spark dataframes/spark sql"
– Raúl Reguillo Carmona, Commented Oct 30, 2017 at 10:25

Alper t. Turker · Accepted Answer · 2017-10-30 10:02:48Z

3

textFile reads data line by line. Individual lines of your input are not syntactically valid JSON.
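The failure can be reproduced without Spark. The sketch below feeds the file from the question to json.loads one line at a time, which is roughly what textFile followed by map(json.loads) does, and then parses the whole text at once. The helper try_parse is a hypothetical name introduced only for this illustration.

```python
import json

# The contents of test.json from the question, as a single string.
text = """{
    "id": 1,
    "name": "A green door",
    "price": 12.50,
    "tags": ["home", "green"]
}"""

def try_parse(fragment):
    """Return the parsed value, or None if fragment is not valid JSON."""
    try:
        return json.loads(fragment)
    except json.JSONDecodeError:
        return None

# One parse per line: every line is only a fragment of the object, so each fails.
per_line = [try_parse(line) for line in text.splitlines()]

# Parsing the whole document in one call succeeds.
whole = json.loads(text)
```

Here every entry of per_line is None, while whole is the expected dictionary.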

Just use the JSON reader:

spark.read.json("test.json", multiLine=True)

or (not recommended) wholeTextFiles:

sc.wholeTextFiles("test.json").values().map(json.loads)
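A slightly expanded sketch of the wholeTextFiles route, combined with the mapPartitions suggestion from the comments. It assumes an existing SparkContext is passed in as sc; the function names parse_json_documents and load_json_rdd are hypothetical. Note that wholeTextFiles reads each file as a single record, so this is suited to files that fit comfortably in memory.

```python
import json

def parse_json_documents(contents):
    """Parse each whole-file string from an iterator into a Python object."""
    for content in contents:
        yield json.loads(content)

def load_json_rdd(sc, path):
    """Return an RDD with one parsed object per file.

    sc is assumed to be an existing SparkContext.
    """
    # wholeTextFiles yields (filename, content) pairs; keep only the content,
    # then parse the documents partition by partition.
    return sc.wholeTextFiles(path).values().mapPartitions(parse_json_documents)
```

In a PySpark shell this could be called as load_json_rdd(sc, "test.json").collect(), which should return a list holding one dictionary for the file in the question.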


4 Comments

Yash Over a year ago

Thanks for your answer. Looks like a fair approach. However, I was using Spark 1.6, which does not have the spark module. What worked for me was: rddj = hiveContext.jsonFile("input file path").

Femn Dharamshi Over a year ago

Does spark.read.json load data into an RDD or into a DataFrame? I have a huge JSON file, approximately 1 TB, so it needs to be loaded into an RDD.

Adarsh Kumar Over a year ago

@FemnDharamshi It loads into a Dataframe

ZygD Over a year ago

I went for the second option, as the first one creates a dataframe, not an RDD.
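If the DataFrame reader is still preferred for the parsing step, its result can be turned into an RDD afterwards through the DataFrame's rdd attribute. This is a minimal sketch, assuming an existing SparkSession is passed in as spark; both function names are hypothetical.

```python
def dataframe_to_dict_rdd(df):
    """Return an RDD of plain dicts built from a DataFrame."""
    # DataFrame.rdd exposes the rows as an RDD of Row objects;
    # Row.asDict(recursive=True) also converts nested Rows to dicts.
    return df.rdd.map(lambda row: row.asDict(recursive=True))

def load_json_as_rdd(spark, path):
    """Read a multi-line JSON file with the DataFrame reader, then switch to an RDD.

    spark is assumed to be an existing SparkSession.
    """
    return dataframe_to_dict_rdd(spark.read.json(path, multiLine=True))
```

The design choice here is to let Spark's JSON reader handle the multi-line parsing and only then drop down to the RDD API, instead of parsing the text by hand.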
