I am new to PySpark. Can anyone help me read JSON data using PySpark? Here is what we have done:

(1) main.py

import os.path
from pyspark.sql import SparkSession

def fileNameInput(filename,spark):

    try:
        if(os.path.isfile(filename)):
            loadFileIntoHdfs(filename,spark)
        else:
            print("File does not exist")
    except OSError:
        print("Error while finding file")


def loadFileIntoHdfs(fileName,spark):
    df = spark.read.json(fileName)
    df.show()


if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    file_name = input("Enter file location : ")
    fileNameInput(file_name,spark)

When I run the above code, it throws this error message:

 File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column

Thanks in advance

  • Please share the JSON content Commented Mar 22, 2018 at 11:44
  • { "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] } Commented Mar 22, 2018 at 11:45

1 Answer

Your JSON works in my PySpark. I can get a similar error when the record text spans multiple lines. Please ensure that each record fits on one line. Alternatively, tell Spark to support multi-line records:

spark.read.json(filename, multiLine=True)

What works:

{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }

That outputs:

spark.read.json('/home/ernest/Desktop/brokenjson.json').printSchema()
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)

When I try some input like this:

{
  "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }

Then I get the corrupt record in the schema:

root
 |-- _corrupt_record: string (nullable = true)

But when read with the multiLine option, the latter input works too.
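
If changing the reader options is not practical, the other route mentioned earlier (keeping each record on one line) can be sketched with the standard library alone. The helper name `to_single_line` is made up for this illustration, and the approach assumes the file holds a single JSON document small enough to load into memory:

```python
import json


def to_single_line(pretty_text):
    """Re-serialize one JSON document so that it occupies a single line."""
    record = json.loads(pretty_text)
    # json.dumps without an indent argument emits no newlines.
    return json.dumps(record)


# The multi-line input from above that produced _corrupt_record.
multi_line = """{
  "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }"""

single_line = to_single_line(multi_line)
print(single_line)
```

Writing `single_line` back to the file should give Spark the one-record-per-line layout that its default JSON reader expects.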

1 Comment

You are my hero, I spent ages trying to figure this out. Strangely, using .option('multiline', 'true') didn't work for me!
