
I am reading the contents of an API into a dataframe using the PySpark code below in a Databricks notebook. I validated the JSON payload and the string is in valid JSON format. I suspect the error is due to the multiline JSON string. The code below worked fine with other JSON API payloads.

Spark version < 2.2

import requests
user = "usr"
password = "aBc!23"
response = requests.get('https://myapi.com/allcolor', auth=(user, password))
jsondata = response.json()
from pyspark.sql import *
df = spark.read.json(sc.parallelize([jsondata]))
df.show()

JSON payload:

{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    },
    {
      "color": "red",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          0,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "blue",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          0,
          0,
          255,
          1
        ],
        "hex": "#00F"
      }
    },
    {
      "color": "yellow",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          0,
          1
        ],
        "hex": "#FF0"
      }
    },
    {
      "color": "green",
      "category": "hue",
      "type": "secondary",
      "code": {
        "rgba": [
          0,
          255,
          0,
          1
        ],
        "hex": "#0F0"
      }
    }
  ]
}

Error:

pyspark.sql.dataframe.DataFrame = [_corrupt_record: string]

Modified code:

spark.sql("set spart.databricks.delta.preview.enabled=true")
spark.sql("set spart.databricks.delta.retentionDutationCheck.preview.enabled=false")
import json
import requests
from requests.auth import HTTPDigestAuth
import pandas as pd
user = "username"
password = "password"
myResponse = requests.get('https://myapi.com/allcolor', auth=(user, password))
if(myResponse.ok):
  jData = json.loads(myResponse.content)
  s1 = json.dumps(jData)
  #load data from api
  x = json.loads(s1)
  data = pd.read_json(json.dumps(x))
  #create dataframe
  spark_df = spark.createDataFrame(data)
  spark_df.show()          
  spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net","<your-storage-account-access-key>")
  spark_df.write.mode("overwrite").json("wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/")
else:
  myResponse.raise_for_status()
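
For what it is worth, the intermediate pandas step is where the structure changes: pd.read_json turns each element of the "colors" array into its own row, so spark.createDataFrame ends up with one record per color. A minimal sketch with an abbreviated, hypothetical payload (not the full response) to show the shape pandas produces:

import io
import json
import pandas as pd

# Abbreviated stand-in for the payload above (two colors, "code" omitted),
# only to illustrate the shape pandas builds.
payload = {"colors": [{"color": "black", "category": "hue"},
                      {"color": "white", "category": "value"}]}

pdf = pd.read_json(io.StringIO(json.dumps(payload)))
print(pdf)
# One column named "colors" holding one dict per row, roughly:
#                                     colors
# 0    {'color': 'black', 'category': 'hue'}
# 1  {'color': 'white', 'category': 'value'}
# spark.createDataFrame(pdf) therefore sees one row per color, and each row
# is later written out as its own {"colors": {...}} document.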

The output is not in the same format as the source.

Modified output (not the same as the source):

{
  "colors": 
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [
          255,
          255,
          255,
          1
        ],
        "hex": "#000"
      }
    }
    }
{
  "colors":     
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [
          0,
          0,
          0,
          1
        ],
        "hex": "#FFF"
      }
    }
    }

Could you please point out where I am going wrong, as the output file I am storing in ADLS Gen2 does not match the source API JSON payload?

Comment: Tried using spark.read.option("multiline", "true").json(sc.parallelize([data])); it throws the same error. (Mar 10, 2021)

1 Answer


Remove newlines before calling spark.read.json:

df = spark.read.json(sc.parallelize([jsondata.replace('\n','')]))

df.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|colors                                                                                                                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[hue, [#000, [255, 255, 255, 1]], black, primary], [value, [#FFF, [0, 0, 0, 1]], white,], [hue, [#FF0, [255, 0, 0, 1]], red, primary], [hue, [#00F, [0, 0, 255, 1]], blue, primary], [hue, [#FF0, [255, 255, 0, 1]], yellow, primary], [hue, [#0F0, [0, 255, 0, 1]], green, secondary]]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

df.printSchema()
root
 |-- colors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- code: struct (nullable = true)
 |    |    |    |-- hex: string (nullable = true)
 |    |    |    |-- rgba: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- color: string (nullable = true)
 |    |    |-- type: string (nullable = true)
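
A side note, not part of the original answer: if jsondata here is the parsed Python object returned by response.json() rather than a raw string, serialising it back with json.dumps gives a compact single-line JSON string, so no newline handling is needed at all. A minimal sketch under that assumption:

import json
# Assumes jsondata = response.json() from the question, i.e. a Python dict/list.
# json.dumps emits valid single-line JSON (double quotes, no embedded newlines),
# which spark.read.json can parse from an RDD containing that one string.
df = spark.read.json(sc.parallelize([json.dumps(jsondata)]))
df.show(truncate=False)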

Comment: Thanks @mck for your response. I am not sure what might be causing this.
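
On the remaining ADLS Gen2 point: Spark's JSON writer emits one JSON document per row (JSON Lines), so the part files produced by spark_df.write.json will not reproduce the nested layout of the source payload. If the goal is to land the payload in storage exactly as the API returned it, one option is to write the raw response text directly. A sketch, reusing the placeholders from the question, with a hypothetical file name, and assuming the storage credentials are visible to dbutils (they may need to be set on the cluster rather than only via spark.conf.set):

# Store the payload verbatim instead of re-serialising it through Spark.
raw_json = myResponse.text  # the body exactly as the API returned it
dbutils.fs.put(
    "wasbs://<container>@<storage-account-name>.blob.core.windows.net/<directory>/colors.json",
    raw_json,
    True,  # overwrite if the file already exists
)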
