
I have a test2.json file that contains simple JSON:

{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}

I have uploaded my file to blob storage and I create a DataFrame from it:

df = spark.read.json("/example/data/test2.json")

then I can see it without any problems:

df.show()
+------+-----------+------+---------+--------------------+
|Author|BlogEntries|Caller|     Name|                 Url|
+------+-----------+------+---------+--------------------+
|jangcy|        100|jangcy|something|https://stackover...|
+------+-----------+------+---------+--------------------+

Second scenario: I have the very same JSON string declared within my notebook:

newJson = '{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}'

I can print it, etc. But now, if I try to create a DataFrame from it:

df = spark.read.json(newJson)

I get the 'Relative path in absolute URI' error:

Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 249, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'

Should I apply additional transformations to the newJson string? If so, what should they be? Please forgive me if this is too trivial, as I am very new to Python and Spark.

I am using a Jupyter notebook with the PySpark3 kernel.

Thanks in advance.


1 Answer


You can do the following:

newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
df = spark.read.json(sc.parallelize([newJson]))
df.show(truncate=False)

which should give

+------+-----------+------+---------+-------------------------+
|Author|BlogEntries|Caller|Name     |Url                      |
+------+-----------+------+---------+-------------------------+
|jangcy|100        |jangcy|something|https://stackoverflow.com|
+------+-----------+------+---------+-------------------------+
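spark.read.json treats a plain string argument as a path, so Spark tries to parse the JSON text itself as a URI, which is what produces the URISyntaxException; handing it an RDD of JSON strings, as above, avoids that. An alternative that skips the RDD entirely is to parse the string with the standard json module and build the DataFrame directly. A minimal sketch, assuming the same spark session and newJson string as in the question:

import json
from pyspark.sql import Row

# Parse the JSON text into a Python dict, then build a one-row DataFrame;
# Spark infers the schema from the Row's fields.
df = spark.createDataFrame([Row(**json.loads(newJson))])
df.show(truncate=False)

Either way, the result is the same single-row DataFrame shown above.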

Comments

Thank you very much, Ramesh. Works like a charm! :)
I have the same JSON in a DataFrame as a column. I am unable to parse it. How can I achieve this? Please help.
There is a built-in function for that: spark.apache.org/docs/latest/api/java/org/apache/spark/sql/… @NaveenSrikanth
Thanks, Ramesh. I am trying to build on the same logic. Will there be any performance impact when reading 400-500 million JSONs?
I used spark.read.json(sc.wholeTextFiles("s3a://jsonparser_coding/").values()), Ramesh, and was able to read the files successfully. :) :)
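For the commenter asking about a JSON string stored as a DataFrame column, the truncated link above presumably points at from_json in pyspark.sql.functions, which parses a string column against a schema. A minimal sketch (the payload column name and the reduced schema are assumptions for illustration):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Assumed example: a DataFrame with a single string column holding JSON text.
src = spark.createDataFrame([('{"Name":"something","BlogEntries":100}',)], ["payload"])

# Schema for the embedded JSON (an assumption for this sketch).
schema = StructType([
    StructField("Name", StringType()),
    StructField("BlogEntries", LongType()),
])

# from_json returns a struct column; nested fields can then be selected.
parsed = src.withColumn("parsed", F.from_json("payload", schema))
parsed.select("parsed.Name", "parsed.BlogEntries").show()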
