
(Apache Spark Version 2.3.1 on Databricks)

Hello, I have a JSON dump that looks like this:

[{"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10342083, "venue_id": 273277, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18647, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 472158, "visitorteam_coach_id": 474616}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 18783, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 15251, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 14:00:00", "timezone": "UTC", "timestamp": 1530885600, "time": "14:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": null}, {"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10344350, "venue_id": 8869, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18743, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 474720, "visitorteam_coach_id": 474796}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 16781, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 18704, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 18:00:00", "timezone": "UTC", "timestamp": 1530900000, "time": "18:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": 
null}]

I am trying to convert it to a dataframe directly from a variable instead of a JSON file upload; mainly because I get the JSON data from a GET request to an API.

This is my code for the conversion:

countries = spark.read.option("multiline", "true").json(json.dumps(ts)).show(false)

It gives me the error below. Please point me in the right direction. I checked around, but I only see solutions for Scala; I am looking for a Python fix for the same problem.

IllegalArgumentException: u'java.net.URISyntaxException: Relative path in absolute URI: "[{\"standings\":%20%7B%5C%22visitorteam_position%5C%22:%201,%20%5C%22localteam_position%5C%22:%201%7D,%20%5C%22season_id%5C%22:%20892,%20%5C%22pitch%5C%22:%20null,%20%5C%22commentaries%5C%22:%20null,%20%5C%22id%5C%22:%2010342083,%20%5C%22venue_id%5C%22:%20273277,%20%5C%22formations%5C%22:%20%7B%5C%22localteam_formation%5C%22:%20null,%20%5C%22visitorteam_formation%5C%22:%20null%7D,%20%5C%22aggregate_id%5C%22:%20null,%20%5C%22round_id%5C%22:%20null,%20%5C%22visitorteam_id%5C%22:%2018647,%20%5C%22winning_odds_calculated%5C%22:%20false,%20%5C%22deleted%5C%22:%20false,%20%5C%22coaches%5C%22:%20%7B%5C%22localteam_coach_id%5C%22:%20472158,%20%5C%22visitorteam_coach_id%5C%22:%20474616%7D,%20%5C%22attendance%5C%22:%20null,%20%5C%22scores%5C%22:%20%7B%5C%22ft_score%5C%22:%20null,%20%5C%22visitorteam_score%5C%22:%200,%20%5C%22et_score%5C%22:%20null,%20%5C%22localteam_pen_score%5C%22:%20null,%20%5C%22visitorteam_pen_score%5C%22:%20null,%20%5C%22localteam_score%5C%22:%200,%20%5C%22ht_score%5C%22:%20null%7D,%20%5C%22referee_id%5C%22:%2018783,%20%5C%22stage_id%5C%22:%201728,%20%5C%22weather_report%5C%22:%20null,%20%5C%22league_id%5C%22:%20732,%20%5C%22localteam_id%5C%22:%2015251,%20%5C%22time%5C%22:%20%7B%5C%22status%5C%22:%20%5C%22NS%5C%22,%20%5C%22starting_at%5C%22:%20%7B%5C%22date%5C%22:%20%5C%222018-07-06%5C%22,%20%5C%22date_time%5C%22:%20%5C%222018-07-06%2014:00:00%5C%22,%20%5C%22timezone%5C%22:%20%5C%22UTC%5C%22,%20%5C%22timestamp%5C%22:%201530885600,%20%5C%22time%5C%22:%20%5C%2214:00:00%5C%22%7D,%20%5C%22extra_minute%5C%22:%20null,%20%5C%22injury_time%5C%22:%20null,%20%5C%22second%5C%22:%20null,%20%5C%22added_time%5C%22:%20null,%20%5C%22minute%5C%22:%20null%7D,%20%5C%22group_id%5C%22:%20null%7D,%20%7B%5C%22standings%5C%22:%20%7B%5C%22visitorteam_position%5C%22:%201,%20%5C%22localteam_position%5C%22:%201%7D,%20%5C%22season_id%5C%22:%20892,%20%5C%22pitch%5C%22:%20null,%20%5C%22commentarie
s%5C%22:%20null,%20%5C%22id%5C%22:%2010344350,%20%5C%22venue_id%5C%22:%208869,%20%5C%22formations%5C%22:%20%7B%5C%22localteam_formation%5C%22:%20null,%20%5C%22visitorteam_formation%5C%22:%20null%7D,%20%5C%22aggregate_id%5C%22:%20null,%20%5C%22round_id%5C%22:%20null,%20%5C%22visitorteam_id%5C%22:%2018743,%20%5C%22winning_odds_calculated%5C%22:%20false,%20%5C%22deleted%5C%22:%20false,%20%5C%22coaches%5C%22:%20%7B%5C%22localteam_coach_id%5C%22:%20474720,%20%5C%22visitorteam_coach_id%5C%22:%20474796%7D,%20%5C%22attendance%5C%22:%20null,%20%5C%22scores%5C%22:%20%7B%5C%22ft_score%5C%22:%20null,%20%5C%22visitorteam_score%5C%22:%200,%20%5C%22et_score%5C%22:%20null,%20%5C%22localteam_pen_score%5C%22:%20null,%20%5C%22visitorteam_pen_score%5C%22:%20null,%20%5C%22localteam_score%5C%22:%200,%20%5C%22ht_score%5C%22:%20null%7D,%20%5C%22referee_id%5C%22:%2016781,%20%5C%22stage_id%5C%22:%201728,%20%5C%22weather_report%5C%22:%20null,%20%5C%22league_id%5C%22:%20732,%20%5C%22localteam_id%5C%22:%2018704,%20%5C%22time%5C%22:%20%7B%5C%22status%5C%22:%20%5C%22NS%5C%22,%20%5C%22starting_at%5C%22:%20%7B%5C%22date%5C%22:%20%5C%222018-07-06%5C%22,%20%5C%22date_time%5C%22:%20%5C%222018-07-06%2018:00:00%5C%22,%20%5C%22timezone%5C%22:%20%5C%22UTC%5C%22,%20%5C%22timestamp%5C%22:%201530900000,%20%5C%22time%5C%22:%20%5C%2218:00:00%5C%22%7D,%20%5C%22extra_minute%5C%22:%20null,%20%5C%22injury_time%5C%22:%20null,%20%5C%22second%5C%22:%20null,%20%5C%22added_time%5C%22:%20null,%20%5C%22minute%5C%22:%20null%7D,%20%5C%22group_id%5C%22:%20null%7D%5D%22'

Output for

print(ts)

Out[45]: 
[{u'aggregate_id': None,
  u'attendance': None,
  u'coaches': {u'localteam_coach_id': 472158, u'visitorteam_coach_id': 474616},
  u'commentaries': None,
  u'deleted': False,
  u'formations': {u'localteam_formation': None,
   u'visitorteam_formation': None},
  u'group_id': None,
  u'id': 10342083,
  u'league_id': 732,
  u'localteam_id': 15251,
  u'pitch': None,
  u'referee_id': 18783,
  u'round_id': None,
  u'scores': {u'et_score': None,
   u'ft_score': None,
   u'ht_score': None,
   u'localteam_pen_score': None,
   u'localteam_score': 0,
   u'visitorteam_pen_score': None,
   u'visitorteam_score': 0},
  u'season_id': 892,
  u'stage_id': 1728,
  u'standings': {u'localteam_position': 1, u'visitorteam_position': 1},
  u'time': {u'added_time': None,
   u'extra_minute': None,
   u'injury_time': None,
   u'minute': None,
   u'second': None,
   u'starting_at': {u'date': u'2018-07-06',
    u'date_time': u'2018-07-06 14:00:00',
    u'time': u'14:00:00',
    u'timestamp': 1530885600,
    u'timezone': u'UTC'},
   u'status': u'NS'},
  u'venue_id': 273277,
  u'visitorteam_id': 18647,
  u'weather_report': None,
  u'winning_odds_calculated': False},
 {u'aggregate_id': None,
  u'attendance': None,
  u'coaches': {u'localteam_coach_id': 474720, u'visitorteam_coach_id': 474796},
  u'commentaries': None,
  u'deleted': False,
  u'formations': {u'localteam_formation': None,
   u'visitorteam_formation': None},
  u'group_id': None,
  u'id': 10344350,
  u'league_id': 732,
  u'localteam_id': 18704,
  u'pitch': None,
  u'referee_id': 16781,
  u'round_id': None,
  u'scores': {u'et_score': None,
   u'ft_score': None,
   u'ht_score': None,
   u'localteam_pen_score': None,
   u'localteam_score': 0,
   u'visitorteam_pen_score': None,
   u'visitorteam_score': 0},
  u'season_id': 892,
  u'stage_id': 1728,
  u'standings': {u'localteam_position': 1, u'visitorteam_position': 1},
  u'time': {u'added_time': None,
   u'extra_minute': None,
   u'injury_time': None,
   u'minute': None,
   u'second': None,
   u'starting_at': {u'date': u'2018-07-06',
    u'date_time': u'2018-07-06 18:00:00',
    u'time': u'18:00:00',
    u'timestamp': 1530900000,
    u'timezone': u'UTC'},
   u'status': u'NS'},
  u'venue_id': 8869,
  u'visitorteam_id': 18743,
  u'weather_report': None,
  u'winning_odds_calculated': False}]

print(json.dumps(ts))

Out[44]: '[{"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10342083, "venue_id": 273277, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18647, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 472158, "visitorteam_coach_id": 474616}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 18783, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 15251, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 14:00:00", "timezone": "UTC", "timestamp": 1530885600, "time": "14:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": null}, {"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10344350, "venue_id": 8869, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18743, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 474720, "visitorteam_coach_id": 474796}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 16781, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 18704, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 18:00:00", "timezone": "UTC", "timestamp": 1530900000, "time": "18:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, 
"group_id": null}]'

Thanks in advance!

PS. - Here is the link on how to do it with Scala - http://spark.apache.org/docs/2.2.0/sql-programming-guide.html#tab_scala_5

  • If ts is in the format you have posted, then json.dumps(ts) would be a JSON string containing \n, as in [{'aggregate_id': None,\n 'attendance': None,\n 'coaches':... Isn't that so? Commented Jul 6, 2018 at 5:01

1 Answer


You said

I am trying to convert it to a dataframe directly from a variable instead of a JSON file upload; mainly because I get the JSON data from a GET request to an API.

So I assume ts is a string variable like

ts = """[{"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10342083, "venue_id": 273277, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18647, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 472158, "visitorteam_coach_id": 474616}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 18783, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 15251, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 14:00:00", "timezone": "UTC", "timestamp": 1530885600, "time": "14:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": null}, {"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10344350, "venue_id": 8869, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18743, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 474720, "visitorteam_coach_id": 474796}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 16781, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 18704, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 18:00:00", "timezone": "UTC", "timestamp": 1530900000, "time": "18:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, 
"group_id": null}]"""

Now, json.dumps(ts) gives you a string, and .json(json.dumps(ts)) treats that string as a path. That is what the error message is telling you:

IllegalArgumentException: u'java.net.URISyntaxException: Relative path in absolute URI: "[{\"standings\":%20%7B%5C%22visitorteam_position%5C%22:%201,%20%5C%22localteam_position%5C%22:%201%7D,%20%5C%22season_id%5C%22:%20892,%20%5C

And the API documentation says the following:

.... :param path: string represents path to the JSON dataset, or a list of paths, or RDD of Strings storing JSON objects. .......
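Whether json.dumps is needed at all depends on what ts actually holds. Here is a minimal sketch in plain Python 3, assuming ts is either an already-serialized JSON string (as assumed above) or a parsed list of dicts (as the print(ts) output in the question suggests). The helper name to_json_text is hypothetical and not part of any library.

```python
import json

def to_json_text(ts):
    """Return ts as JSON text, serializing only when it is not already a string."""
    if isinstance(ts, str):
        json.loads(ts)  # raises ValueError if the string is not valid JSON
        return ts
    return json.dumps(ts)

parsed = [{"id": 10342083, "pitch": None}, {"id": 10344350, "pitch": None}]
text = to_json_text(parsed)

# Calling json.dumps on text a second time wraps it in quotes and escapes the
# inner quotes, which is no longer the JSON array Spark should parse.
double_encoded = json.dumps(text)
```

Either way, the result is still only a string, so it must not be handed to .json() directly.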

So if you want to use the variable ts, then, as the API documentation says, you have to convert the JSON string (ts itself, under the assumption above) to an RDD of strings:

# Wrap the JSON string in a list so the RDD holds a single element: the whole document.
tsRDD = sc.parallelize([ts])
df = spark.read.option('multiline', "true").json(tsRDD)

which should give the correct dataframe

+------------+----------+----------------+------------+-------+----------+--------+--------+---------+------------+-----+----------+--------+------------+---------+--------+---------+------------------------------------------------------------------------+--------+--------------+--------------+-----------------------+
|aggregate_id|attendance|coaches         |commentaries|deleted|formations|group_id|id      |league_id|localteam_id|pitch|referee_id|round_id|scores      |season_id|stage_id|standings|time                                                                    |venue_id|visitorteam_id|weather_report|winning_odds_calculated|
+------------+----------+----------------+------------+-------+----------+--------+--------+---------+------------+-----+----------+--------+------------+---------+--------+---------+------------------------------------------------------------------------+--------+--------------+--------------+-----------------------+
|null        |null      |[472158, 474616]|null        |false  |[,]       |null    |10342083|732      |15251       |null |18783     |null    |[,,,, 0,, 0]|892      |1728    |[1, 1]   |[,,,,, [2018-07-06, 2018-07-06 14:00:00, 14:00:00, 1530885600, UTC], NS]|273277  |18647         |null          |false                  |
|null        |null      |[474720, 474796]|null        |false  |[,]       |null    |10344350|732      |18704       |null |16781     |null    |[,,,, 0,, 0]|892      |1728    |[1, 1]   |[,,,,, [2018-07-06, 2018-07-06 18:00:00, 18:00:00, 1530900000, UTC], NS]|8869    |18743         |null          |false                  |
+------------+----------+----------------+------------+-------+----------+--------+--------+---------+------------+-----+----------+--------+------------+---------+--------+---------+------------------------------------------------------------------------+--------+--------------+--------------+-----------------------+

root
 |-- aggregate_id: string (nullable = true)
 |-- attendance: string (nullable = true)
 |-- coaches: struct (nullable = true)
 |    |-- localteam_coach_id: long (nullable = true)
 |    |-- visitorteam_coach_id: long (nullable = true)
 |-- commentaries: string (nullable = true)
 |-- deleted: boolean (nullable = true)
 |-- formations: struct (nullable = true)
 |    |-- localteam_formation: string (nullable = true)
 |    |-- visitorteam_formation: string (nullable = true)
 |-- group_id: string (nullable = true)
 |-- id: long (nullable = true)
 |-- league_id: long (nullable = true)
 |-- localteam_id: long (nullable = true)
 |-- pitch: string (nullable = true)
 |-- referee_id: long (nullable = true)
 |-- round_id: string (nullable = true)
 |-- scores: struct (nullable = true)
 |    |-- et_score: string (nullable = true)
 |    |-- ft_score: string (nullable = true)
 |    |-- ht_score: string (nullable = true)
 |    |-- localteam_pen_score: string (nullable = true)
 |    |-- localteam_score: long (nullable = true)
 |    |-- visitorteam_pen_score: string (nullable = true)
 |    |-- visitorteam_score: long (nullable = true)
 |-- season_id: long (nullable = true)
 |-- stage_id: long (nullable = true)
 |-- standings: struct (nullable = true)
 |    |-- localteam_position: long (nullable = true)
 |    |-- visitorteam_position: long (nullable = true)
 |-- time: struct (nullable = true)
 |    |-- added_time: string (nullable = true)
 |    |-- extra_minute: string (nullable = true)
 |    |-- injury_time: string (nullable = true)
 |    |-- minute: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- starting_at: struct (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- date_time: string (nullable = true)
 |    |    |-- time: string (nullable = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- timezone: string (nullable = true)
 |    |-- status: string (nullable = true)
 |-- venue_id: long (nullable = true)
 |-- visitorteam_id: long (nullable = true)
 |-- weather_report: string (nullable = true)
 |-- winning_odds_calculated: boolean (nullable = true)
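The RDD route can also be wrapped in a small function. The sketch below assumes an existing SparkSession (the spark object on Databricks) and that ts is a parsed list of dicts; records_to_df is a hypothetical name. It is a variant of the code above, not the same call: it puts one JSON object per RDD element instead of the whole array in a single element. It also shows why the list wrapper matters: parallelize iterates over its argument, and a bare string iterates character by character.

```python
import json

def records_to_df(spark, records):
    """Build a DataFrame from parsed JSON records without writing a file."""
    # One JSON document per RDD element; each element becomes one row.
    json_lines = [json.dumps(record) for record in records]
    rdd = spark.sparkContext.parallelize(json_lines)
    return spark.read.json(rdd)

# Why the list wrapper is needed: a str is itself iterable, character by character.
sample = '[{"id": 1}]'
as_single_element = [sample]   # one element holding the whole document
as_characters = list(sample)   # what iterating over a bare string yields
```

With a live session this would be called as df = records_to_df(spark, ts).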

Alternatively, you can save the contents of the variable to a file and use

df = spark.read.option('multiline', "true").json(path_to_file)  # path_to_file: path of the saved JSON file

which works just as well as the suggestion above.
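For the file route, one option is to write the records as newline-delimited JSON, which spark.read.json reads without the multiline option. This is a minimal sketch, assuming ts is a parsed list of dicts; write_json_lines and the file name are hypothetical. On a cluster the path must be one that Spark itself can read, so a driver-local temporary file like the one used here may not be enough.

```python
import json
import os
import tempfile

def write_json_lines(records, path):
    """Write one JSON object per line to path and return the path."""
    with open(path, "w") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path

records = [{"id": 10342083, "deleted": False}, {"id": 10344350, "deleted": False}]
path = write_json_lines(records, os.path.join(tempfile.mkdtemp(), "fixtures.json"))

# With a live SparkSession named spark, the file could then be loaded with:
# df = spark.read.json(path)
```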

I hope the answer is helpful


9 Comments

Hi Ramesh, thanks a lot for your reply. I did try the serialize function, but it returned corrupted data because I didn't add the square brackets around ts as an argument. Is there a reason we are doing that, given that ts is already a stringified list of JSON objects?
I would need to see the datatype and data of ts. Samples would do.
I have appended the data dump in the question above, could you please check?
If ts is in the format you have posted, then json.dumps(ts) would be a JSON string containing \n, as in [{'aggregate_id': None,\n 'attendance': None,\n 'coaches':... Isn't that so?
Somehow it doesn't show that way on Databricks output with the print function! Works as expected in the final output though.