Convert JSON Lines files to a pandas DataFrame using the read_json function?

Question

Scrapy and some other Python libraries have started to read and write the JSON Lines format for JSON files.

I am trying to convert a file that follows the JSON Lines specification to a pandas DataFrame using the read_json(...) function.

My file "input.json" looks like this, with one line per capture:

{"A": {"page": 1, "name": "foo", "url": "xxx"}, "B": {"page": 1, "name": "bar", "url": "http://xxx"}, "C": {"page": 3, "name": "foo", "url": "http://xxx"}}
{"D": {"page": 2, "name": "bar", "url": "xxx"}, "E": {"page": 2, "name": "bar", "url": "http://xxx"}, "F": {"page": 3, "name": "foo", "url": "http://xxx"}}

What I want as output:

   page name         url
A     1  foo         xxx
B     1  bar  http://xxx
C     3  foo  http://xxx
D     2  bar         xxx
E     2  bar  http://xxx
F     3  foo  http://xxx

My first attempt was the following, but the result is not correct:

print(pd.read_json("file:///input.json", orient='index', lines=True))

I see in the pandas docs that orient='index' expects the format {index -> {column -> value}}, but the result below shows that I am misunderstanding something:

                                                 0                                                1
A         {'page': 1, 'url': 'xxx', 'name': 'foo'}                                              NaN
B  {'page': 1, 'url': 'http://xxx', 'name': 'bar'}                                              NaN
C  {'page': 3, 'url': 'http://xxx', 'name': 'foo'}                                              NaN
D                                              NaN         {'page': 2, 'url': 'xxx', 'name': 'bar'}
E                                              NaN  {'page': 2, 'url': 'http://xxx', 'name': 'bar'}
F                                              NaN  {'page': 3, 'url': 'http://xxx', 'name': 'foo'}

r.ook · Accepted Answer · 2018-02-08 20:39:44Z

5

You can consider using a combination of stack(), reset_index() and apply() to get what you want. Two lines are all you need:

df = pd.read_json("file:///input.json", orient='index', lines=True).stack().reset_index(level=1, drop=True)

# Here the .stack() basically flattens your extraneous columns into one.
# .reset_index() is to remove the extra index level that was added by stack()
#
# df
#
# A           {'page': 1, 'name': 'foo', 'url': 'xxx'}
# B    {'page': 1, 'name': 'bar', 'url': 'http://xxx'}
# C    {'page': 3, 'name': 'foo', 'url': 'http://xxx'}
# D           {'page': 2, 'name': 'bar', 'url': 'xxx'}
# E    {'page': 2, 'name': 'bar', 'url': 'http://xxx'}
# F    {'page': 3, 'name': 'foo', 'url': 'http://xxx'}
# dtype: object

df = df.apply(pd.Series, index=df.iloc[0].keys())

# Here you use .apply() to extract the dictionary into columns by applying them as a Series.
# the index keyword is to sort it per the keys of first dictionary in the df.
#
# df
#
#        page name         url
#  A        1  foo         xxx
#  B        1  bar  http://xxx
#  C        3  foo  http://xxx
#  D        2  bar         xxx
#  E        2  bar  http://xxx
#  F        3  foo  http://xxx

Bit of a hack, but helps you interpret the jsonlines correctly without going through a loop.
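As a self-contained sketch of the same stack-and-expand idea, the snippet below reads the question's two sample lines from an in-memory buffer instead of a file. Note the assumptions: it relies on the default orient of read_json rather than orient='index', adds a dropna() in case stack() keeps the NaN cells, and expands the dictionaries with the DataFrame constructor instead of apply(pd.Series). The names raw, frame, stacked and result are made up for illustration, and behavior may differ between pandas versions.

```python
from io import StringIO

import pandas as pd

# Sample data copied from the question: one JSON object per line.
raw = (
    '{"A": {"page": 1, "name": "foo", "url": "xxx"}, '
    '"B": {"page": 1, "name": "bar", "url": "http://xxx"}, '
    '"C": {"page": 3, "name": "foo", "url": "http://xxx"}}\n'
    '{"D": {"page": 2, "name": "bar", "url": "xxx"}, '
    '"E": {"page": 2, "name": "bar", "url": "http://xxx"}, '
    '"F": {"page": 3, "name": "foo", "url": "http://xxx"}}\n'
)

# One row per input line, one column per top-level key (A..F);
# each cell holds either a dict or NaN.
frame = pd.read_json(StringIO(raw), lines=True)

# Flatten to a single Series, drop the NaN cells, and discard the
# line-number level so that only the keys A..F remain as the index.
stacked = frame.stack().dropna().droplevel(0)

# Expand each dict into columns (page, name, url).
result = pd.DataFrame(stacked.tolist(), index=stacked.index)
print(result)
```

Building the frame from a list of dictionaries avoids calling pd.Series once per row, which tends to matter on larger files.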


1 Comment

Chris Seeling Over a year ago

I tried this now (years after the answer) but it works only with orient="columns" rather than orient="index". Hope someone can confirm this.

Espoir Murhabazi · Accepted Answer · 2018-02-09 06:42:14Z

3

As you are working with JSON Lines, you need to:

read the file line by line,
convert each line to a dictionary,
create a dataframe from that dictionary,
append it to a list of dataframes,
and finally concatenate those dataframes using pandas concat.

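And voilà:

import json
import pandas as pd

line_list = []
with open('sample.json') as f:
    for line in f:
        # Each line is a standalone JSON object: {key -> {column -> value}}
        a_dict = json.loads(line)
        # Transpose so the keys (A, B, ...) become the row index
        df = pd.DataFrame(a_dict).T
        line_list.append(df)

# Stack the per-line dataframes on top of each other
df = pd.concat(line_list)

And here is the desired output: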

  name page         url
A  foo    1         xxx
B  bar    1  http://xxx
C  foo    3  http://xxx
D  bar    2         xxx
E  bar    2  http://xxx
F  foo    3  http://xxx
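A possible variation on the loop above, sketched under the assumption that the top-level keys (A, B, ...) are unique across the whole file: merge every line into one dictionary and build a single dataframe at the end, instead of creating and concatenating one dataframe per line. The helper name jsonl_to_frame and the in-memory sample are hypothetical and only stand in for the real file.

```python
import io
import json

import pandas as pd


def jsonl_to_frame(lines):
    """Merge JSON Lines objects of the form {key -> {column -> value}}
    into one DataFrame indexed by key.

    Assumes keys are unique across lines; a repeated key would
    silently overwrite the earlier record.
    """
    records = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        records.update(json.loads(line))
    return pd.DataFrame.from_dict(records, orient="index")


# In-memory stand-in for the file; open('sample.json') works the same way.
sample = io.StringIO(
    '{"A": {"page": 1, "name": "foo", "url": "xxx"}, '
    '"B": {"page": 1, "name": "bar", "url": "http://xxx"}}\n'
    '{"D": {"page": 2, "name": "bar", "url": "xxx"}}\n'
)

df = jsonl_to_frame(sample)
print(df)
```

Here from_dict(..., orient="index") plays the role of the transpose in the loop version, since each top-level key becomes a row.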

edited Feb 9, 2018 at 6:42

answered Feb 8, 2018 at 18:58
