2

Scrapy and some other Python libraries have started to read and write the JSON Lines format for JSON files.

I am trying to convert JSON files that follow the JSON Lines specification into a pandas DataFrame using the read_json(...) function.

My file "input.json" looks like this, with one line per capture:

{"A": {"page": 1, "name": "foo", "url": "xxx"}, "B": {"page": 1, "name": "bar", "url": "http://xxx"}, "C": {"page": 3, "name": "foo", "url": "http://xxx"}}
{"D": {"page": 2, "name": "bar", "url": "xxx"}, "E": {"page": 2, "name": "bar", "url": "http://xxx"}, "F": {"page": 3, "name": "foo", "url": "http://xxx"}} 

What I want as output:

  page name url
A 1    foo  xxx
B 1    bar  http://xxx
C 3    foo  http://xxx
D 2    bar  xxx
E 2    bar  http://xxx
F 3    foo  http://xxx

My first attempt was the following, but the result is not correct:

print( pd.read_json("file:///input.json", orient='index', lines=True))

The pandas documentation says that orient='index' expects the layout {index -> {column -> value}}, but the result below shows that I am misunderstanding something:

                                                 0                                                1
A         {'page': 1, 'url': 'xxx', 'name': 'foo'}                                              NaN
B  {'page': 1, 'url': 'http://xxx', 'name': 'bar'}                                              NaN
C  {'page': 3, 'url': 'http://xxx', 'name': 'foo'}                                              NaN
D                                              NaN         {'page': 2, 'url': 'xxx', 'name': 'bar'}
E                                              NaN  {'page': 2, 'url': 'http://xxx', 'name': 'bar'}
F                                              NaN  {'page': 3, 'url': 'http://xxx', 'name': 'foo'}
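
For comparison, here is a minimal sketch of the input shape that orient='index' is documented for: a single JSON object of the form {index -> {column -> value}}, rather than one object per line. The string below is a shortened, made-up copy of the first line of input.json, and the variable names are only illustrative.

```python
import io

import pandas as pd

# Hypothetical single JSON document (one object, no JSON Lines involved).
doc = '{"A": {"page": 1, "name": "foo", "url": "xxx"}, "B": {"page": 1, "name": "bar", "url": "http://xxx"}}'

# With one document, the outer keys become the index and the inner keys the columns.
single = pd.read_json(io.StringIO(doc), orient="index")
print(single)
```

With lines=True the file holds several such objects, which may be why the nested dictionaries end up unexpanded in the result above.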

2 Answers

5

You can consider using a combination of stack(), reset_index() and apply() to get what you want. Two lines are all you need:

df = pd.read_json("file:///input.json", orient='index', lines=True).stack().reset_index(level=1, drop=True)

# Here the .stack() basically flattens your extraneous columns into one.
# .reset_index() is to remove the extra index level that was added by stack()
#
# df
#
# A           {'page': 1, 'name': 'foo', 'url': 'xxx'}
# B    {'page': 1, 'name': 'bar', 'url': 'http://xxx'}
# C    {'page': 3, 'name': 'foo', 'url': 'http://xxx'}
# D           {'page': 2, 'name': 'bar', 'url': 'xxx'}
# E    {'page': 2, 'name': 'bar', 'url': 'http://xxx'}
# F    {'page': 3, 'name': 'foo', 'url': 'http://xxx'}
# dtype: object

df = df.apply(pd.Series, index=df.iloc[0].keys())

# Here .apply() expands each dictionary into columns by turning it into a Series.
# The index keyword orders the columns by the keys of the first dictionary in df
# (df.iloc[0] is used because df is a Series with string labels at this point).
#
# df
#
#        page name         url
#  A        1  foo         xxx
#  B        1  bar  http://xxx
#  C        3  foo  http://xxx
#  D        2  bar         xxx
#  E        2  bar  http://xxx
#  F        3  foo  http://xxx

It is a bit of a hack, but it lets you interpret the JSON Lines file correctly without writing an explicit loop.
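
To see what stack() and the index reset do without depending on how read_json handles orient together with lines=True, the same steps can be sketched on a small in-memory frame. The sample lines are a shortened, hypothetical version of the question's data. The frame here has one row per line, so the level dropped is the outer one (the line number) rather than level=1 as in the code above.

```python
import json

import pandas as pd

# Shortened stand-in for the question's input (hypothetical sample data).
lines = [
    '{"A": {"page": 1, "name": "foo"}, "B": {"page": 1, "name": "bar"}}',
    '{"D": {"page": 2, "name": "bar"}, "E": {"page": 2, "name": "bar"}}',
]

# One row per line, one column per top-level key; absent keys become NaN.
wide = pd.DataFrame([json.loads(line) for line in lines])

# stack() moves the column labels into an inner index level; dropna() discards
# the NaN cells in case the installed pandas version keeps them.
stacked = wide.stack().dropna()

# Drop the outer level (the line number) so only the keys A, B, D, E remain.
stacked = stacked.droplevel(0)

# Expand each dictionary into its own set of columns.
flat = pd.DataFrame(stacked.tolist(), index=stacked.index)
print(flat)
```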

1 Comment

I tried this now (years after the answer) but it works only with orient="columns" rather than orient="index". Hope someone can confirm this.
3

As you are working with JSON lines,

  1. you need to read the file line by line,
  2. convert each line to a dictionary,
  3. create a dataframe from that dictionary
  4. and append it to a list of dataframes
  5. finally, you can concatenate those dataframes together using pandas concat

And voilà:

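import json

import pandas as pd

line_list = []
with open('sample.json') as f:
    for line in f:
        # Each line is a standalone JSON object: {index -> {column -> value}}.
        a_dict = json.loads(line)
        # DataFrame(a_dict) puts the outer keys in the columns, so transpose.
        df = pd.DataFrame(a_dict).T
        line_list.append(df)

# Combine the per-line frames into a single DataFrame.
df = pd.concat(line_list)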

And here is the desired output:

  name page         url
A  foo    1         xxx
B  bar    1  http://xxx
C  foo    3  http://xxx
D  bar    2         xxx
E  bar    2  http://xxx
F  foo    3  http://xxx
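
A variation on the same idea, offered only as a sketch: merge the dictionary from each line into one mapping and build the frame once with DataFrame.from_dict(..., orient="index"), which avoids the transpose and the concat. It assumes the top-level keys (A, B, ...) are unique across lines, since a repeated key would overwrite the earlier entry. The in-memory sample stands in for the file and is shortened from the question's data.

```python
import io
import json

import pandas as pd

# Stand-in for open('sample.json') so the example is self-contained.
sample = io.StringIO(
    '{"A": {"page": 1, "name": "foo", "url": "xxx"}, '
    '"B": {"page": 1, "name": "bar", "url": "http://xxx"}}\n'
    '{"D": {"page": 2, "name": "bar", "url": "xxx"}}\n'
)

records = {}
for line in sample:
    line = line.strip()
    if not line:
        continue  # tolerate blank lines
    # Assumes keys are unique across lines; duplicates would be overwritten.
    records.update(json.loads(line))

# Outer keys become the index, inner keys become the columns.
df = pd.DataFrame.from_dict(records, orient="index")
print(df)
```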
