I'm working with a file containing one JSON object per line. Each line looks something like this:

{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}

The producer of the JSON chose a nested structure where a flat one would have been perfectly sufficient. I'd like to read the data into a pandas DataFrame with a flat layout whose columns are "a", "b", "c", "d", "e", and "index". The best I've come up with so far is to process the file twice, in two different ways:

import json

import pandas as pd

raw_json = pd.read_json('sample.json', lines=True)
raw_json.set_index('index', inplace=True)

with open('sample.json') as f:
    lines = f.readlines()
    exploded_columns = pd.concat([pd.json_normalize(json.loads(l), 'unnecessaryList', 'index').pivot(index='index', columns='colName', values='value') for l in lines])

data = pd.merge(raw_json[['a', 'b']], exploded_columns, left_index=True, right_index=True)

Is there a way to avoid reading the data twice like this? Does Pandas offer some functionality that could avoid the concat/normalize/pivot/merge junk I came up with?

1 Answer

You can read the file once, keep the scalar columns in a MultiIndex, and expand each list into its own row:

df = pd.read_json('sample.json', lines=True)
# build a MultiIndex from the three scalar columns and keep unnecessaryList as a Series
s = df.set_index(['a','b','index'])['unnecessaryList']
print (s)
a  b   index      
3  10  -1417561653    [{'value': 12, 'colName': 'c'}, {'value': 792,...
       -1417561655    [{'value': 13, 'colName': 'c'}, {'value': 794,...
       -1417561658    [{'value': 14, 'colName': 'c'}, {'value': 795,...
Name: unnecessaryList, dtype: object

# build one Series per row (colName -> value), concat them as columns, then transpose
L = [pd.DataFrame(x).set_index('colName')['value'] for x in s]
df = (pd.concat(L, axis=1, keys=s.index)
       .T
       .reset_index(level=[0,1])
       .rename(columns={'level_0':'a','level_1':'b'})
       .rename_axis(None, axis=1))
print (df)
             a   b   c    d    e
-1417561653  3  10  12  792  645
-1417561655  3  10  13  794  645
-1417561658  3  10  14  795  645
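
A possible variant of the same idea, sketched here under the assumption that pandas 1.1 or later is available (for `pd.json_normalize` and list-valued `index` in `pivot`): parse each line once, let `json_normalize` explode the list into long form while carrying "a", "b" and "index" along as metadata, then pivot back to wide form. The names `SAMPLE_LINES`, `records`, `long_df` and `wide` are only illustrative, and the inline string stands in for sample.json.

```python
import io
import json

import pandas as pd

# Inline stand-in for sample.json so the sketch is self-contained.
SAMPLE_LINES = io.StringIO(
    '{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}\n'
    '{"a":3,"b":10,"unnecessaryList":[{"value":13,"colName":"c"},{"value":794,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561655"}\n'
    '{"a":3,"b":10,"unnecessaryList":[{"value":14,"colName":"c"},{"value":795,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561658"}\n'
)

# Parse every non-empty line exactly once.
records = [json.loads(line) for line in SAMPLE_LINES if line.strip()]

# Long form: one row per list entry, with a, b and index repeated as metadata.
long_df = pd.json_normalize(
    records, record_path="unnecessaryList", meta=["a", "b", "index"]
)

# Wide form: one row per original line, one column per colName.
wide = (
    long_df.pivot(index=["index", "a", "b"], columns="colName", values="value")
    .reset_index(level=["a", "b"])
    .rename_axis(None, axis=1)
)
print(wide)
```

Note that the metadata columns come back with object dtype and the index stays a string here, so a cast may be needed afterwards depending on how the data is used.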

Input data (sample.json):

{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}
{"a":3,"b":10,"unnecessaryList":[{"value":13,"colName":"c"},{"value":794,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561655"}
{"a":3,"b":10,"unnecessaryList":[{"value":14,"colName":"c"},{"value":795,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561658"}