I'm working with a file that contains one JSON object per line. Each line looks something like this:
{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},{"value":792,"colName":"d"},{"value":645,"colName":"e"}],"index":"-1417561653"}
The producer of the JSON chose an unnecessarily nested structure where a flat one would have been perfectly sufficient. I'd like to read the data into a Pandas DataFrame with a flat layout instead, with the columns "a", "b", "c", "d", "e" and "index". The best I've come up with so far is to process the file twice in different ways:
import json
import pandas as pd
# pd.json_normalize is the top-level name in pandas 1.0+; json.loads replaces the old pandas.io.json.loads
# First pass: let pandas parse the whole file and keep the flat fields
raw_json = pd.read_json('sample.json', lines=True)
raw_json.set_index('index', inplace=True)
# Second pass: re-read the file line by line to expand the nested list
with open('sample.json') as f:
    lines = f.readlines()
exploded_columns = pd.concat([pd.json_normalize(json.loads(l), 'unnecessaryList', 'index').pivot(index='index', columns='colName', values='value') for l in lines])
data = pd.merge(raw_json[['a', 'b']], exploded_columns, left_index=True, right_index=True)
Is there a way to avoid reading the data twice like this? Does Pandas offer some functionality that could avoid the concat/normalize/pivot/merge junk I came up with?
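For reference, here is a rough single-pass sketch of the flattening I have in mind, done in plain Python rather than with a pandas built-in. The helper names `flatten_record` and `read_flat` are hypothetical, and the sketch assumes every entry of `unnecessaryList` has both a `colName` and a `value` key and that column names never collide with the top-level keys:

```python
import io
import json

import pandas as pd


def flatten_record(record):
    """Return a flat copy of one parsed line (hypothetical helper).

    Top-level keys are kept as-is, and each entry of 'unnecessaryList'
    becomes a top-level key named after its 'colName'.
    """
    flat = {k: v for k, v in record.items() if k != "unnecessaryList"}
    for item in record.get("unnecessaryList", []):
        flat[item["colName"]] = item["value"]
    return flat


def read_flat(lines):
    """Build a DataFrame from an iterable of JSON lines in a single pass."""
    rows = [flatten_record(json.loads(line)) for line in lines if line.strip()]
    return pd.DataFrame(rows).set_index("index")


# In-memory stand-in for sample.json, using the example line from above
sample = io.StringIO(
    '{"a":3,"b":10,"unnecessaryList":[{"value":12,"colName":"c"},'
    '{"value":792,"colName":"d"},{"value":645,"colName":"e"}],'
    '"index":"-1417561653"}\n'
)
data = read_flat(sample)
print(data)
```

With a real file this would be called as `read_flat(open('sample.json'))`, ideally inside a `with` block. Because the lines are parsed with `json.loads`, the index values stay strings exactly as they appear in the file.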