
I have a DataFrame structured as follows (it is the result of JSON normalization):

mydf

id    colA    colB    ...    colArray
foo   a1      b1             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
bar   a2      b2             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
fooz  a3      b3             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
barz  a4      b4             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
  • date values are timestamps
  • the arrays in colArray have a different length on each line, but all elements share the exact same structure
  • ['id', 'colA', 'colB'] is an example of the columns I would like to use as a unique index

I would like to convert this data in order to use it as time series. My desired output would be something like:

id    colA    colB    ...    date               data1    data2 ... data n
foo   a1      b1             '1st timestamp'   'flex'   0.1
foo   a1      b1             '...'   
...
foo   a1      b1             'last_timestamp'   
bar   a2      b2             '1st timestamp'   'zorg'
bar   a2      b2             '...'   
...   
bar   a2      b2             'last_timestamp'   
fooz  a3      b3             '...'   
fooz  a3      b3             '...'   
...
fooz  a3      b3             '...'   
etc.

That would allow me to plot/analyze time series based on tuples such as [foo, a1, b1].

To me, this looks very similar to Flatten nested pandas dataframe, but the accepted answer there is frustrating: the JSON/dict data is not actually processed to produce a DataFrame with the correct data.


Does anyone have any advice on how to achieve this?


First approach

I used the following, which is close to what I want:

import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; in older versions: from pandas.io.json import json_normalize

tmpdf = pd.DataFrame(mydf['colArray'].tolist())  # one numbered column per array element position
json_normalize(tmpdf[0])  # only flattens column 0, i.e. the first element of each array

But there are two issues:

  1. I lose the ['id', 'colA', 'colB'] tuple that I would like to use as a unique identifier.
  2. I need to repeat the operation for each column of my tmpdf, one per array element position (a brute-force workaround is sketched below).
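
The brute-force workaround (my own sketch, not taken from any of the linked answers) would be to flatten each row's array separately and copy the key columns back in; it keeps everything, but scales poorly:

import pandas as pd

pieces = []
for _, row in mydf.iterrows():
    part = pd.DataFrame(row['colArray'])  # flatten this row's array of dicts
    for col in ['id', 'colA', 'colB']:    # re-attach the identifying tuple
        part[col] = row[col]
    pieces.append(part)
flat = pd.concat(pieces, ignore_index=True)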

Second method

Based on Accessing nested JSON data as dataframes in Pandas:

pd.concat(pd.DataFrame.from_dict(arr) for arr in mydf['colArray'])

It gives me a DataFrame with all my arrays flattened and the correct column names, but I lose the corresponding keys (['id', 'colA', 'colB']). I feel this is the correct approach, but I can't figure out how to keep the indexing columns (so that I can filter each resulting time series by them).

Too bad there is no "json_melt" function.

Third method

Based on this question: Flatten nested pandas dataframe. I can preserve my indexing columns, but the array elements stay as raw JSON dicts, spread over columns indexed [0, 1, 2, ...]. I would also have trouble dealing with the variable array lengths (lots of NA in the higher column indexes); a sketch of what I mean follows.
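
For illustration, here is my reconstruction of that third approach, assuming it expands the column with apply(pd.Series):

wide = pd.concat([mydf.drop(columns='colArray'),
                  mydf['colArray'].apply(pd.Series)], axis=1)
# columns: id, colA, colB, 0, 1, 2, ...
# the numbered columns still hold raw dicts, padded with NaN for shorter arrays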


Bibliography:

  • Create a Pandas DataFrame from deeply nested JSON: but the solution is based on processing the original JSON, whereas I would like to do this on an existing DataFrame.
  • Accessing nested JSON data as dataframes in Pandas: this is pretty close to what I want.
  • Flatten nested pandas dataframe: the result looks like my first attempt, but the underlying JSON data is not really "matrixed" into the dataframe.
  • A rather complex and not satisfying approach.

EDIT: This question is the same, but at the time of asking I could not find it via search. Kept here for future reference.

1 Answer


Use a dictionary comprehension with pop to extract the original column, and concat to create a MultiIndex:

df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})

An alternative is to use the keys parameter:

df = pd.concat([pd.DataFrame(array) for array in mydf.pop('colArray')], keys=mydf.index)

Then remove the second index level, so that a join with the original DataFrame is possible:

df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)

Sample:

mydf = pd.DataFrame({
    'id': ['foo', 'bar', 'fooz', 'barz'],
    'colA': ['a1', 'a2', 'a3', 'a4'],
    'colB': ['b1', 'b2', 'b3', 'b4'],
    'colArray': [
        [{'date': 's', 'data1': 't', 'data2': 0.1},
         {'date': 'd', 'data1': 'r', 'data2': 0.8}],
        [{'date': 'd', 'data1': 'y', 'data2': 0.1}],
        [{'date': 'g', 'data1': 'u', 'data2': 0.1}],
        [{'date': 'h', 'data1': 'i', 'data2': 0.1}],
    ],
})
print(mydf)
     id colA colB                                           colArray
0   foo   a1   b1  [{'date': 's', 'data1': 't', 'data2': 0.1}, {'...
1   bar   a2   b2        [{'date': 'd', 'data1': 'y', 'data2': 0.1}]
2  fooz   a3   b3        [{'date': 'g', 'data1': 'u', 'data2': 0.1}]
3  barz   a4   b4        [{'date': 'h', 'data1': 'i', 'data2': 0.1}]

df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})
print(df)
    data1  data2 date
0 0     t    0.1    s
  1     r    0.8    d
1 0     y    0.1    d
2 0     u    0.1    g
3 0     i    0.1    h

df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)
print(df)
  data1  data2 date    id colA colB
0     t    0.1    s   foo   a1   b1
1     r    0.8    d   foo   a1   b1
2     y    0.1    d   bar   a2   b2
3     u    0.1    g  fooz   a3   b3
4     i    0.1    h  barz   a4   b4
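
From there, a possible follow-up (my addition, not part of the original answer) is to select one time series by its key tuple, ready to plot or analyze:

mask = (df['id'] == 'foo') & (df['colA'] == 'a1') & (df['colB'] == 'b1')
ts = df.loc[mask].set_index('date')['data2']  # a single time series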

5 Comments

Thanks for your quick answer, it works like a charm! To be honest, I don't really get the magic that concat + the keys=... parameter do. I mean, I get that pop returns the removed column as a Series with the index preserved. Then the concat part does the magic (quoting the concat doc: "If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected"). If I had to say it in my own words: both [comprehension + keys=] and [.pop().items()] feed concat the right data for the magic to happen?
Erf, I had all the intel needed at hand (for the first part), but couldn't figure it out from the documentation... I get the reset_index mechanic, but I would have searched for a long time to find it.
I read the pandas documentation again (pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, pandas.pydata.org/pandas-docs/stable/reference/api/…), and I still don't find the keys= argument's behavior very clearly explained. Anyway, I'm probably missing it.
@LoneWanderer I think this Q/A explains it best.
Damn it, stackoverflow.com/questions/22676081/… and stackoverflow.com/questions/49671693/… already cover the issue... I can't believe I did not find them! I'll add them to the biblio.
