Expand nested data (json, Pandas)

Question

Consider a snippet of a JSON file:

{
    "participant_id": 37,
    "response_date": "2016-05-19T07:19:32.620Z",
    "data": {
        "summary": 8,
        "q6": [
            "1",
            "2"
        ],
        "q1": 0,
        "q2": 1,
        "q3": 1,
        "q4": 2,
        "q5": 2
    }
}, 
{
    "participant_id": 37,
    "response_date": "2016-05-26T07:14:24.7130Z",
    "data": {
        "summary": 8,
        "q6": [
            "1",
            "2",
            "4"
        ],
        "q1": 0,
        "q2": 1,
        "q3": 1,
        "q4": 2,
        "q5": 2
    }
}

which will produce a Pandas data frame:

        0   q1   q2   q3   q4   q5         q6  summary    participant_id           response_date
672   NaN  0.0  1.0  1.0  2.0  2.0     [1, 2]      8.0                37 2016-05-19 07:19:32.620
711   NaN  0.0  1.0  1.0  2.0  2.0  [1, 2, 4]      7.0                37 2016-05-26 07:14:24.713

How can I expand the nested q6 into a 'wider' format? The q6 attribute may contain up to 4 possible values, so ideally the result should be:

            0   q1   q2   q3   q4   q5   q6   q7   q8   q9   summary    participant_id           response_date
    672   NaN  0.0  1.0  1.0  2.0  2.0  1.0  1.0  0.0  0.0       8.0                37 2016-05-19 07:19:32.620
    711   NaN  0.0  1.0  1.0  2.0  2.0  1.0  1.0  0.0  1.0       7.0                37 2016-05-26 07:14:24.713

So, basically, the numbers in the square brackets encode the positions of the 1s in a 4-element array.

Is there a simple Pandasian solution?

EDIT

Some entries are mistakenly reversed or randomly recorded (1st and 3rd rows):

        0   q1   q2   q3   q4   q5      q6  summary    participant_id           response_date
672   NaN  0.0  1.0  1.0  2.0  2.0  [1, 2]      8.0               37 2016-05-19 07:19:32.620
711   NaN  0.0  1.0  1.0  2.0  2.0     [1]      7.0               37 2016-05-20 07:14:24.713
740   NaN  0.0  1.0  1.0  2.0  2.0  [2, 1]      8.0               37 2016-05-21 07:10:17.251
774   NaN  0.0  1.0  1.0  1.0  3.0  [1, 2]      8.0               37 2016-05-22 08:28:14.579
809   NaN  0.0  1.0  1.0  1.0  3.0  [1, 2]      8.0               37 2016-05-23 07:30:27.259

They should be sorted before any further manipulations are performed.

jezrael · Accepted Answer · 2017-04-20 13:51:15Z

1

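I think it is not so easy. The steps are:

DataFrame constructor + get_dummies to build a new df of indicator columns
reindex to add the missing values + rename to name the columns q6..q9
concat to join the new df to the original, with the nested q6 column removed
startswith + reindex (reindex_axis in older pandas) for the new order of columns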

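import pandas as pd
# One indicator column per q6 value; index=df.index keeps the rows aligned, dtype=int keeps 0/1 on pandas 2.x
df1 = pd.get_dummies(pd.DataFrame(df['q6'].values.tolist(), index=df.index), prefix_sep='', prefix='', dtype=int)
df1.columns = df1.columns.astype(int)
# Add the missing values from 1..4 as all-zero columns, then rename 1..4 to q6..q9
df1 = df1.reindex(columns=range(1, 5), fill_value=0).rename(columns=lambda x: 'q{}'.format(x + 5))
print(df1)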
   q6  q7  q8  q9
0   1   1   0   0
1   1   1   0   1

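# Swap the nested q6 column for the new indicator columns
df = pd.concat([df.drop('q6', axis=1), df1], axis=1)
# Split the labels into q* columns and the rest (na=False covers the non-string label 0)
mask = df.columns.str.startswith('q', na=False)
cols1 = df.columns[mask].tolist()
cols2 = df.columns[~mask].tolist()
# First non-q column, then all q columns, then the remaining columns
cols = cols2[:1] + cols1 + cols2[1:]

# reindex_axis was removed in pandas 1.0; reindex(columns=...) replaces it
df = df.reindex(columns=cols)
print(df)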
    0  q1  q2  q3  q4  q5  q6  q7  q8  q9  summary  participant_id  \
0 NaN   0   1   1   2   2   1   1   0   0        8              37   
1 NaN   0   1   1   2   2   1   1   0   1        8              37   

               response_date  
0   2016-05-19T07:19:32.620Z  
1  2016-05-26T07:14:24.7130Z
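
For lists recorded in arbitrary order, as in the question's edit, here is a minimal sketch of an order-independent variant. It is not the method used above: it assumes pandas 0.25 or later (for Series.explode), a unique index, and q6 values between 1 and 4. The sample rows and the names `exploded`, `wide` and `result` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: same values, recorded in different orders
df = pd.DataFrame({
    "participant_id": [37, 37, 37],
    "q6": [["1", "2"], ["2", "1"], ["1", "4", "2"]],
})

# One row per list element; the original index label is repeated.
# dropna() discards the NaN that explode() emits for an empty list.
exploded = df["q6"].explode().dropna().astype(int)

wide = (
    pd.get_dummies(exploded, dtype=int)   # one indicator column per value seen
    .groupby(level=0).max()               # collapse back to one row per record
    .reindex(index=df.index, columns=range(1, 5), fill_value=0)
    .rename(columns=lambda x: "q{}".format(x + 5))
)

result = pd.concat([df.drop(columns="q6"), wide], axis=1)
print(result)
```

Because each element becomes its own row before encoding, the position of a value inside its list no longer matters, so no sorting step is needed.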

edited Apr 20, 2017 at 13:51

answered Apr 20, 2017 at 13:45

jezrael


9 Comments

Arnold Klein Over a year ago

thanks! I have several typos in my database, some entries are in reverse order, for example, instead of [1,2], they are [2,1], or [1,4,2]. Is there a simple solution to sort them?

jezrael Over a year ago

I think sort_index would work - df1.reindex(columns=range(1,5),fill_value=0).sort_index(axis=1).rename(columns=lambda x: 'q{}'.format(x+5))

Arnold Klein Over a year ago

it doesn't work. Please see the updated question. Thanks.

jezrael Over a year ago

I checked, and I think sort_index is not necessary, because reindex already 'sorts' the columns: df1 = df1.reindex(columns=range(1,5), fill_value=0) puts them in the order 1, 2, 3, 4. So if your values are different, you may need to change range(1,5).

jezrael Over a year ago

Hmm, there may be one problem: does df1 = pd.get_dummies(pd.DataFrame(df['q6'].values.tolist()), prefix_sep='', prefix='') return unique column names?
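
The concern in the last comment can be checked on a small made-up sample. The sketch below suggests that the column names are not unique whenever the same value sits at different positions in different lists, and shows one possible way to collapse the duplicates before calling reindex (which does not accept duplicate labels). The names `q6`, `dummies` and `collapsed` are illustrative only.

```python
import pandas as pd

# Hypothetical sample: the second list is reversed
q6 = pd.Series([["1", "2"], ["2", "1"]])

dummies = pd.get_dummies(
    pd.DataFrame(q6.values.tolist()), prefix_sep="", prefix="", dtype=int
)
dummies.columns = dummies.columns.astype(int)
print(dummies.columns.tolist())   # labels repeat across list positions
print(dummies.columns.is_unique)

# Collapse columns that share a label: a value is present in a row
# if any of its duplicate columns holds a 1.
collapsed = dummies.T.groupby(level=0).max().T
print(collapsed)
```

After collapsing, the labels are unique again, so the reindex and rename steps from the answer can be applied to `collapsed`.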


dylan_fan · Accepted Answer · 2017-04-20 13:34:49Z

1

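import numpy as np
import pandas as pd

df = pd.DataFrame({"q6": [[1, 2], [1, 2, 4]]})
# Turn each list of 1-based values into 0-based positions (astype(int) also accepts "1", "2", ...)
positions = [np.asarray(x).astype(int) - 1 for x in df["q6"]]
# One row per record, one column per possible value; set the listed positions to 1
b = np.zeros((len(positions), 4))
for i, pos in enumerate(positions):
    b[i, pos] = 1
# Build the frame from the 2-D array directly; newer pandas rejects the (n, 1) pieces from np.hsplit
df1 = pd.DataFrame(b, columns=["q6", "q7", "q8", "q9"], index=df.index)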

At the end I create another data frame with your 4 desired columns.
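
A loop-free sketch of the same idea, which also joins the new columns back onto the original frame. It assumes every q6 value lies between 1 and 4; the sample frame and the names `rows`, `cols` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one extra column and an unsorted list
df = pd.DataFrame({"q1": [0, 0], "q6": [[1, 2], [4, 2, 1]]})

# Row number of every list element, and its 0-based column position
lengths = df["q6"].str.len().to_numpy()
rows = np.repeat(np.arange(len(df)), lengths)
cols = np.concatenate(df["q6"].tolist()).astype(int) - 1

# Fill the indicator matrix in one fancy-indexing assignment
b = np.zeros((len(df), 4), dtype=int)
b[rows, cols] = 1

df1 = pd.DataFrame(b, columns=["q6", "q7", "q8", "q9"], index=df.index)
result = df.drop(columns="q6").join(df1)
print(result)
```

Since positions are assigned by value, the order of the elements inside each list has no effect on the result.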

answered Apr 20, 2017 at 13:34

dylan_fan


Nilanjan · Accepted Answer · 2017-04-20 13:51:16Z

    >>> df1
       participant_id  q1  q2  q3  q4  q5         q6              response_date  summary
    0              37   0   1   1   2   2     [1, 2]   2016-05-19T07:19:32.620Z        8
    1              37   0   1   1   2   2  [1, 2, 4]  2016-05-26T07:14:24.7130Z        8
    >>> def widen(oldQ6):
    ...     l = [0.0, 0.0, 0.0, 0.0]
    ...     for i in oldQ6:
    ...         l[int(i)-1] = 1.0
    ...     return l
    ... 
    >>> df1['q6'], df1['q7'], df1['q8'], df1['q9'] = zip(*df1['q6'].map(widen))
    >>> df1
       participant_id  q1  q2  q3  q4  q5  q6              response_date  summary  q7  q8  q9
    0              37   0   1   1   2   2   1   2016-05-19T07:19:32.620Z        8  1   0   0
    1              37   0   1   1   2   2   1  2016-05-26T07:14:24.7130Z        8  1   0   1
