Consider a snippet of a JSON file:
{
"participant_id": 37,
"response_date": "2016-05-19T07:19:32.620Z",
"data": {
"summary": 8,
"q6": [
"1",
"2"
],
"q1": 0,
"q2": 1,
"q3": 1,
"q4": 2,
"q5": 2
}
},
{
"participant_id": 37,
"response_date": "2016-05-26T07:14:24.7130Z",
"data": {
"summary": 8,
"q6": [
"1",
"2",
"4"
],
"q1": 0,
"q2": 1,
"q3": 1,
"q4": 2,
"q5": 2
}
}
which produces a pandas DataFrame:
       0   q1   q2   q3   q4   q5         q6  summary  participant_id            response_date
672  NaN  0.0  1.0  1.0  2.0  2.0     [1, 2]      8.0              37  2016-05-19 07:19:32.620
711  NaN  0.0  1.0  1.0  2.0  2.0  [1, 2, 4]      7.0              37  2016-05-26 07:14:24.713
How can I expand the nested q6 column into a 'wider' format? The q6 attribute may contain up to 4 possible values, so ideally the result should be:
       0   q1   q2   q3   q4   q5   q6   q7   q8   q9  summary  participant_id            response_date
672  NaN  0.0  1.0  1.0  2.0  2.0  1.0  1.0  0.0  0.0      8.0              37  2016-05-19 07:19:32.620
711  NaN  0.0  1.0  1.0  2.0  2.0  1.0  1.0  0.0  1.0      7.0              37  2016-05-26 07:14:24.713
In other words, the numbers in the square brackets give the positions of the 1s in a 4-element array.
Is there a simple, idiomatic pandas solution?
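To make the desired transformation concrete, here is a minimal sketch of one possible approach. It assumes every row holds a list of digit strings drawn from 1 to 4, and it uses a toy frame containing only q6 and summary. The names positions, new_cols and to_indicators are illustrative, not part of pandas.

```python
import pandas as pd

# Toy frame mirroring the q6 column above (values are strings, as in the JSON).
df = pd.DataFrame(
    {"q6": [["1", "2"], ["1", "2", "4"]], "summary": [8.0, 7.0]},
    index=[672, 711],
)

positions = [1, 2, 3, 4]             # assumption: the four possible q6 values
new_cols = ["q6", "q7", "q8", "q9"]  # column names taken from the desired output


def to_indicators(values):
    """Return a 4-element list with 1.0 at each position listed in values."""
    present = {int(v) for v in values}
    return [1.0 if p in present else 0.0 for p in positions]


# Build one indicator row per original row, keeping the original index.
indicators = pd.DataFrame(
    [to_indicators(v) for v in df["q6"]],
    index=df.index,
    columns=new_cols,
)

# Replace the nested q6 column with the four indicator columns.
wide = df.drop(columns="q6").join(indicators)
print(wide)
```

Rows where q6 is missing (NaN rather than a list) are not handled here and would need a guard in to_indicators.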
EDIT
Some entries were recorded in reversed or arbitrary order (compare the 1st and 3rd rows):
       0   q1   q2   q3   q4   q5      q6  summary  participant_id            response_date
672  NaN  0.0  1.0  1.0  2.0  2.0  [1, 2]      8.0              37  2016-05-19 07:19:32.620
711  NaN  0.0  1.0  1.0  2.0  2.0     [1]      7.0              37  2016-05-20 07:14:24.713
740  NaN  0.0  1.0  1.0  2.0  2.0  [2, 1]      8.0              37  2016-05-21 07:10:17.251
774  NaN  0.0  1.0  1.0  1.0  3.0  [1, 2]      8.0              37  2016-05-22 08:28:14.579
809  NaN  0.0  1.0  1.0  1.0  3.0  [1, 2]      8.0              37  2016-05-23 07:30:27.259
The q6 lists should be sorted before any further manipulation is performed.
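As a rough illustration of that sorting step, the sketch below normalizes each list in place. It assumes the entries are digit strings (as in the JSON) and sorts them numerically. The toy frame contains only the q6 column.

```python
import pandas as pd

# Toy frame with the first three q6 values from the table above.
df = pd.DataFrame(
    {"q6": [["1", "2"], ["1"], ["2", "1"]]},
    index=[672, 711, 740],
)

# Sort each list numerically; key=int assumes every entry is a digit string.
df["q6"] = df["q6"].apply(lambda values: sorted(values, key=int))
print(df)
```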