0

I have a dataframe that contains json style dictionaries per row, and I need to extract the fourth and eigth fields, or values from the second and fourth pairs to a new column i.e. 'a' for row 1, 'a' for 2 (corresponding to the Group) and '10786' for row 1, '38971' for row 2 (corresponding to Code). The expected output is below.

dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                          '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

expected output

                                                   a Group   Code
0  {"Number":1,"Group":"a","First":"Yes","Second"...     a  10786
1  {"Number":2,"Group":"a","First":"Yes","Second"...     a  38971

I have tried indexing by location but its printing only characters rather than fields e.g.

tuple(dat['a'].items())[0][1][4]

I also cannot appear to normalize the data with json_normalize, which I'm not sure why - perhaps the json string is stored suboptimally. So I am quite confused, and any tips would be great. Thanks so much

0

1 Answer 1

1

The reason pd.json_normalize is not working is because pd.json_normalize works on dictionaries and your strings are not json compatible.

Your strings are not json compatible because the "Labs" values contain multiple quotes which aren't escaped. It's possible to write a quick function to make your strings json compatible, then parse them as dictionaries, and finally use pd.json_normalize.

import pandas as pd
import json
import re

jsonify = lambda x: re.sub(pattern='"Labs":"(.*)"', repl='"Labs":\g<1>', string=x) # Function to remove unnecessary quotes

dat = pd.DataFrame({ 'col1': ['{"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"}',
                          '{"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"}']})

json_series = dat['col1'].apply(jsonify) # Remove unnecessary quotes
json_series = json_series.apply(json.loads) # Convert string to dictionary
output = pd.json_normalize(json_series) # "Flatten" the dictionary into columns
output.insert(loc=0, column='a', value=dat['col1']) # Add the original column as a column named "a" because that's what the question requests.
a Number Group First Code Desc Labs
0 {"Number":1,"Group":"a","First":"Yes","Code":"10786","Desc":true,"Labs":"["a","b","c"]"} 1 a Yes 10786 True ['a', 'b', 'c']
1 {"Number":2,"Group":"a","First":"Yes","Code":"38971","Desc":true,"Labs":"["a","b","c"]"} 2 a Yes 38971 True ['a', 'b', 'c']
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.