The following string is a typical example of the JSON-like input strings that I need to convert to a pandas DataFrame. My attempted workflow is to:

  1. split String into List (see String below, note this represents an individual row)
  2. Convert each list to a dictionary
  3. Convert dictionary to a pd.DataFrame
  4. Merge DataFrames together

Input String: (Representing one row of Data)

"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]

Desired Output Dictionary:

{'PN_#':9999,
'Item':"Pear, Large",
'Vendor':"Farm",
'Class':"Food,Fruit",
'Sales Group':59,
'Vendor ID (from Vendor)':78}

Attempt: I have been using re.split for this. It works for most fields, but items such as "Class":["Food","Fruit"] and "Item":"Pear, Large" are proving difficult to handle.

This regex solves the issues of the latter case, however it obviously does not work for the former:

re.split("(?=[\S]),(?=[\S])",data)
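To make the failure concrete, here is a minimal sketch that runs the split above on the sample row. The variable name data and the raw-string prefix on the pattern are assumptions added for illustration; the pattern itself is unchanged.

```python
import re

# Sample row from the question (one row of data, without outer braces).
data = '"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]'

# Split on every comma that is immediately followed by a non-whitespace character.
parts = re.split(r"(?=[\S]),(?=[\S])", data)

for part in parts:
    print(part)
```

The comma inside "Pear, Large" is followed by a space, so that field survives intact. The comma inside ["Food","Fruit"] is not, so the "Class" field is cut into two pieces.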

I have tried a multitude of expressions to completely satisfy my requirements. The following expression is generally representative of what I have attempted unsuccessfully:

regex.split("(?!\[.+?\s),(?=[\S])(?!.+?\])", data)

Any suggestions or solutions for how to accomplish this, or advice if I am going about this the wrong way?

  • That's not quite valid JSON, the [] are needed around a list: "Item":"Pear, Large" unlike "Class":["Food","Fruit"] Commented Mar 11, 2022 at 20:07

1 Answer

Your string is valid JSON apart from the missing outer braces. Add the braces and use json.loads to parse it into a Python dictionary.

Next, iterate over the dictionary, and whenever the value for the current key is a list, join its items into a single string (casting non-string items to str first):

import json
s='"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]'
js = json.loads(f'{{{s}}}')
for key in js:
    if isinstance(js[key], list): # is it a list?
        if all(isinstance(x, str) for x in js[key]): # is it a list of strings?
            js[key] = ",".join(js[key])
        else:
            js[key] = ",".join(map(str, js[key]))
print(js)

Output:

{'PN_#': 9999, 'Item': 'Pear, Large', 'Vendor': 'Farm', 'Class': 'Food,Fruit', 'Sales Group': '59', 'Vendor ID (from Vendor)': '78'}
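Note that in the output above the single-item list [78] becomes the string '78', whereas the desired output in the question shows the number 78. If that matters, one possible variation is sketched below. It assumes that a one-element list should be unwrapped to its only item, which is a guess at the intent rather than something the question states. The helper name parse_row is hypothetical.

```python
import json

s = '"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]'

def parse_row(text):
    """Hypothetical helper: parse one brace-less row string into a flat dict."""
    row = json.loads("{" + text + "}")
    for key, value in row.items():
        if isinstance(value, list):
            if len(value) == 1:
                # Assumption: unwrap one-element lists and keep the item's type.
                row[key] = value[0]
            else:
                # Otherwise join the items, casting each to str first.
                row[key] = ",".join(map(str, value))
    return row

row = parse_row(s)
print(row)
```

"Sales Group" is still the string '59' here, because it is a quoted string in the input. Converting it to a number would need an explicit int() call or a later conversion step.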

See the online Python demo.
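The question's end goal is a pandas DataFrame. A minimal sketch of that last step follows, assuming each input string is one row and all rows share the same keys. The second row is made-up sample data, and to_record is a hypothetical helper name. Passing a list of dictionaries to pd.DataFrame builds the table in one go, so there is no need to create and merge one DataFrame per row.

```python
import json
import pandas as pd

raw_rows = [
    '"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]',
    # Made-up second row, for illustration only.
    '"PN_#":1234,"Item":"Apple, Small","Vendor":["Orchard"],"Class":["Food","Fruit"],"Sales Group":"12","Vendor ID (from Vendor)":[42]',
]

def to_record(text):
    """Hypothetical helper: parse a row string and join list values into strings."""
    record = json.loads("{" + text + "}")
    for key, value in record.items():
        if isinstance(value, list):
            record[key] = ",".join(map(str, value))
    return record

# One dictionary per row; the keys become the columns.
df = pd.DataFrame([to_record(text) for text in raw_rows])

# Optional: turn the quoted "Sales Group" values into numbers.
df["Sales Group"] = pd.to_numeric(df["Sales Group"])

print(df)
```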

6 Comments

Thank you for the quick response. This nearly works, the only issue is that there are two values for the key 'Class' not one.
@MountainFish You can join the string values. See the updated answer.
I think I almost have it. The following joins everything as needed except a list of numbers: for key,value in temp_row.items(): try: if isinstance(temp_row[key], list): temp_row[key] = ','.join(value) except: pass. Sorry, I'm new to posting and am having trouble including code in a comment.
@MountainFish See the updated answer. It will now cast to str all the lists where the items are not all strings.
Thanks for your help, I really appreciate it, and I have now upvoted your answer.