Issues with regex to match JSON-like string with optionally missing [] brackets around lists

Question

The following string is a typical example of the JSON-like input strings that I need to convert to a pandas DataFrame. My attempted workflow is to:

1. Split each string into a list (see the string below; note that it represents an individual row)
2. Convert each list to a dictionary
3. Convert each dictionary to a pd.DataFrame
4. Merge the DataFrames together

Input string (representing one row of data):

"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]

Desired output (a dictionary):

{'PN_#':9999,
'Item':"Pear, Large",
'Vendor':"Farm",
'Class':"Food,Fruit",
'Sales Group':59,
'Vendor ID (from Vendor)':78}

Attempt: I have been using re.split for this. For most cases this is not an issue; however, items such as "Class":["Food","Fruit"] and "Item":"Pear, Large" are proving challenging to handle.

This regex handles the latter case, but it does not work for the former:

re.split("(?=[\S]),(?=[\S])",data)
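
To make the failure concrete, here is a minimal sketch that runs this split on the sample row above. The variable names data and parts are only illustrative.

```python
import re

data = '"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]'

# Split on every comma that is immediately followed by a non-whitespace character
parts = re.split(r"(?=[\S]),(?=[\S])", data)

for part in parts:
    print(part)
```

The comma in "Pear, Large" survives because it is followed by a space, but "Class":["Food","Fruit"] is cut into "Class":["Food" and "Fruit"], giving seven pieces instead of six.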

I have tried a multitude of expressions to completely satisfy my requirements. The following expression is generally representative of what I have attempted unsuccessfully:

regex.split("(?!\[.+?\s),(?=[\S])(?!.+?\])", data)

Any suggestions for how to accomplish this, or pointers if I am going about this the wrong way?

That's not quite valid JSON; the [] are needed around a list: "Item":"Pear, Large" unlike "Class":["Food","Fruit"]
– smci, Commented Mar 11, 2022 at 20:07

Wiktor Stribiżew · Accepted Answer · 2021-06-09 19:45:50Z

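Your string is valid JSON once it is wrapped in braces: it is the body of a JSON object without the enclosing {}. Add the braces and use json.loads to parse it into a Python dict.

Next, iterate over the dict, and if the value for the current key is a list, join its items into a comma-separated string (casting non-string items to str first):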

import json

s = '"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]'
# Wrap the fragment in braces so it parses as a JSON object
js = json.loads(f'{{{s}}}')
for key in js:
    if isinstance(js[key], list):  # is it a list?
        if all(isinstance(x, str) for x in js[key]):  # is it a list of strings?
            js[key] = ",".join(js[key])
        else:  # otherwise cast each item to str before joining
            js[key] = ",".join(map(str, js[key]))
print(js)

Output:

{'PN_#': 9999, 'Item': 'Pear, Large', 'Vendor': 'Farm', 'Class': 'Food,Fruit', 'Sales Group': '59', 'Vendor ID (from Vendor)': '78'}
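
Note that this output leaves 'Sales Group' and 'Vendor ID (from Vendor)' as the strings '59' and '78', whereas the desired output in the question shows them as integers. The following is a sketch, not part of the original answer, of one way to get closer to that target and to handle several rows at once. The helper name parse_row and the second sample row are hypothetical, and treating every all-digit string as an integer is an assumption that may not suit real data (for example, IDs with leading zeros).

```python
import json


def parse_row(row: str) -> dict:
    """Parse one brace-less JSON fragment into a flat dict (hypothetical helper)."""
    flat = {}
    for key, value in json.loads("{" + row + "}").items():
        if isinstance(value, list):
            # Unwrap single-item lists so [78] stays the int 78;
            # join longer lists into one comma-separated string
            value = value[0] if len(value) == 1 else ",".join(map(str, value))
        if isinstance(value, str) and value.isdigit():
            # Assumption: all-digit strings such as "59" are meant to be integers
            value = int(value)
        flat[key] = value
    return flat


rows = [
    '"PN_#":9999,"Item":"Pear, Large","Vendor":["Farm"],"Class":["Food","Fruit"],"Sales Group":"59","Vendor ID (from Vendor)":[78]',
    # Made-up second row, only to show that several rows are handled the same way
    '"PN_#":1000,"Item":"Apple","Vendor":["Orchard"],"Class":["Food"],"Sales Group":"12","Vendor ID (from Vendor)":[5]',
]
records = [parse_row(r) for r in rows]
print(records[0])
```

A list of dicts like records can be passed directly to pd.DataFrame(records), which avoids building and merging one DataFrame per row.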

edited Jun 9, 2021 at 19:45

answered Jun 9, 2021 at 17:20

Wiktor Stribiżew

6 Comments

MountainFish Over a year ago

Thank you for the quick response. This nearly works; the only issue is that there are two values for the key 'Class', not one.

Wiktor Stribiżew Over a year ago

@MountainFish You can join the string values. See the updated answer.

MountainFish Over a year ago

I think I almost have it. The following joins everything as needed except a list of numbers:

for key, value in temp_row.items():
    try:
        if isinstance(temp_row[key], list):
            temp_row[key] = ','.join(value)
    except:
        pass

sorry I'm new to posting and experiencing issues including code in a comment

Wiktor Stribiżew Over a year ago

@MountainFish See the updated answer. It will now cast to str all the lists where the items are not all strings.
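
For readers following the thread, a small sketch of why the loop in the earlier comment skipped lists of numbers: str.join accepts only strings, so a list such as [78] raises TypeError, which the bare except then silently swallowed. Casting each item with str first avoids this. The variable name values is illustrative.

```python
values = [78]
error = None

try:
    ",".join(values)  # str.join rejects non-string items
except TypeError as exc:
    error = exc
    print(f"join failed: {exc}")

# Casting each item to str first works for numbers and strings alike
joined = ",".join(map(str, values))
print(joined)
```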

MountainFish Over a year ago

Thanks for your help, I really appreciate it and I have now upvoted your answer.
