0

I have a CSV file which contains arbitrary JSON objects. Here's a simplified version of the file:

v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}},lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,{"scheduleIds":[]},lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"},lastcolumn

The commas (if any) inside the JSON wreak havok with CSV parsing.

I'm looking for a way to either...

...capture and replace all the commas outside the JSON objects with pipes (|) so I can simply key on those:

v1|2020-06-09T22:44:46.377Z|cb6deb64-d6a0-4151-ba9b-bfa54ae75180|{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff"**,**"permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}|lastcolumn
v2|2020-06-09T22:44:47.377Z|50769c0d-0a05-4028-9f0b-40ab570af31a|{"scheduleIds":[]}|lastcolumn
v3|2020-06-09T22:44:48.377Z|12345678-0a05-4028-9f0b-40ab570af31a|{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}|lastcolumn

...or wrap each JSON object with single quotes:

v1,2020-06-09T22:44:46.377Z,cb6deb64-d6a0-4151-ba9b-bfa54ae75180,'{"payload":{"assetId":"a3c2a944-d554-44bb-90a4-b7beafbc6bff","permissionsToParty":[{"partyType":1,"partyId":"74457bd4-c2ab-4760-942b-d6c623a97f19","permissions":["CREATE","DELETE","DOWNLOAD","EDIT","VIEW"]}]}}',lastcolumn
v2,2020-06-09T22:44:47.377Z,50769c0d-0a05-4028-9f0b-40ab570af31a,'{"scheduleIds":[]}',lastcolumn
v3,2020-06-09T22:44:48.377Z,12345678-0a05-4028-9f0b-40ab570af31a,'{"jobId":"4dfeb16d-f9d6-4480-9b84-60c5af0bd3ce","result":"success","status":"completed"}',lastcolumn

Alas, my regex kung-fu is too weak to create something flexible enough based on the arbitrary nature of the JSON objects that may show up.

The closest I've gotten is:

(?!\B{[^}]*),(?![^{]*}\B)

Which still captures commas (the comma directly before "permissionsToParty", below) in an object like this:

{"payload":{"assetId":"710728f9-7c13-4bcb-8b5d-ef347afe0b58","permissionsToParty":[{"partyType":0,"partyId":"32435a92-c7b3-4fc0-b722-2e88e9e839e5","permissions":["CREATE","DOWNLOAD","VIEW"]}]}}

Can anyone simplify what I've done thus far and help me with an expression that ignores ALL commas within the outermost {} symbols of the JSON?

5
  • 1
    Does this help in order to convert the json part to quotes. Commented Jun 10, 2020 at 2:28
  • 2
    I don't follow. If it's a valid CSV file can't you read it with CSV tools and then save it with a pipe as the column separator? Commented Jun 10, 2020 at 3:34
  • 1
    This might help to solve your comma problem. Commented Jun 10, 2020 at 3:36
  • @Mandy8055 BOTH of these will solve the problem. Thank you! Feel free to post either and I'll mar as answer. Commented Jun 10, 2020 at 16:37
  • @CarySwoveland That's the challenge, based on whether the JSON object contains 0 or more commas, when the CSV gets parsed you get an uneven number of "columns" back on a per-row basis. I need to pre-process this sucker to make sure the arbitrary number of commas that show up inside the JSON doesn't blow up the CSV parsing process. Commented Jun 10, 2020 at 16:40

2 Answers 2

1

Your regex is quite close to your expectations. In order to get rid of the expected comma you may try the below regex:

(?!\B{[^}]*|"),(?![^{]*}\B|")
           ^^             ^^
           Changed this part

You can find the demo of the above regex in here.

But you can use this second regex below to put the json string to quotes. This is more efficient as compared to the above regex and I'll recommend that you use this.

(\{.*\})

Explanation of the above regex:

() - Represents capturing group.

\{ - Matches the curly brace { literally.

.* - Matches everything inside of the curly braces greedily in order to capture all the nested braces.

'$1' - You can replace the given test string with 1st captured group $1 and put quotes outside of it.

You can find the demo of the above regex in here.

Pictorial Representation of the first regex:

Pictorial

Sign up to request clarification or add additional context in comments.

Comments

1

Remember that you can do alot with regex but sometimes you need create your own code for it. You can't do everything with it.

How to do this depends on what you know about the csv sent. It looks like there isn't any values within double quotes if you do not count the json part?

Some regex engines has recurson.
If that works finding json parts with this expression \{((?>[^{}]+|(?R))*)\}
Description how it works Recursion explained here.

Here is a guide how csv can be parsed if it has double quoted parts.
Expression: (?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
Guide to parse cvs

If you know that cvs do not contain any double quoted values, then it may be doable if you convert it in a two step process.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.