I have some data that I'm trying to process. Basically, I want to change all the commas (,) to semicolons (;), but some fields contain text, usernames, or passwords that also contain commas. How do I change all the commas except the ones enclosed in double quotes (")?

Test data:

Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\Some\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\Some\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\Some\Folder,,

What have I tried?

import re

# Backslashes are doubled so Python keeps them literally; \n is left as a real line break on purpose.
secrets = """Secret Name,URL,Username,Password,Notes,Folder,TOTP Key,TOTP Backup Codes
test1,,username,"pass,word",These are the notes,\\Some\\Folder,,
test2,,"user1, user2, user3","pass,word","Hello, I'm mr Notes",\\Some\\Folder,,
test3,http://1.2.3.4/ucsm/ucsm.jnlp,"xxxx\n(use Drop down, select Hello)",password,Use the following\nServer1\nServer2,\\Some\\Folder,,
"""

test = re.findall(r'(.+?\")(.+)(\".+)', secrets)

for line in test:
    part1, part2, part3 = line
    processed = "".join([part1.replace(",", ";"), part2, part3.replace(",", ";")])
    print(processed)

Result:

test1;;username;"pass,word";These are the notes;\Some\Folder;;
test2;;"user1, user2, user3","pass,word","Hello, I'm mr Notes";\Some\Folder;;

It works fine when there's only one quoted field in the line and no line breaks, but when there are more, or there's a line break within the quotes, it breaks. How can I fix this?

FYI: Notes can contain multiple line breaks.

3 Answers

2

You don't need a regex here; take advantage of a CSV parser:

import csv, io

inp = csv.reader(io.StringIO(secrets), # or use file as input
                 quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL)
with open('out.csv', 'w', newline='') as out:  # newline='' is what the csv docs recommend for output files
    csv.writer(out, delimiter=';').writerows(inp)

output file:

Secret Name;URL;Username;Password;Notes;Folder;TOTP Key;TOTP Backup Codes
test1;;username;pass,word;These are the notes;\Some\Folder;;
test2;;user1, user2, user3;pass,word;Hello, I'm mr Notes;\Some\Folder;;
test3;http://1.2.3.4/ucsm/ucsm.jnlp;"xxxx
(use Drop down, select Hello)";password;Use the following
Server1
Server2;\Some\Folder;;

Optionally, use the quoting=csv.QUOTE_ALL parameter in csv.writer.
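To make the effect of that option concrete, here is a minimal in-memory sketch. The helper name `commas_to_semicolons`, its `quote_all` flag, and the sample text are assumptions made up for illustration, not part of the original answer; it only relies on the standard `csv` and `io` modules.

```python
import csv
import io


def commas_to_semicolons(text, quote_all=False):
    """Re-delimit CSV text from ',' to ';' (hypothetical helper)."""
    # newline='' leaves line breaks untouched so the csv module can
    # keep the ones that sit inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=',', quotechar='"')
    out = io.StringIO(newline='')
    quoting = csv.QUOTE_ALL if quote_all else csv.QUOTE_MINIMAL
    writer = csv.writer(out, delimiter=';', quoting=quoting, lineterminator='\n')
    writer.writerows(reader)
    return out.getvalue()


# Made-up sample: one quoted field with a comma, one with a line break.
sample = 'name,password,notes\ntest1,"pass,word","line one\nline two"\n'

print(commas_to_semicolons(sample))                  # quotes only where needed
print(commas_to_semicolons(sample, quote_all=True))  # every field quoted
```

With the default minimal quoting, `pass,word` loses its quotes because a comma is no longer special, while the multi-line note stays quoted. With `quote_all=True` every field is wrapped in quotes, which may be easier for a stricter consumer to parse.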


2

I believe this should do it:

import re
print( re.sub(r'("[^"]*")|,', lambda x: x.group(1) if x.group(1) else x.group().replace(",", ";"), secrets))
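As a rough sketch of why this works: the first alternative consumes a whole quoted span, so the bare-comma alternative never gets to see the commas inside it. The helper name `replace_unquoted_commas` and the sample row below are made up for illustration, and the pattern drops the capture group because the callback can inspect the whole match instead.

```python
import re


def replace_unquoted_commas(text):
    """Turn commas outside double quotes into semicolons (hypothetical helper)."""
    def swap(match):
        token = match.group()
        # A bare comma becomes a semicolon; a quoted span passes through unchanged.
        return ";" if token == "," else token

    # '"[^"]*"' is tried first, so commas inside quotes are consumed with the span.
    return re.sub(r'"[^"]*"|,', swap, text)


# Made-up row, including a field with CSV-style doubled quotes.
row = 'test2,,"user1, user2","say ""hi"", ok",\\Some\\Folder,,\n'
print(replace_unquoted_commas(row))
```

Doubled quotes inside a field still come out intact here, because they keep the quote count even and each piece is matched as its own quoted span. Since `[^"]` also matches line breaks, quoted fields that span several lines should be covered as well.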

1 Comment

Thank you, this works out great and it retains the quotes, but the above method fits better with my current situation.
1

mozway's solution looks like the best way to resolve this, but interestingly, SM1312's regex works almost perfectly with a much simpler replacement argument for the sub function (i.e. r'\1;'):

import re
print (re.sub(r'("[^"]*")|,', r'\1;', secrets))

The only issue is this introduces an extra semicolon after a quoted entry. This happens because the first alternation member (i.e. ("[^"]*")) does not consume a comma, but the replacement argument adds a semicolon regardless of which alternation member matches. Simply adding a comma to the first alternation member resolves this and works perfectly for the sample data:

import re
print (re.sub(r'("[^"]*"),|,', r'\1;', secrets))

However, it fails if the data includes a quoted entry as the last (i.e. the TOTP Backup Codes) column of the data; any commas in the last quoted entry will be changed to semicolons. This is likely not an acceptable failure mode since it is changing the data set. The following resolves that issue, but introduces a different error that may be tolerable; it adds an extra semicolon at the end of the line:

import re
print (re.sub(r'("[^"]*")(,|(?=\s+))|,', r'\1;', secrets))

This is accomplished by replacing the comma in the first alternation member with a nested alternation, (,|(?=\s+)), so that a quoted entry matches when it is followed either by a comma or by whitespace, which includes an end of line. The whitespace check uses the positive lookahead assertion (?=\s+) instead of simply matching whitespace, to avoid consuming the whitespace and eliminating it from the resulting output. Note that \s+ needs at least one whitespace character, so a quoted last entry on a final line with no trailing newline is still not protected.
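To see the three variants side by side, here is a small sketch run against a made-up line whose last column is quoted and contains a comma. The sample line and the dictionary labels are assumptions for illustration. It also assumes Python 3.5 or later, where an unmatched group in the replacement template is substituted with an empty string.

```python
import re

# Hypothetical one-line sample; the last column is quoted and holds a comma.
line = 'a,"b,c","d,e"\n'

patterns = {
    "original": r'("[^"]*")|,',
    "comma added": r'("[^"]*"),|,',
    "with lookahead": r'("[^"]*")(,|(?=\s+))|,',
}

# Apply the same r'\1;' replacement with each pattern.
results = {name: re.sub(pattern, r'\1;', line) for name, pattern in patterns.items()}

for name, out in results.items():
    print(f"{name:15} {out!r}")
```

On this sample, the original pattern gives 'a;"b,c";;"d,e";\n' (a doubled semicolon after the quoted entry), the comma-added pattern gives 'a;"b,c";"d;e"\n' (the last quoted entry is altered), and the lookahead pattern gives 'a;"b,c";"d,e";\n' (the data is intact, with one extra trailing semicolon).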

1 Comment

I will definitely have a look at this tomorrow, thanks a lot, +1
