3

In Python 3 I'm using the csv package

import csv

to read sparse comma-separated CSV files. Among the elements are some that are empty strings

...,"",...

and some that are null

...,,...

I'd like to distinguish between those. How can one do that?

3
  • 2
    Why? By CSV quoting rules, there is no difference between the two. Commented Oct 2, 2019 at 15:02
  • Possible duplicate of csv reader behavior with None and empty string Commented Oct 2, 2019 at 15:32
  • Why? Because the process that produces the CSV is outside my control, and it differentiates between the empty string value and the null value. I'd like to use that difference. (And I do, but by pre- and post-processing outside Python.) To quote Wikipedia, "The CSV file format is not fully standardized." No kidding! Commented Oct 3, 2019 at 16:36

3 Answers 3

1

If you substitute a safe placeholder (one that won't match an existing string in the csv file) for a bare pair of double quotes ("") before parsing, you can then substitute back once parsed and identify strings that have a single occurrence of the placeholder:

import csv
lines = [
    '"row1_col1","row1_col2 with ""embedded"" quotes",3,"",,"row1_col6"',
    '"row2_col1","row2_col2",3,"",,"row2_col6"',
]
lines = [line.replace('""', '__PAIR_OF_DOUBLE_QUOTES__') for line in lines]
csv_reader = csv.reader(lines, delimiter=',')
rows = []
for row in csv_reader:
    for col in range(len(row)):
        # empty string is null/none
        if row[col] == '':
            row[col] = None
        # string with just a pair of double quotes is the empty string
        elif row[col] == '__PAIR_OF_DOUBLE_QUOTES__':
            row[col] = ''
        else:
            row[col] = row[col].replace('__PAIR_OF_DOUBLE_QUOTES__', '"')
    rows.append(row)

This results in the following output:

>>> print(json.dumps(rows))
[
    ["row1_col1", "row1_col2 with \"embedded\" quotes", "3", "", null, "row1_col6"],
    ["row2_col1", "row2_col2", "3", "", null, "row2_col6"]
]
Sign up to request clarification or add additional context in comments.

2 Comments

Essentially rewrite the file before reading it as CSV? That's about what the current pre-processing does, but it has to be a bit smart, because the contents of some cells are wildly unpredictable.
You're right - it is probably safest to use a uuid4 rather than a display-friendly placeholder. It would be nearly impossible to have collisions that way.
0

This answer suggests it is not possible using the csv module. However, pandas read_csv() has a number of arguments that will allow you to distinguish between NULL and "".

Comments

0

I gather that this simply isn't possible with the csv module, so I'll continue to pre-process CSV files to insert a dummy value for "" (I sure love me some Emacs), then do my deeds with Python, detecting the dummy value and handling it as the empty string and empty values as null. Problem solved, by renting a place to live in Kludge City.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.