How to distinguish between the empty string "" and null input?

Question

In Python 3 I'm using the csv package

import csv

to read sparse comma-separated CSV files. Among the elements are some that are empty strings

...,"",...

and some that are null

...,,...

I'd like to distinguish between those. How can one do that?

Why? By CSV quoting rules, there is no difference between the two. — chepner
– chepner, Commented Oct 2, 2019 at 15:02
Possible duplicate of csv reader behavior with None and empty string — gstukelj
– gstukelj, Commented Oct 2, 2019 at 15:32
Why? Because the process that produces the CSV is outside my control, and it differentiates between the empty string value and the null value. I'd like to use that difference. (And I do, but by pre- and post-processing outside Python.) To quote Wikipedia, "The CSV file format is not fully standardized." No kidding! — djc
– djc, Commented Oct 3, 2019 at 16:36

ZachL · Accepted Answer · 2020-09-29 09:36:24Z

1

If you substitute a safe placeholder (one that won't match an existing string in the csv file) for a bare pair of double quotes ("") before parsing, you can then substitute back once parsed and identify strings that have a single occurrence of the placeholder:

import csv
lines = [
    '"row1_col1","row1_col2 with ""embedded"" quotes",3,"",,"row1_col6"',
    '"row2_col1","row2_col2",3,"",,"row2_col6"',
]
lines = [line.replace('""', '__PAIR_OF_DOUBLE_QUOTES__') for line in lines]
csv_reader = csv.reader(lines, delimiter=',')
rows = []
for row in csv_reader:
    for col in range(len(row)):
        # empty string is null/none
        if row[col] == '':
            row[col] = None
        # string with just a pair of double quotes is the empty string
        elif row[col] == '__PAIR_OF_DOUBLE_QUOTES__':
            row[col] = ''
        else:
            row[col] = row[col].replace('__PAIR_OF_DOUBLE_QUOTES__', '"')
    rows.append(row)

This results in the following output:

>>> print(json.dumps(rows))
[
    ["row1_col1", "row1_col2 with \"embedded\" quotes", "3", "", null, "row1_col6"],
    ["row2_col1", "row2_col2", "3", "", null, "row2_col6"]
]

answered Sep 29, 2020 at 9:36

ZachL

3612 silver badges6 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

djc Over a year ago

Essentially rewrite the file before reading it as CSV? That's about what the current pre-processing does, but it has to be a bit smart, because the contents of some cells are wildly unpredictable.

ZachL Over a year ago

You're right - it is probably safest to use a uuid4 rather than a display-friendly placeholder. It would be nearly impossible to have collisions that way.

Jan Morawiec · Accepted Answer · 2019-10-02 15:03:08Z

0

This answer suggests it is not possible using the csv module. However, pandas read_csv() has a number of arguments that will allow you to distinguish between NULL and "".

answered Oct 2, 2019 at 15:03

Jan Morawiec

4252 silver badges11 bronze badges

Comments

djc · Accepted Answer · 2019-10-03 16:45:59Z

0

I gather that this simply isn't possible with the csv module, so I'll continue to pre-process CSV files to insert a dummy value for "" (I sure love me some Emacs), then do my deeds with Python, detecting the dummy value and handling it as the empty string and empty values as null. Problem solved, by renting a place to live in Kludge City.

answered Oct 3, 2019 at 16:45

djc

311 silver badge3 bronze badges

Collectives™ on Stack Overflow

How to distinguish between the empty string "" and null input?

3 Answers 3

2 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related