
This is essentially the opposite of "Get pandas.read_csv to read empty values as empty string instead of nan".

Given the following CSV file:

col,val
"hi
there",1
,2
\f\,3
"",4
"""hi""",5

I want it to be read as:

         col  val
0  hi\nthere    1
1        NaN    2
2        \f\    3
3               4
4       "hi"    5

That is, reading the empty field (val 2) as NaN, while keeping the empty string (val 4) as an empty string.

Currently pd.read_csv converts both val 2 and val 4 to NaN, or, if I use na_filter=False, keeps both as empty strings.
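For reference, a minimal reproduction of both behaviors (assuming the file above is saved as data.csv):

import pandas as pd

df_default = pd.read_csv('data.csv')                    # val 2 and val 4 both become NaN
df_nofilter = pd.read_csv('data.csv', na_filter=False)  # val 2 and val 4 both become ''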

I'm assuming these two representations mean different things in CSV (empty field vs. empty string), so I'm assuming pandas should be able to distinguish them too.

Is there a way to make pandas distinguish these two cases? Or is my assumption wrong and the two representations are actually the same? (If it's the latter, please point me to a CSV standard.)

More information: I got the CSV by exporting a BigQuery table (with the intended meaning: val 2 is null and val 4 is an empty string) to CSV, and I want to get the exact same table back. So this is not just a contrived example; it is actually what BigQuery produces when exporting to CSV.

EDIT: Further searching turned up a GitHub issue from 4 years ago that discusses a similar point (see this comment, for example), where one of the commenters mentions that there is some coercion (I'm not sure what they refer to, but I understand it as coercion between empty field and empty string). Is this still happening?

  • If you use the csv module and print out each line, is there a difference between the two? That would tell you whether pandas has any chance of reading them differently. Commented Nov 30, 2019 at 0:48
  • Perhaps that is equivalent to my question, since pandas uses the csv module under the hood: "How do I get the csv module to read empty field and empty string differently?" (Note: I haven't tried it yet.) Commented Dec 2, 2019 at 2:37
  • What pandas version are you using? This works as expected on 0.25. Commented Dec 9, 2019 at 7:31
  • I'm using 0.25.1, Python 3, on Ubuntu. Can you clarify what you mean by "works as expected"? As in, it works the way I want it to in my question? Commented Dec 10, 2019 at 4:56
  • There's one part of your question that is not standard: you want to mix column datatypes. NaN is a floating-point value, so the null string should not be NaN. I totally understand many use cases for this behavior; on the other hand, it would be better to implement this in a way that is more definite (and that libraries are more likely to support): have another column which tells you whether or not the other column is null. This works regardless of datatype and avoids using a valid value of your domain (the empty string) as an indicator for null (a sketch of this follows the comments). Commented Dec 16, 2019 at 5:24
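For illustration, a sketch of the indicator-column idea from the last comment (the col_is_null column, the file name, and the true/false encoding are all assumptions about the export):

import pandas as pd

# read everything as strings; null-ness lives in the companion column
df = pd.read_csv('data_with_flag.csv', na_filter=False)
is_null = df['col_is_null'] == 'true'  # hypothetical boolean flag exported alongside 'col'
# 'col' keeps its string dtype; rows where is_null holds are the real nulls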

4 Answers


Another option would be to disable quoting so as to distinguish fields where an empty string is present from fields where nothing is present. The problem in this case is the entries which include newline characters in the text. We need to remove these characters first and merge the lines to create a new data file.

When reading the new data file with quoting disabled, empty values are NaN and empty strings are two quote characters. This dataframe can then be used to find the real NaNs and set them in the original dataframe.

import numpy as np
import pandas as pd

with open('./data.csv') as f:
    lines = f.readlines()

# merge lines where the comma is missing
it = iter(lines)
lines2 = [x if ',' in x else x + next(it) for x in it]
# drop embedded newlines and terminate each merged line with a single \n
lines3 = [l.replace('\n', '') + '\n' for l in lines2]
# write new file with merged lines
with open('./data_merged.csv', 'w+') as f:
    f.writelines(lines3)


# read original data
df = pd.read_csv('./data.csv', na_filter=False)
# read merged lines data with quoting off
df_merged = pd.read_csv('./data_merged.csv', quoting=3)

# a NaN in df_merged is a real NaN (an empty field, not an empty string)
# set the corresponding rows in the original df to NaN
df.loc[df_merged.col.isna(), 'col'] = np.nan
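As a sanity check (assuming the sample data from the question), df should now match the desired output:

print(df)
#          col  val
# 0  hi\nthere    1
# 1        NaN    2
# 2        \f\    3
# 3               4
# 4       "hi"    5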

4 Comments

Cool trick. This does seem to solve the problem, at the expense of twice the file size and twice the reading time. I still prefer a solution that does not require writing a copy of the data, but this seems like the best contender so far! Thanks mjspier.
This seems to be the best answer so far (preserve the newlines instead of replacing them), although I'm still looking for better solution.
Actually, I think there might be a big problem with this solution: the merging of the lines might not be correct when a comma appears in the text itself. I think replacing the empty string "" at the beginning of a line might be a more stable solution.
You are right in that aspect, but I think trying to handle newlines in quoted CSV correctly amounts to writing a CSV parser, so at this point the best answer is still "use QUOTE_NONE with some handling of newlines". And since yours came first, I gave the +50 to you.

pandas.read_csv accepts a quoting argument that controls quoting behavior for each field. It takes an int or one of the csv.QUOTE_* constants defined in the csv module. The one to take note of here is csv.QUOTE_NONE. This constant instructs the reader to perform no special processing of quote characters, which means that double-quoted fields are read as they are and doubled quote characters are not collapsed while parsing. The default value used by pandas is csv.QUOTE_MINIMAL.

In [237]: import csv
In [238]: import pandas as pd
In [239]: df = pd.read_csv("test.csv", quoting=csv.QUOTE_NONE)

In [240]: df
Out[240]: 
        col  val
0       "hi  NaN
1    there"  1.0
2       NaN  2.0
3       \f\  3.0
4        ""  4.0
5  """hi"""  5.0

With no special quoting, empty fields are parsed as NaN and double-quoted empty strings are left as they are.

But there is a problem with this approach: if any field contains newlines inside double quotes, they are treated as record separators. This is evident with the first record in the csv file, where "hi\nthere" is parsed into separate rows by pandas. To get around this, I first performed some pre-processing with the re module, replacing any newline characters inside double-quoted strings with whitespace. I then wrote the result back to the same file and read it with read_csv as above. As I'm not aware of the full format of your data, more regexes may be required. However, for the given problem, I get the desired output.

In [314]: with open("test.csv", 'r+') as f:
     ...:     data = f.read()
     ...:     import re
     ...:     pattern = re.compile(r'".*?"', re.DOTALL)
     ...:     data = pattern.sub(lambda x: x.group().replace('\n', ' '), data)
     ...:     f.seek(0)
     ...:     f.write(data)

In [315]: df = pd.read_csv("test.csv", quoting=csv.QUOTE_NONE)

In [316]: df
Out[316]: 
          col  val
0  "hi there"    1
1         NaN    2
2         \f\    3
3          ""    4
4    """hi"""    5

3 Comments

I think if you pass in a file object opened with newline='' as recommended in the csv module docs, instead of a path, you won't have the problem with newlines in quoted strings. I.e. with open("test.csv", "r+", newline="") as f and pd.read_csv(f, ...).
If what alkasm said is true, that would be a good solution; we'd just need to post-process all fields to remove the initial and final quotes, then replace consecutive double quotes with a single double quote (sketched after these comments).
Dang! Well, you could read it into a list of rows first with the standard csv module and then construct the dataframe, which works but you lose all the other pd.read_csv() keywords.
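A minimal sketch of the post-processing suggested in the comments above, applied to the QUOTE_NONE result (the unquote helper is hypothetical, and the caveat about escaped quotes inside fields applies here as well):

def unquote(v):
    # strip one pair of surrounding quotes, then collapse doubled quotes
    if isinstance(v, str) and len(v) >= 2 and v[0] == '"' and v[-1] == '"':
        return v[1:-1].replace('""', '"')
    return v

df['col'] = df['col'].apply(unquote)

On the output above this turns "hi there" into hi there, "" into an empty string, and """hi""" into "hi", while leaving NaN and \f\ untouched.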

Here's a somewhat ugly but complete answer:

import io
import re
import pandas as pd

with open('overflow.csv', 'r') as f:
    with io.StringIO(re.sub(r'(^"",)', "EMPTY_STR,", f.read(), flags=re.MULTILINE)) as ff:
        with io.StringIO(re.sub(r'(,"",)', ",EMPTY_STR,", ff.read(), flags=re.MULTILINE)) as fff:
            with io.StringIO(re.sub(r'(,""$)', ",EMPTY_STR", fff.read(), flags=re.MULTILINE)) as ffff:
                df = pd.read_csv(ffff)

df = df.replace('EMPTY_STR', '')

The re.sub() calls replace the quoted empty string with EMPTY_STR, which can later be replaced back with an actual empty string. re.sub() has to be called three times, once for each possible position of the occurrence (beginning, middle, and end of the line).

Truly empty cells are left alone and indeed interpreted as NaN.

1 Comment

With regex solutions my worry is that they might replace not just the intended occurrences but others too, because quotes inside a field are also escaped as two double quotes. For example, if the first field is a,",b and the second one is cd, then in the CSV the line will be "a,"",b",cd, so your middle regex matches it even though the "" is actually part of the field. OTOH, even though this answer has its weakness, I can definitely check whether my data contains such constructions. (BTW, I believe there should be a way to use only one regex; see the sketch below.)
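For what it's worth, here is a sketch of a single-substitution variant using lookarounds (it shares the same weakness with escaped quotes inside fields):

import re

data = re.sub(r'(?m)(?:^|(?<=,))""(?=,|$)', 'EMPTY_STR', data)

Because the lookbehind and lookahead don't consume the surrounding commas, adjacent empty strings such as "","" are all matched in one pass.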

Is there any way for you to replace the empty strings with something else when creating the BigQuery CSV export, e.g. replace "" with "EMPTY_STR"? You could then use a converter function to turn those back into empty strings in .read_csv().
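A sketch of what the read side could look like (the EMPTY_STR placeholder, the file name, and the column name are assumptions):

import numpy as np
import pandas as pd

def restore(raw):
    # converters receive the raw field text and bypass NA detection
    if raw == 'EMPTY_STR':
        return ''       # placeholder written at export time
    if raw == '':
        return np.nan   # truly empty field -> null
    return raw

df = pd.read_csv('export.csv', converters={'col': restore})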

