
I have a 17 GB tab-separated file and I get the above error when reading it with Python/pandas.

I am doing the following:

data = pd.read_csv('/tmp/testdata.tsv',sep='\t')

I have also tried adding encoding='utf8' and also tried read_table and various flags, including low_memory=True, but I always get the same error at the same line.

I ran the following on the file:

awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv

And it returns 281 for the number of fields, so awk is telling me that the line has the correct 281 columns, but read_csv is telling me it has 331.

I also ran the same awk command on lines 1025973 and 1025975, in case the line numbering was off by one (zero-based versus one-based), and both come back as 281 fields.

What am I missing here?

  • I added the flag error_bad_lines=False and it continues through the entire file. In total, pandas reported 6 rows out of 16441170 as having more than 281 columns; some had 282, others had 300, so the number was not consistent. What I am looking for is a tool or diagnostic that I can run against those rows that will split them the way read_csv does, since awk tells me all the rows have 281 columns. Commented Apr 20, 2016 at 18:36

1 Answer


To debug this, I took my header line plus the single failing line from above and ran just those two lines through read_csv. I then got a different error:

Error tokenizing data. C error: EOF inside string starting at line 1

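The problem turned out to be that, by default, read_csv treats a double quote that appears immediately after the delimiter as the start of a quoted field and then looks for the matching closing quote, reading past tabs and line breaks until it finds one.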

I incorrectly assumed that if I specified sep="\t" it would split only on tabs and not care about any other characters.
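For anyone who wants a diagnostic along the lines asked for in the comment, here is a rough sketch that compares a plain tab split (what awk does) with a quote-aware split. It uses Python's standard csv module as a stand-in, on the assumption that its default quote handling is close enough to read_csv's to show the effect; it is not the pandas tokenizer itself. The three-column sample data and the helper names are made up for illustration.

```python
import csv
import io

# Hypothetical three-column sample. The third field of the "2" row starts
# with a double quote that is only closed in the middle of the next line.
sample = (
    "id\tname\tnote\n"
    "1\talpha\tfine\n"
    "2\tbeta\t\"oops\n"
    "3\t5\" pipe\tok\n"
)

def naive_counts(text):
    """Field count per physical line, splitting on tabs only (like awk -F'\\t')."""
    return [len(line.split("\t")) for line in text.splitlines()]

def parsed_counts(text, quoting):
    """Field count per parsed record, using the csv module's quote handling."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return [len(row) for row in reader]

# Tabs only: every line has 3 fields.
print(naive_counts(sample))
# Default quoting: the open quote swallows a newline and a tab, so the
# last two lines merge into one record with 4 fields.
print(parsed_counts(sample, csv.QUOTE_MINIMAL))
# Quote handling disabled: back to 3 fields per line.
print(parsed_counts(sample, csv.QUOTE_NONE))
```

The merged record is why the reported field count can be higher than anything awk sees, and why it varies from row to row: it depends on where the next double quote happens to appear.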

Long story short, to fix this, add the following flag to read_csv:

quoting=3, which is csv.QUOTE_NONE.
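As a minimal sketch of the fix, this is roughly what the call looks like with the named constant instead of the bare 3. The small in-memory sample is made up for illustration and stands in for the real file; with quoting disabled, the double quotes stay in the data as ordinary characters.

```python
import csv
import io

import pandas as pd

# Hypothetical sample with stray double quotes in two fields.
sample = (
    "id\tname\tnote\n"
    "1\talpha\tfine\n"
    "2\tbeta\t\"oops\n"
    "3\t5\" pipe\tok\n"
)

# csv.QUOTE_NONE is the constant 3; it tells read_csv not to treat
# double quotes as field delimiters, so only tabs split the fields.
df = pd.read_csv(io.StringIO(sample), sep="\t", quoting=csv.QUOTE_NONE)

print(df.shape)
print(df)
```

For the real file, the same two arguments would go on the original call, with '/tmp/testdata.tsv' in place of the StringIO object.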
