Getting CParserError: Error tokenizing data. C error: Expected 281 fields in line 1025974, saw 331

Question

I have a 17 GB tab-separated file, and I get the above error when reading it with Python/pandas.

I am doing the following:

data = pd.read_csv('/tmp/testdata.tsv',sep='\t')

I have also tried adding encoding='utf8', using read_table, and setting various flags, including low_memory=True, but I always get the same error at the same line.

I ran the following on the file:

awk -F"\t" 'FNR==1025974 {print NF}' /tmp/testdata.tsv

It returns 281 for the number of fields, so awk is telling me that line has the correct 281 columns, but read_csv is telling me it has 331.

I also ran the same awk command on lines 1025973 and 1025975, in case the line number in the error was zero-based, and both come back as 281 fields.

What am I missing here?

I added the flag error_bad_lines=False, and pandas then read through the entire file. In total, pandas reported 6 rows out of 16441170 with more than 281 columns; some had 282 and others had 300, so the count was not consistent. What I am looking for is a tool or diagnostic that I can run against those rows to split them the way read_csv does, since awk tells me every row has 281 columns.
– Severun, Commented Apr 20, 2016 at 18:36

Severun · Accepted Answer · 2016-04-21 18:44:16Z


So to debug this, I took my header line, then took the single line from above and ran it through read_csv. I then got another error:

Error tokenizing data. C error: EOF inside string starting at line 1
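The isolation step described above (header line plus the one suspect line) could be sketched roughly like this. The helper name extract_header_and_line is hypothetical, and the sketch assumes the line number in the error is a 1-based physical line number, which may not hold exactly if quoted fields span lines, so checking neighbouring lines can still be worthwhile.

```python
import itertools


def extract_header_and_line(path, line_number):
    """Return the header (line 1) plus the 1-based line `line_number` as one string.

    Hypothetical helper for illustration; it streams the file, so it does not
    load a multi-gigabyte input into memory.
    """
    if line_number < 2:
        raise ValueError("line_number must be 2 or greater (line 1 is the header)")
    with open(path, "r", encoding="utf-8", newline="") as handle:
        header = handle.readline()
        # The header is already consumed, so line N is at offset N - 2
        # among the remaining lines.
        offset = line_number - 2
        target = next(itertools.islice(handle, offset, offset + 1), "")
    return header + target
```

The returned string can then be wrapped in io.StringIO and passed to read_csv with the same sep='\t' argument to see how that single row is tokenized.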

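The problem turned out to be that, by default, read_csv treats a double quote that appears immediately after the delimiter as the start of a quoted field and looks for the matching closing double quote. Until it finds one, tabs and newlines are read as part of that field rather than as separators.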

I incorrectly assumed that if I specified sep="\t" it would split only on tabs and not care about any other characters.
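To see why awk and a quote-aware parser can disagree, here is a small sketch using Python's standard csv module as a stand-in. It is not pandas' C tokenizer, but it also treats the double quote as a quote character by default. The three-column sample is made up for illustration: one field opens a quote that is only closed on the following line.

```python
import csv
import io

# Made-up sample: a header and two data lines, with a stray quote pair
# that opens on one line and closes on the next.
sample = 'c1\tc2\tc3\n1\t2\t"x\n4"\t5\t6\n'

# Plain tab splitting, line by line (roughly what awk -F"\t" reports).
naive_counts = [len(line.split("\t")) for line in sample.splitlines()]

# Quote-aware parsing with default settings: the quoted field swallows
# the newline, so two physical lines merge into one longer row.
quoted_counts = [
    len(row) for row in csv.reader(io.StringIO(sample), delimiter="\t")
]

# Same parser with quoting disabled: quotes are ordinary characters.
raw_counts = [
    len(row)
    for row in csv.reader(
        io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE
    )
]

print(naive_counts)   # [3, 3, 3]
print(quoted_counts)  # [3, 5]
print(raw_counts)     # [3, 3, 3]
```

The same comparison, run over only the rows pandas flags, is one way to get the kind of diagnostic asked for in the question: any row where the quote-aware count differs from the plain tab count contains a quote that is changing how the line is split.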

Long story short, to fix this, add the following flag to read_csv:

quoting=3, which is csv.QUOTE_NONE.
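As a minimal sketch of that fix, the snippet below reads a small made-up in-memory sample (not the original 17 GB file) that contains a stray quote pair spanning two lines. It uses the named constant csv.QUOTE_NONE rather than the literal 3, which is equivalent but easier to read.

```python
import csv
import io

import pandas as pd

# Made-up three-column sample with a quote opened on one line and
# closed on the next.
sample = 'c1\tc2\tc3\n1\t2\t"x\n4"\t5\t6\n'

# With quoting disabled, double quotes are kept as literal characters
# and only tabs separate fields.
df = pd.read_csv(io.StringIO(sample), sep="\t", quoting=csv.QUOTE_NONE)

print(df.shape)  # (2, 3)
```

One trade-off to keep in mind: with quoting disabled, any quote characters stay in the data as literal text, so fields that were deliberately quoted would need to be cleaned up afterwards.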

