
Good morning. I'm working on a project and am having trouble reading a CSV file with more than 36,000 rows. When I read it normally, I get a ParserError:

import pandas as pd

# ISO-8859-1 is the only encoding that gets anywhere. I downloaded this report from Salesforce and made sure the encodings match.
df = pd.read_csv('File.csv', encoding='ISO-8859-1', dtype='str')
ParserError: Error tokenizing data. C error: EOF inside string starting at row 21276

To investigate, I ran this cell:

import csv

file_path = 'File.csv'

with open(file_path, 'r', encoding='ISO-8859-1') as file:
    reader = csv.reader(file)
    for i, line in enumerate(reader):
        if i == 21275:  # enumerate() is zero-based, so this is row 21276
            print(f'Line 21276: {line}')
            print(len(line))  # number of fields in this row
            break

Here is the output (actual info replaced with placeholder).

Line 21276: ['FIRST LAST', 'Company', '12345', 'Place', 'Place', 'Place', '123456', '123456', '1_2_34', '', 'Location', '', 'Thing', 'None', '', '12.345', '0.000', 'Something', 'Status', '1/2/2034 12:00 AM', '3/45/6789', '', '', '', '', '']
26

I opened the CSV file in Excel and the columns run from A to Z, so that's 26 columns. Everything lines up.

I've tried this and got a different issue:

df = pd.read_csv('File.csv', encoding='ISO-8859-1', dtype='str', quoting=csv.QUOTE_NONE)
ParserError: Error tokenizing data. C error: Expected 27 fields in line 535, saw 28

I diagnosed this similarly.

with open(file_path, 'r', encoding='ISO-8859-1') as file:
    for i, line in enumerate(file):
        if i == 534:  # enumerate() is zero-based, so this is line 535
            print(f'Line 535: {line}')
            print(len(line.split(',')))  # naive split: also counts commas inside quoted fields
            break

Line 535: "FIRST LAST","Company","12345","Place","Place","Place","12345","12345","3_4_56","","Place","","Thing","None","","0000.000","0.000","Something","Word","2/34/5678 12:00 PM","","","","0000.00","",""

28

Adding quoting=csv.QUOTE_NONE gets the parser further into the file, but it now sees extra columns. I checked this row in Excel and verified there are only 26 columns.
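One possible explanation for the extra columns, sketched below with the standard csv module: with QUOTE_NONE the quote character gets no special treatment, so any comma inside a quoted field is counted as a delimiter. The sample row is invented for illustration. The placeholder row above does not show embedded commas, so it is an assumption that the real data contains them.

```python
import csv
import io

# Invented sample row: two of the three fields contain a comma inside quotes.
sample = '"Smith, John","Acme, Inc.","12345"\n'

# Default quoting: quoted commas stay inside their fields.
default_fields = next(csv.reader(io.StringIO(sample)))

# QUOTE_NONE: quotes are ordinary characters, so every comma splits a field.
raw_fields = next(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))

print(len(default_fields), default_fields)
print(len(raw_fields), raw_fields)
```

If that is what is happening, QUOTE_NONE hides the original quoting error rather than fixing it, and the field counts it reports will not match what Excel shows.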

Any help is appreciated. Thank you.

  • This is possibly similar to stackoverflow.com/questions/18016037/… Do you have a string with a single quote mark inside your CSV somewhere? That would also explain why QUOTE_NONE helps. You can also try the on_bad_lines='skip' option for read_csv, but that skips any line it cannot parse, so you probably want to fix your data source instead. Commented May 15, 2024 at 17:11
  • Maybe try setting engine='python'? Not sure how it's different to the default parsing engine, but it seems to be able to solve similar issues. Commented May 15, 2024 at 17:16
  • My CSV file has over 30,000 rows. Using Excel's find function, I found three rows that contain double quotes. There are also a few apostrophes in business names (ex: McDonald's). Commented May 15, 2024 at 17:17
  • @IgnatiusReilly Interesting. So when I tried that, I got a new error I've never seen before. ParserError: unexpected end of data Commented May 15, 2024 at 17:20
  • Thank you. I fixed the error. It turned out to be a bad download: it failed partway through, so the file had far fewer rows than it should have. After re-downloading the report, read_csv worked normally. Commented May 15, 2024 at 17:35
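Since the cause turned out to be a truncated download, here is a minimal sketch of two checks that might catch a file cut off inside a quoted field before it reaches pandas. The helper names quotes_balanced and parses_strictly are hypothetical, and the sample strings are invented. The parity check assumes the double quote is the only quote character and that embedded quotes are escaped by doubling. Neither check detects a file that was cut off cleanly at a row boundary, so comparing the row count against what Salesforce reports is still worthwhile.

```python
import csv
import io

def quotes_balanced(text, quotechar='"'):
    # Escaped quotes are doubled (""), so a well-formed file has an even count.
    return text.count(quotechar) % 2 == 0

def parses_strictly(text):
    # With strict=True the csv reader raises csv.Error on malformed input,
    # including input that ends while a quoted field is still open.
    try:
        for _ in csv.reader(io.StringIO(text), strict=True):
            pass
    except csv.Error:
        return False
    return True

complete = '"a","b"\n"c","d"\n'
truncated = '"a","b"\n"c","d'  # cut off before the closing quote

for name, text in [("complete", complete), ("truncated", truncated)]:
    print(name, quotes_balanced(text), parses_strictly(text))
```

For a real file, the text could be read with open('File.csv', encoding='ISO-8859-1', newline='') and passed to the same helpers.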
