Python Pandas: Error tokenizing data. C error: EOF inside string starting when reading 1GB CSV file

Question

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1,106,012 rows and 171 columns. Smaller files are read without any error, but this 1 GB file fails every time at exactly line 1106011, which is the second-to-last line of the file. I could remove that line manually, but that is not a real solution: I have hundreds of other files of the same size and cannot fix them all by hand. Can anyone help me with this?

import pandas as pd

def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    # Read one chunk: skip the first `eachRow` lines, then read `size_of_chunk` rows.
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
                     # also tried, without success:
                     # engine='python'
                     # quoting=csv.QUOTE_NONE
                     # encoding='utf-8'

    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    # Convert every column to lower-case strings.
    df = df.apply(lambda x: x.astype(str).str.lower())

    return df

I then call this function within a loop, and it works fine for the smaller files.

huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)

I read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654 and many more, and tried adding read_csv parameters such as

engine='python'

quoting=csv.QUOTE_NONE  # this hangs, and even freezes the Python shell; I don't know why

encoding='utf-8'

but none of them worked. It still throws the following error.

Error:

Traceback (most recent call last):
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
    huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
    sep=',')
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
  File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
  File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
  File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>>

Can you show us a valid row and the invalid row (the second-to-last one, which you removed)?
– Indent, Commented Oct 19, 2017 at 9:00
I can't paste it here; it has 171 columns. It looks like a normal row, but when pandas reads it, it throws the error above on the second-to-last line of the file.
– Wcan, Commented Oct 19, 2017 at 9:08
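
For anyone who wants to inspect the offending row without opening a 1 GB file in an editor, the following is a minimal sketch, not something from the original thread. The helper names `show_line` and `has_unbalanced_quotes` are hypothetical, and the sketch assumes that the line number pandas reports matches a physical line in the file, which may not hold if `skiprows` is in use or if quoted fields contain embedded newlines.

```python
from itertools import islice


def show_line(path, line_number):
    """Return the raw bytes of a 1-based physical line, or None if it does not exist."""
    with open(path, "rb") as fh:
        # islice advances lazily, so only one line at a time is held in memory.
        return next(islice(fh, line_number - 1, line_number), None)


def has_unbalanced_quotes(raw_line, quotechar=b'"'):
    """True if the line contains an odd number of quote characters."""
    return raw_line.count(quotechar) % 2 == 1
```

Calling `print(repr(show_line(path, 1106011)))` shows control characters and stray quotes as escape sequences, which makes them easier to spot than in a normal text view. An odd quote count is only a hint that a quoted field is never closed, not proof.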

Indent · Accepted Answer · 2017-10-19 09:03:51Z

8

If you are on Linux, try removing all non-printable characters, then load the file again:

tr -dc '[:print:]\n' < file > newfile
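
On Windows, where `tr` is not available by default, roughly the same cleanup can be done in Python. This is a minimal sketch under stated assumptions rather than an exact port: the function name `strip_non_printable` and the chunk size are hypothetical, and it keeps only printable ASCII bytes plus newline, which is what `'[:print:]\n'` means in the C locale. That also removes tabs, carriage returns and every non-ASCII byte, so it is not suitable for files that contain accented or other non-ASCII text.

```python
# Bytes to keep: printable ASCII (0x20-0x7E) plus the newline character.
KEEP = bytes(range(0x20, 0x7F)) + b"\n"
# Every other byte value is deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)


def strip_non_printable(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every byte that is not in KEEP."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # translate(None, DELETE) removes the listed bytes without remapping the rest.
            dst.write(chunk.translate(None, DELETE))
```

The file is processed in fixed-size chunks, so a 1 GB input does not have to fit in memory. A call such as `strip_non_printable("file.csv", "newfile.csv")` writes the cleaned copy and leaves the original untouched.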


5 Comments

Wcan Over a year ago

I'm on Windows.

Wcan Over a year ago

Can I still do that?

Indent Over a year ago

stackoverflow.com/questions/92438/… (you can try this solution)

Wcan Over a year ago

How can I do this with a pandas DataFrame, on Windows rather than Linux?

Indent Over a year ago

Try installing UnxUtils: sourceforge.net/projects/unxutils/?SetFreedomCookie

Carlos Chaccon (edited by Benkerroum Mohamed) · Accepted Answer · 2020-02-18 15:55:44Z

5

I tried many solutions. Some of them worked but affected the calculations. I ended up using this one, which skips the line that causes the error:

pd.read_csv(file,engine='python', error_bad_lines=False)

#engine='python' provides a better output

answered Feb 18, 2020 at 15:37 by Carlos Chaccon; edited Feb 18, 2020 at 15:55 by Benkerroum Mohamed

2 Comments

Brad123 Over a year ago

This also worked for me. Here's another resource that agrees with this answer: shanelynn.ie/…

yueyanw Over a year ago

error_bad_lines is deprecated; use on_bad_lines='warn' or on_bad_lines='skip' instead.
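
To illustrate the renamed parameter, here is a minimal sketch that assumes pandas 1.3 or later, where `on_bad_lines` is available. The sample data is made up for the example: its third line has one field too many. The sketch only shows how rows with the wrong number of fields are skipped; it does not claim that the same setting handles an unterminated quote at the end of a file.

```python
import io

import pandas as pd

# Hypothetical sample: the third line ("3,4,5") has one field more than the header.
sample = "a,b\n1,2\n3,4,5\n6,7\n"

# 'skip' drops malformed rows silently; 'warn' drops them and emits a warning.
df = pd.read_csv(io.StringIO(sample), engine="python", on_bad_lines="skip")
print(df)
```

Using `on_bad_lines="warn"` during a first pass may be the safer choice, because it reports which rows were dropped instead of discarding them silently.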
