3

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1106012 rows and 171 columns. Smaller files finish successfully without any error, but when I read this 1 GB file, it fails every time on exactly line 1106011, which is the second-to-last line of the file. I can remove that line manually, but that is not a solution, because I have hundreds of other files of the same size and cannot fix them all by hand. Can anyone help me with this, please?

import pandas as pd

def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    # Read up to size_of_chunk rows, skipping the rows given by eachRow.
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
                     # engine='python'
                     # quoting=csv.QUOTE_NONE
                     # encoding='utf-8'

    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    # Convert every column to lowercase strings.
    df = df.apply(lambda x: x.astype(str).str.lower())

    return df

I then call this function inside a loop, and it works just fine:

huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
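Calling read_csv with skiprows and nrows once per chunk means each call has to pass over the skipped rows again. As a minimal sketch of an alternative, the chunksize parameter of read_csv returns an iterator that walks the file once. The helper name iter_clean_chunks and the sample data are hypothetical, and the sketch assumes the file itself has no header line, as header=None in the code above suggests.

```python
import io

import pandas as pd


def iter_clean_chunks(source, header_row, size_of_chunk):
    """Yield cleaned DataFrames of at most size_of_chunk rows each."""
    # chunksize makes read_csv return an iterator instead of one DataFrame.
    reader = pd.read_csv(source, header=None, names=header_row,
                         chunksize=size_of_chunk)
    for chunk in reader:
        chunk = chunk.drop_duplicates(keep="first")
        # Same lowercase conversion as in the function above.
        yield chunk.apply(lambda col: col.astype(str).str.lower())


# Hypothetical in-memory data standing in for the real file.
sample = io.StringIO("A,B\nA,B\nC,D\nE,F\n")
chunks = list(iter_clean_chunks(sample, ["col1", "col2"], size_of_chunk=2))
print(len(chunks))
```

Note that duplicates are still dropped per chunk only, as in the original function. This does not by itself fix a malformed line, since the parser still has to tokenize the whole file.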

I read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654, and many more, and tried adding read_csv parameters such as

engine='python'

quoting=csv.QUOTE_NONE  # hangs, and even freezes the Python shell; I don't know why

encoding='utf-8'

but none of them worked. It still throws the following error:

Error:

Traceback (most recent call last):
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
    huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
  File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
    sep=',')
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
    data = parser.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
    ret = self._engine.read(nrows)
  File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
  File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
  File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
  File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>> 
2
  • Can you show us a valid row and the invalid row (the second-to-last one, which you removed)? Commented Oct 19, 2017 at 9:00
  • I cannot paste it here because it has 171 columns. It looks like a normal row, but when pandas reads it, it throws the error above on the second-to-last line of the file. Commented Oct 19, 2017 at 9:08
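The message "EOF inside string" indicates that the parser reached the end of the file while it was still inside a quoted field, which usually points to an unmatched quote character at or near the reported line. As a rough diagnostic sketch, assuming the default double-quote character, the following lists lines that contain an odd number of quotes. The function name find_unbalanced_quote_lines and the sample data are hypothetical.

```python
import io


def find_unbalanced_quote_lines(lines, quotechar='"'):
    """Yield (line_number, text) for lines with an odd count of quotechar."""
    for number, text in enumerate(lines, start=1):
        # Escaped quotes ("") come in pairs, so they do not change the parity.
        if text.count(quotechar) % 2 == 1:
            yield number, text.rstrip("\n")


# Hypothetical data: line 3 opens a quoted field that is never closed.
sample = io.StringIO('a,b,c\n1,"x",3\n4,"broken,6\n7,8,9\n')
suspects = list(find_unbalanced_quote_lines(sample))
print(suspects)
```

For a real file, pass an open file object instead of the StringIO. A quoted field that legitimately spans several lines will also be reported, so the result is a list of candidates to inspect, not proof of corruption.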

2 Answers

8

If you are on Linux, try removing all non-printable characters, then load your file again after this operation.

tr -dc '[:print:]\n' < file > newfile
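On a system without tr, the same filtering can be approximated in plain Python. This is a minimal sketch, assuming the C-locale meaning of [:print:] (ASCII space through tilde). The function name strip_nonprintable and the file names are hypothetical.

```python
import os
import tempfile

# Bytes to keep: printable ASCII (space through tilde) plus newline.
KEEP = bytes(range(0x20, 0x7F)) + b"\n"
DELETE = bytes(b for b in range(256) if b not in KEEP)


def strip_nonprintable(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every byte outside KEEP."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Read in fixed-size chunks so a 1 GB file never sits in memory at once.
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(chunk.translate(None, DELETE))


# Small demonstration with a temporary file containing stray control bytes.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "file.csv")
dst = os.path.join(workdir, "newfile.csv")
with open(src, "wb") as f:
    f.write(b"a,b\r\n1,\x00x\x1a\n")
strip_nonprintable(src, dst)
with open(dst, "rb") as f:
    cleaned = f.read()
print(cleaned)
```

Like the tr command, this also removes carriage returns and every non-ASCII byte, so accented or other multi-byte characters are lost. Adjust KEEP if the data needs them.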

5 Comments

I'm on Windows.
Can I still do that?
stackoverflow.com/questions/92438/… (you can try this solution)
How can I do this with a pandas DataFrame, on Windows rather than Linux?
5

I looked into many solutions. Some of them worked but affected the calculations. I used this one, which skips the line that causes the error:

pd.read_csv(file, engine='python', error_bad_lines=False)

# engine='python' provides better output
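In pandas 1.3 and later, the error_bad_lines flag is replaced by the on_bad_lines parameter. The sketch below shows the equivalent call on hypothetical in-memory data, where the third line has one field too many and is skipped. Whether this also gets past an "EOF inside string" error depends on how the file is malformed, so treat it as something to try, not a guaranteed fix.

```python
import io

import pandas as pd

# Hypothetical data: the line "3,4,5" has three fields under a two-column header.
data = io.StringIO("a,b\n1,2\n3,4,5\n6,7\n")

# on_bad_lines="skip" drops lines with too many fields instead of raising.
df = pd.read_csv(data, engine="python", on_bad_lines="skip")
print(df)
```

Using on_bad_lines="warn" instead reports each skipped line, which helps to check how much data is being dropped.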

2 Comments

This also worked for me. Here's another resource that agrees with this answer: shanelynn.ie/…
error_bad_lines is deprecated; use on_bad_lines='warn' or on_bad_lines='skip' instead.
