1

I am trying to read several large CSV files, but pandas fails with "Error tokenizing data".

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:  
    df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
    all_csv.append(df)

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

[Image: picture of all_files result]
[Image: here is the result]

2
  • 1
    The answer seems quite clear. One of your files has an unterminated quote: there is a double-quote mark (") without a matching closing quote. You'll need to examine line 110,994 of that file and fix the problem. Commented Sep 22, 2021 at 18:05
  • You could print each filename right before the read to see which one is the problem. Then write a test script that reads just that one file (now we have a simple example). Then look at the CSV to see what's odd about it. Try removing rows. Pandas will guess types from the first handful of rows, but if something later doesn't conform, it can be a problem. Maybe try the Python engine instead for more info. Commented Sep 22, 2021 at 18:08
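
To follow up on the two comments above, here is a minimal sketch for hunting down an unterminated quote once you know which file fails. It uses only the standard library. The function name `find_odd_quote_lines` is made up for illustration, and the check is only a heuristic: a line with an odd number of quote characters is suspicious, but a quoted field that legitimately spans several lines will also be flagged.

```python
def find_odd_quote_lines(csv_path, quote_char='"', encoding="utf-8"):
    """Return (line_number, text) pairs for lines holding an odd number of quote characters.

    Heuristic only: quoted fields that span several lines are flagged too.
    """
    suspects = []
    with open(csv_path, encoding=encoding) as handle:
        # Line numbers start at 1 so they match the numbers in pandas error messages.
        for line_number, line in enumerate(handle, start=1):
            if line.count(quote_char) % 2 == 1:
                suspects.append((line_number, line.rstrip("\r\n")))
    return suspects
```

Calling `find_odd_quote_lines(filename)` on the failing file and comparing the reported line numbers with the one in the pandas error message should narrow things down quickly. An escaped quote written as `""` counts twice, so it does not trigger a false alarm.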

1 Answer

1

There's probably an error in one of the CSV files you are reading.
Try using a print statement to figure out which file it is:

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:
    try:  
        df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
        all_csv.append(df)
    except Exception as e:
        print(f"Problem file: {filename} caused Exception: {e}")
        raise

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

Alternatively you can try changing the parser "engine" to the Python engine (as documented in this blog):

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:  
    df = pd.read_csv(filename, index_col=None, header=0, sep=',', engine='python')  
    all_csv.append(df)

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

But it would be better practice to find the problematic CSV file and fix it. You could also combine the two solutions with something like:

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:
    try:  
        df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
        all_csv.append(df)
    except pd.errors.ParserError as e:
        # The default C engine failed; report the file, then retry with the Python engine.
        print(f"Problem file: {filename} caused Exception: {e}")
        df = pd.read_csv(filename, index_col=None, header=0, sep=',', engine='python')
        all_csv.append(df)

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)

Or simply skip that file if it's OK to be missing data:

import glob  
import pandas as pd

path = "/Users/LAI/Downloads/learn/engagement_data"  
all_files = glob.glob(path + "/*.csv")  
print(all_files)

all_csv = [ ]  
for filename in all_files:
    try:  
        df = pd.read_csv(filename, index_col=None, header=0, sep=',')  
        all_csv.append(df)
    except pd.errors.ParserError as e:
        # Report the file and skip it; its rows will be missing from the result.
        print(f"Problem file: {filename} caused Exception: {e}")

engagement_df = pd.concat(all_csv, axis=0, ignore_index=True)
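
One thing to watch with the skip approach: if every file fails, the list is empty and `pd.concat` raises a `ValueError`. Below is a rough sketch of the same loop wrapped in a function that also returns the files it skipped, so you can go back and fix them later. The name `read_csv_folder` and the shape of the return value are my own choices for illustration, not anything pandas provides.

```python
import glob
import os

import pandas as pd


def read_csv_folder(folder):
    """Read every *.csv in `folder`.

    Returns (combined, failed): `combined` is one DataFrame built from the
    files that parsed, `failed` maps each skipped filename to its error text.
    """
    frames = []
    failed = {}
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        try:
            frames.append(pd.read_csv(filename, header=0, sep=","))
        except pd.errors.ParserError as e:
            # Remember the problem file instead of stopping the whole run.
            failed[filename] = str(e)
    # pd.concat raises ValueError on an empty list, so fall back to an empty frame.
    combined = pd.concat(frames, axis=0, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed
```

With the path from the question this would be `engagement_df, failed = read_csv_folder("/Users/LAI/Downloads/learn/engagement_data")`, and printing `failed` afterwards shows which files still need fixing.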