Python: eliminate extra comma (Error tokenizing data. C error: Expected 3 fields in line 29, saw 4)

Question

The error is caused by 'Food, Beverage & Tobacco', which contains an extra comma that prevents pandas from reading the CSV file. It raises this error:

Error tokenizing data. C error: Expected 3 fields in line 29, saw 4

How can I elegantly eliminate the extra comma in the 'GICS industry group' column of the CSV file (for example, on the condition that the comma comes right after 'Food')?

Here is my code:

#!/usr/bin/env python2.7
print "hello from python 2"

import pandas as pd
from lxml import html
import requests
import urllib2
import os


url = 'http://www.asx.com.au/asx/research/ASXListedCompanies.csv'

response = urllib2.urlopen(url)
html = response.read()
#html = html.replace('"','')

with open('asxtest.csv', 'wb') as f:
    f.write(html)

with open("asxtest.csv",'r') as f:
    with open("asx.csv",'w') as f1:
        f.next()#skip header line
        f.next()#skip 2nd line
        for line in f:
             if line.count(',')>2:
                 line[2] = 'Food Beverage & Tobacco'
             f1.write(line)

os.remove('asxtest.csv')

df_api = pd.read_csv('asx.csv')
df_api.rename(columns={'Company name': 'Company', 'ASX code': 'Stock','GICS industry group': 'Industry'}, inplace=True)

How about df_api = pd.read_csv(url, skiprows=1, names=['Company', 'Stock', 'Industry'])
– cs95, Commented Jan 22, 2018 at 14:38
You have a malformed CSV file. Basically, you will need to clean it up beforehand.
– James, Commented Jan 22, 2018 at 14:38
Hmm, I did not notice the problem with the data not being loaded earlier. I've deleted my answer.
– cs95, Commented Jan 22, 2018 at 14:59

James · Accepted Answer · 2018-01-22 15:13:13Z

The file from the URL in your post contains additional commas for some items in the GICS industry group column. The first occurs at line 31 in the file:

ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco

Normally, the 3rd item should be surrounded by quotes so that the comma inside it is not treated as a field delimiter, such as:

ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"
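To make the effect of the quotes concrete, here is a minimal sketch using only the standard library csv module on the two lines shown above. The variable names are illustrative, and the strings are in-memory stand-ins for the downloaded file.

```python
import csv
import io

# The same record, with and without quotes around the 3rd field.
quoted = 'ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"\n'
unquoted = 'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'

# csv.reader treats a comma inside quotes as data, not as a delimiter.
quoted_row = next(csv.reader(io.StringIO(quoted)))
unquoted_row = next(csv.reader(io.StringIO(unquoted)))

print(len(quoted_row))    # 3 fields
print(len(unquoted_row))  # 4 fields, which is what the tokenizer complains about
```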

For this situation, because the first 2 columns appear to be clean, you can merge any additional text into the 3rd field. After this cleaning, load it into a data frame.

You can do this with a generator that will pull out and clean each line one at a time. The pd.DataFrame constructor will read in the data and create a data frame.

import pandas as pd

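def merge_last(file_name, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            # skip the leading non-data lines (title line and blank line)
            if i < skip_lines:
                continue
            # starred unpacking requires Python 3;
            # z collects every piece after the 2nd comma
            x, y, *z = line.strip().split(',')
            yield (x, y, ','.join(z))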

# create a generator to clean the lines, skipping the first 2
gen = merge_last('ASXListedCompanies.csv', 2)
# get the column names
header = next(gen)
# create the data frame
df = pd.DataFrame(gen, columns=header)

df.head()

returns:

          Company name ASX code                 GICS industry group
0          MOQ LIMITED      MOQ                 Software & Services
1       1-PAGE LIMITED      1PG                 Software & Services
2  1300 SMILES LIMITED      ONT    Health Care Equipment & Services
3    1ST GROUP LIMITED      1ST    Health Care Equipment & Services
4         333D LIMITED      T3D  Commercial & Professional Services

And the rows with the extra commas are preserved:

df.loc[27:30]
# returns:
                           Company name ASX code       GICS industry group
27             ABUNDANT PRODUCE LIMITED      ABT  Food, Beverage & Tobacco
28                  ACACIA COAL LIMITED      AJC                    Energy
29  ACADEMIES AUSTRALASIA GROUP LIMITED      AKG         Consumer Services
30         ACCELERATE RESOURCES LIMITED      AX8                Class Pend
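The starred assignment used in the generator is Python 3 syntax, while the question runs Python 2.7. A possible alternative, sketched here as an assumption rather than as part of the original answer, is to pass a maxsplit of 2 to str.split, which behaves the same way on both versions. The helper name split_first_two and the sample lines are hypothetical.

```python
def split_first_two(line):
    """Split on the first two commas only; the rest stays in one field."""
    # maxsplit=2 yields at most 3 pieces, so any later commas
    # remain inside the 3rd field.
    return line.rstrip('\r\n').split(',', 2)


# In-memory sample standing in for lines read from the CSV file.
sample_lines = [
    'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n',
    'ACACIA COAL LIMITED,AJC,Energy\n',
]
rows = [split_first_two(line) for line in sample_lines]
```

The resulting list of 3-item rows can be passed to pd.DataFrame together with the column names, in the same way as the generator output above.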

Here is a more generalized generator that will merge after a given number of columns:

def merge_last(file_name, merge_after_col=2, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            # skip the leading non-data lines
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # keep the first merge_after_col fields, rejoin the rest into one
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))

Consider the case where this problem arises in an in-between column, say column 3, and several more columns follow it. With, say, 10 columns in total, we cannot merge everything after the extra comma into one field, because the affected column is not the last one. How can we deal with that?
You should probably ask that as a new question on Stack Overflow.
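For the in-between case raised in the comment above, one possible sketch follows. It assumes that the total number of columns is known in advance and that only one column can contain stray commas. The function name merge_middle and its parameters are hypothetical and not part of the accepted answer.

```python
def merge_middle(line, n_cols, messy_col):
    """Rejoin surplus pieces into column `messy_col` (0-based).

    Assumes exactly one column may contain unquoted commas and that
    every clean row has `n_cols` fields.
    """
    parts = line.rstrip('\r\n').split(',')
    extra = len(parts) - n_cols  # number of stray commas on this line
    if extra < 0:
        raise ValueError('too few fields: %r' % line)
    # The messy column spans its own piece plus `extra` surplus pieces.
    merged = ','.join(parts[messy_col:messy_col + extra + 1])
    return parts[:messy_col] + [merged] + parts[messy_col + extra + 1:]


fixed = merge_middle('a,b,Food, Beverage & Tobacco,d,e\n', n_cols=5, messy_col=2)
clean = merge_middle('a,b,c,d,e\n', n_cols=5, messy_col=2)
```

If more than one column can contain unquoted commas, the surplus pieces can no longer be assigned unambiguously, and the file has to be fixed at the source.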
