Pandas Unable to Read CSV file using pandas, with extra quote char

Question

I have a CSV file with the following entries:

"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"

The issue arises with the value 8 inches", which contains an extra double quote: I am unable to read the CSV using read_csv().

pandas.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
                quoting=1,
                engine='c', error_bad_lines=False, warn_bad_lines=True,
                encoding="utf-8", converters=pandas_config['converters'],
                skipinitialspace=True, escapechar='\"')

Is there a way to handle the quote within the cell?

What is the error? It's really an issue with the .csv file. I would maybe run a script over the input file to fix the quoting issue. Are there two types of separators (| and ,) as well? Or is the entire entry between the final bar and the end of the line a single column? Can you include the converter you're using?
– mgrollins, Commented Oct 15, 2019 at 17:24
@mgrollins: There is only | as a separator. The issue is actually with the CSV file, but this is a special case, and I am getting a double quote within the string.
– noobie-php, Commented Oct 15, 2019 at 17:27
@mgrollins: This is the error: Exception while performing pandas.read_csv operation. error: Error tokenizing data. C error: EOF inside string starting at row 0, pandas config:
– noobie-php, Commented Oct 15, 2019 at 17:29
Can you clean up the input from obj['Body'] before trying to read it in? Are you sure there are no null characters in the first row prior to the end of the line?
– mgrollins, Commented Oct 15, 2019 at 17:38
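
The suggestion in the comments to clean the input before parsing could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name normalize is hypothetical, the sample is split on both | and , because the sample rows show both, and no field is assumed to contain a literal separator. Note that stripping quotes from the edges of each field also drops the trailing inch mark.

```python
import csv
import io
import re

import pandas as pd

raw = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''


def normalize(text):
    """Rewrite the text as a consistently quoted, pipe-separated CSV (hypothetical helper)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Assumption: fields never contain a literal '|' or ','.
        fields = [field.strip().strip('"') for field in re.split(r"[|,]", line)]
        writer.writerow(fields)
    return out.getvalue()


cleaned = normalize(raw)
# dtype=str keeps every column as text; convert individual columns afterwards.
df = pd.read_csv(io.StringIO(cleaned), sep="|", dtype=str)
print(df)
```

With real data the text would come from obj['Body'].read().decode("utf-8") instead of the inline sample.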

Valdi_Bo · Accepted Answer · 2019-10-15 19:43:21Z

2

Start by passing parameters appropriate for this case:

sep='[|,]' - there are two separators, a pipe character and a comma, so define them as a regex.
skipinitialspace=True - your source text contains extra spaces after the separators, so you should drop them.
engine='python' - to suppress the warning about falling back to the 'python' engine, which appears because the C engine does not support regex separators.

These options alone let you call read_csv without an error, but the downside (for now) is that the double quotes remain.

To eliminate them, at least from the data rows, another trick is needed:

Define a converter (lambda) function:

cnv = lambda txt: txt.replace('"', '')

and apply it to all source columns.

In your case you have 5 columns, so to keep the code concise, you can use a dictionary comprehension:

{ i: cnv for i in range(5) }

So the whole code can be:

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
    engine='python', converters={ i: cnv for i in range(5) })

and the result is:

  "column1"  "column2"       "column3"  "column4"  "column5"
0      123    sometext   this somedata   8 inches      hello

But remember that all columns are now of string type, so you should convert the required columns to numbers. An alternative is to pass a second converter for the numeric columns that returns a number instead of a string.
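
The second-converter idea can be sketched as follows. The names to_text and to_int are hypothetical, and treating the first column as an integer is an assumption based on the sample value 123. The converters also strip whitespace, as a precaution in case any remains around the fields.

```python
import io

import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''


def to_text(raw):
    """Strip surrounding whitespace and double quotes from a raw field."""
    return raw.strip().strip('"').strip()


def to_int(raw):
    """Clean the field like to_text, then convert it to an integer."""
    return int(to_text(raw))


# Column 0 is assumed to be numeric; the other four stay as text.
converters = {i: to_text for i in range(5)}
converters[0] = to_int

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', converters=converters)
print(df.iloc[0].tolist())
```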

To have proper column names (without double quotes), you can pass additional parameters:

skiprows=1 - to omit the initial line,
names=["column1", "column2", "column3", "column4", "column5"] - to define the column list on your own.
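
Putting skiprows and names together, a minimal sketch might look like this. The helper name clean is hypothetical, and keying the converters by column name rather than by index is a choice made here for readability, not something the answer prescribes.

```python
import io

import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

names = ["column1", "column2", "column3", "column4", "column5"]


def clean(raw):
    """Strip surrounding whitespace and double quotes from a raw field."""
    return raw.strip().strip('"').strip()


df = pd.read_csv(
    io.StringIO(txt),
    sep='[|,]',
    engine='python',
    skiprows=1,    # skip the quoted header line
    names=names,   # supply clean column names instead
    converters={name: clean for name in names},
)
print(df)
```

Because the column names are supplied explicitly, converters keyed by those names refer to columns that are guaranteed to exist.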


2 Comments

noobie-php Over a year ago

Thanks for the info. Let's assume I have a proper converter; I am still failing on datetime columns.

Valdi_Bo Over a year ago

If your converter for datetime fields fails, then most likely the problem is in the converter itself. To investigate, call the converter on a few sample inputs, look at the results, and correct it until it returns the proper values.
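
Testing a converter in isolation, as described in the comment above, could be done along these lines. This is a sketch only: the function name to_datetime, the sample strings, and the timestamp format "%Y-%m-%d %H:%M:%S" are all assumptions, since the thread does not show the real data.

```python
from datetime import datetime


def to_datetime(raw):
    """Hypothetical converter: clean the field, then parse an assumed timestamp format."""
    text = raw.strip().strip('"').strip()
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Made-up inputs, including one with an extra quote and one that cannot be parsed.
samples = ['"2019-10-15 17:24:00"', ' "2019-10-15 17:24:00""', '"not a date"']

for sample in samples:
    try:
        print(repr(sample), "->", to_datetime(sample))
    except ValueError as exc:
        print(repr(sample), "-> failed:", exc)
```

Running the converter outside read_csv makes it easier to see which raw strings it rejects.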

kantal · Accepted Answer · 2019-10-15 19:42:42Z

0

We can specify a somewhat complicated separator, read the data, and strip the extra quote chars:

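import io
import pandas as pd

# Test data:
text = '''"column1"| "column2"| "column3"| "column4"| "column5" 
        "123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)

# Split on: a quote, optional whitespace, a pipe or comma, optional whitespace, a quote.
df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")
# Make it tidy: strip the leftover quotes and repair the first and last column names.
df = df.transform(lambda s: s.str.strip('"'))
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]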

The result:

  column1   column2        column3   column4 column5
0     123  sometext  this somedata  8 inches   hello


1 Comment

noobie-php Over a year ago

I am getting an error for the last column, which in my real case is a datetime column (you can say my column5 is of type datetime). I am getting this error: Exception while performing pandas.read_csv operation. error: 'RECORD_CREATED_TIMESTAMP' is not in list, although I have it in my converter list.
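
One possible explanation for the "is not in list" error, offered as a guess rather than something confirmed in this thread: with the quote-based separator, the first and last header names keep a stray quote, so a converters dictionary keyed by the clean name has nothing to match. The sketch below only inspects the parsed names, using the sample data and column5 as a stand-in for the real datetime column.

```python
import io

import pandas as pd

text = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

df = pd.read_csv(io.StringIO(text), sep=r'"\s*[|,]\s*"', engine="python")

# Inspect the header names exactly as pandas parsed them.
parsed_names = list(df.columns)
print(parsed_names)

wanted = "column5"  # stand-in for a name used as a converters key
print(wanted in parsed_names)

# After stripping whitespace and quotes, the expected name is present.
clean_names = [name.strip().strip('"') for name in parsed_names]
print(wanted in clean_names)
```

If this is the cause, supplying the column names explicitly with names= and skipping the header line with skiprows=1 would let name-keyed converters match.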
