2

I have a CSV file with the following entries:

"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"

The issue comes when I try to read the cell 8 inches" (note the extra double quote): I am unable to read the CSV using read_csv().

Pandas.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
                quoting=1, engine='c',
                error_bad_lines=False, warn_bad_lines=True,
                encoding="utf-8", converters=pandas_config['converters'],
                skipinitialspace=True, escapechar='\"')

Is there a way to handle a double quote inside a cell?

4
  • What is the error? It's really an issue with the .csv file. I would maybe run a script over the input file to fix the quoting issue. Are there two types of separators (| and ,) as well? Or is the entire entry between the final bar and the end of the line a single column? Can you include the converter you're using? Commented Oct 15, 2019 at 17:24
  • @mgrollins: There is only | as a separator. The issue is actually with the CSV file, but this is a special case where I am getting a double quote within the string. Commented Oct 15, 2019 at 17:27
  • @mgrollins: This is the error: Exception while performing pandas.read_csv operation. error: Error tokenizing data. C error: EOF inside string starting at row 0, pandas config: Commented Oct 15, 2019 at 17:29
  • Can you clean up the input from obj['Body'] before trying to read it in? Are you sure there are no null characters in the first row prior to the end of the line? Commented Oct 15, 2019 at 17:38

2 Answers

2

Start by passing parameters appropriate for this case:

  1. sep='[|,]' - there are two separators, a pipe character and a comma, so define them as a regex.
  2. skipinitialspace=True - your source text contains extra spaces after the separators, so you should drop them.
  3. engine='python' - regex separators are only supported by the python engine; naming it explicitly suppresses the warning about falling back to the 'python' engine.

These options alone allow read_csv to be called without an error, but the downside (for now) is that the double quotes remain.
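
The leftover quotes can be seen without pandas. The sketch below applies the same regex to the sample data row with Python's re.split. It only approximates what a parser does with a regex separator (an assumption for illustration, not pandas' actual code), and the variable names are made up here.

```python
import re

# Sample data row from the question, including the stray double quote.
row = '"123" | "sometext", "this somedata", "8 inches"", "hello"'

# Split on either separator, as sep='[|,]' asks the parser to do.
fields = re.split(r'[|,]', row)

# Every field still carries its quotes (and some surrounding spaces).
for field in fields:
    print(repr(field))
```

The row splits into five fields, which matches the five header columns, but none of the quotes are removed by the split itself.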

To eliminate them, at least from the data rows, another trick is needed:

Define a converter (lambda) function:

cnv = lambda txt: txt.replace('"', '')

and apply it to all source columns.

In your case you have 5 columns, so to keep the code concise, you can use a dictionary comprehension:

{ i: cnv for i in range(5) }
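
As a quick check outside of read_csv, the converter and the dictionary can be tried on their own. This is only a sketch; the sample string is an assumption based on the data row in the question.

```python
# Converter from the answer: remove every double quote from a cell.
cnv = lambda txt: txt.replace('"', '')

# One converter per column position (0 to 4).
converters = {i: cnv for i in range(5)}

print(sorted(converters))      # column positions covered by a converter
print(cnv('"8 inches""'))      # the problematic cell, with quotes removed
```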

So the whole code can be:

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
    engine='python', converters={ i: cnv for i in range(5) })

and the result is:

  "column1"  "column2"       "column3"  "column4"  "column5"
0      123    sometext   this somedata   8 inches      hello

But remember that all columns are now of string type, so you should convert the required columns to numbers. An alternative is to pass a second converter for the numeric columns, one that returns a number instead of a string.
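
The second-converter idea could look like the sketch below. It assumes column 0 holds integers, as the value 123 in the sample suggests; the name to_int is hypothetical and not part of the original answer.

```python
# String converter from the answer: drop the double quotes.
cnv = lambda txt: txt.replace('"', '')

def to_int(txt):
    """Strip quotes and surrounding spaces, then parse an integer."""
    return int(txt.replace('"', '').strip())

# Column 0 is numeric (assumption); the remaining four stay strings.
converters = {0: to_int}
converters.update({i: cnv for i in range(1, 5)})

print(to_int('"123" '))
```

The resulting dictionary can be passed as converters=converters in the read_csv call shown earlier.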

To have proper column names (without double quotes), you can pass additional parameters:

  • skiprows=1 - to omit the initial line,
  • names=["column1", "column2", "column3", "column4", "column5"] - to define the column list on your own.

2 Comments

Thanks for the info. Let's assume I have a proper converter; it is still failing for datetime columns.
If your converter for datetime fields fails, then most likely the problem is in the converter itself. To investigate, call the converter on a few sample inputs, look at the results, and correct it until it returns proper results.
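
One way to follow this advice is to run the converter by hand on a few sample cells. The sketch below is hypothetical: the function name to_timestamp, the format string, and the sample values are all assumptions, since the real converter and data are not shown in the question.

```python
from datetime import datetime

def to_timestamp(txt):
    """Hypothetical datetime converter: strip quotes and spaces, then parse."""
    cleaned = txt.replace('"', '').strip()
    # Assumed format; adjust it to match the real data.
    return datetime.strptime(cleaned, '%Y-%m-%d %H:%M:%S')

# Try the converter on a few representative inputs and inspect the results.
samples = ['"2019-10-15 17:24:00"', ' "2019-10-15 17:29:00""', 'not a date']
results = {}
for cell in samples:
    try:
        results[cell] = to_timestamp(cell)
    except ValueError as exc:
        results[cell] = exc  # keep the error so it can be inspected
    print(repr(cell), '->', repr(results[cell]))
```

Inputs that raise an error here would also fail inside read_csv, so this narrows the problem down before the file is parsed.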
0

We can specify a somewhat more complicated separator, read the data, and strip the extra quote characters:

import io
import pandas as pd

# Test data:
text='''"column1"| "column2"| "column3"| "column4"| "column5" 
        "123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)

df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")
# Make it tidy:
df = df.transform(lambda s: s.str.strip('"'))
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]

The result:

  column1   column2        column3   column4 column5
0     123  sometext  this somedata  8 inches   hello
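
To illustrate what this separator does, the sketch below applies the same regex to the data row with re.split and then strips the leftover quotes. It mirrors the idea only; it is not how pandas implements the parsing, and the variable names are made up for the example.

```python
import re

row = '"123" | "sometext", "this somedata", "8 inches"", "hello"'

# The separator consumes the closing quote, the delimiter, and the opening quote.
parts = re.split(r'"\s*[|,]\s*"', row)
print(parts)

# Only the outer quotes of the first and last field, and the stray one, remain.
cleaned = [p.strip('"') for p in parts]
print(cleaned)
```

Because the regex needs a quote directly before the delimiter, the first of the two quotes after 8 inches stays in the field and is removed only by the final strip.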

1 Comment

I am getting an error for the last column, which in my real case is a datetime column (say column5 is of type datetime). The error is: Exception while performing pandas.read_csv operation. error: 'RECORD_CREATED_TIMESTAMP' is not in list, although I have it in my converter list.
