python csv dealing with comma inside column

Question

I am dealing with a CSV file that contains the text of novels.

book_id, title, content
1, book title 1, All Passion Spent is written in three parts, primarily from the view of an intimate observer. 
2, Book Title 2,  In particular Mr FitzGeorge, a forgotten acquaintance from India who has ever since been in love with her, introduces himself and they form a quiet but playful and understanding friendship. It cost 3,4234 to travel.

The text in the content column contains commas, so when you try to use pandas.read_csv you get pandas.errors.ParserError: Error tokenizing data. C error:

There are some solutions to this problem on SO, but none of them worked. I also tried reading it as a regular file and then passing the result to a DataFrame, which failed. SO - Solution

You are getting the error because there is an extra comma in "been in love with her, introduces h".
– Rakesh, Commented May 3, 2018 at 15:38
Can you replace the first 2 commas with a different delimiter like @ and change the default delimiter in the CSV parser? pandas.read_csv(filename, sep='@') and line.replace(',', '@', 2). If there is a comma in the title, you'll need a regex replace to match the title.
– TwistedSim, Commented May 3, 2018 at 15:39
@Rakesh So basically an index mismatch, right? More columns than there are in the header.
– add-semi-colons, Commented May 3, 2018 at 15:43
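
TwistedSim's suggestion above could be sketched roughly as follows. This is only an illustration, assuming that the last column is the only one containing commas and that the replacement delimiter @ never appears in the data. The sample text and the helper name swap_leading_commas are made up for the example. It uses the standard library's csv.reader; the same rewritten text could instead be passed to pandas.read_csv with sep="@".

```python
import csv
import io

# Sample data in the question's shape (shortened; hypothetical).
RAW = (
    "book_id, title, content\n"
    "1, book title 1, Written in three parts, primarily from one view.\n"
    "2, Book Title 2, A forgotten acquaintance, and a quiet friendship.\n"
)


def swap_leading_commas(line, n_columns=3, delimiter="@"):
    """Replace only the first n_columns - 1 commas with the delimiter."""
    return line.replace(",", delimiter, n_columns - 1)


# Rewrite every line, then parse the result with the new delimiter.
rewritten = "".join(swap_leading_commas(line) for line in io.StringIO(RAW))
rows = list(
    csv.reader(io.StringIO(rewritten), delimiter="@", skipinitialspace=True)
)

header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(record)
```

Commas after the second one are left alone, so they stay inside the content field.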

Rakesh · Accepted Answer · 2018-05-03 16:00:15Z


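You can try reading the file yourself, splitting each line with str.split(",", 2), and then converting the result to a DataFrame. The second argument limits the split to the first two commas, so any commas in the content column stay inside that field.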

Ex:

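import pandas as pd

filename = "books.csv"  # placeholder path; point this at your own file

with open(filename, "r") as infile:
    # The first line holds the column names; strip the space after each comma.
    header = [name.strip() for name in infile.readline().strip().split(",")]
    # Split each row on the first two commas only, so commas inside
    # the last column (content) are kept as part of that field.
    content = [i.strip().split(",", 2) for i in infile.readlines()]

df = pd.DataFrame(content, columns=header)
print(df)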

Output:

  book_id          title                                            content
0       1   book title 1   All Passion Spent is written in three parts, ...
1       2   Book Title 2    In particular Mr FitzGeorge, a forgotten acq...

edited May 3, 2018 at 16:00

answered May 3, 2018 at 15:56


3 Comments

tdelaney Over a year ago

I like it, except you could use content = [i.strip().split(",", 2) for i in infile] to reduce the memory used by the intermediate list.

Rakesh Over a year ago

@tdelaney Thanks

add-semi-colons Over a year ago

@Rakesh This was just an example. I actually have more columns (20); how would split work in that case?
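
For a wider file like the one asked about here, one possible generalization is to derive the split limit from the header instead of hard-coding 2. This is a sketch under the assumption that only the last column contains unquoted commas. If commas can appear in any other column, position alone cannot tell them apart, and the file would need quoting or a different delimiter at the source. The function name read_rows and the four-column sample are hypothetical.

```python
import io


def read_rows(infile):
    """Split each line into as many fields as the header has.

    Assumes only the LAST column may contain extra commas.
    """
    header = [name.strip() for name in infile.readline().strip().split(",")]
    max_splits = len(header) - 1  # 2 for 3 columns, 19 for 20 columns
    rows = []
    for line in infile:  # iterate lazily, as tdelaney suggests
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append([field.strip() for field in line.split(",", max_splits)])
    return header, rows


# Hypothetical four-column sample; the last column holds free text.
sample = io.StringIO(
    "book_id, title, year, content\n"
    "1, book title 1, 1931, Written in three parts, from one view.\n"
    "2, Book Title 2, 1931, It cost 3,4234 to travel.\n"
)
header, rows = read_rows(sample)
print(header)
print(rows)
```

The result can then be handed to pd.DataFrame(rows, columns=header), as in the answer.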
