I am dealing with a CSV file that contains the text of novels.

book_id, title, content
1, book title 1, All Passion Spent is written in three parts, primarily from the view of an intimate observer. 
2, Book Title 2,  In particular Mr FitzGeorge, a forgotten acquaintance from India who has ever since been in love with her, introduces himself and they form a quiet but playful and understanding friendship. It cost 3,4234 to travel. 

The text in the content column contains commas, and unfortunately when you try to use pandas.read_csv you get pandas.errors.ParserError: Error tokenizing data. C error:
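To see why the parser complains, it can help to count how many comma-separated fields each line actually has. The snippet below is only a diagnostic sketch using the standard csv module and the sample rows from above; `SAMPLE` and `field_counts` are names made up for this illustration.

```python
import csv
import io

# Sample rows copied from the question.
SAMPLE = (
    "book_id, title, content\n"
    "1, book title 1, All Passion Spent is written in three parts, "
    "primarily from the view of an intimate observer.\n"
    "2, Book Title 2,  In particular Mr FitzGeorge, a forgotten acquaintance "
    "from India who has ever since been in love with her, introduces himself "
    "and they form a quiet but playful and understanding friendship. "
    "It cost 3,4234 to travel.\n"
)


def field_counts(text):
    """Return the number of comma-separated fields found on each line."""
    return [len(row) for row in csv.reader(io.StringIO(text))]


counts = field_counts(SAMPLE)
print(counts)  # header has 3 fields, the data rows have more
```

The data rows contain more fields than the header declares, and they do not even agree with each other, which is the kind of mismatch a strict tokenizer rejects.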

There are some solutions to this problem on SO, but none of them worked. I tried reading it as a regular file and then passing it to a DataFrame, but that failed too. SO - Solution

  • Are there ever commas in the id or title? Commented May 3, 2018 at 15:36
  • You are getting the error because there is an extra comma in "been in love with her, introduces h". Commented May 3, 2018 at 15:38
  • Can you replace the first 2 commas with a different delimiter like @ and change the default delimiter in the csv parser? pandas.read_csv(filename, sep='@') and line.replace(',', '@', 2). If there is a comma in the title, you'll need a regex replace to match the title. Commented May 3, 2018 at 15:39
  • @chrisz there can be separators in the title Commented May 3, 2018 at 15:41
  • @Rakesh basically an index mismatch, right? There are more columns than what is in the header. Commented May 3, 2018 at 15:43
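The delimiter swap suggested in the comments can be sketched roughly as follows. This assumes that `@` never occurs in the data and that only the last column contains commas, so it would not cover the case where the title also has separators. `swap_leading_commas` is a hypothetical helper name, not a pandas function.

```python
def swap_leading_commas(lines, n_leading=2, new_sep="@"):
    """Replace only the first n_leading commas on each line with new_sep."""
    return [line.replace(",", new_sep, n_leading) for line in lines]


# Sample rows copied from the question.
raw = [
    "book_id, title, content",
    "1, book title 1, All Passion Spent is written in three parts, "
    "primarily from the view of an intimate observer.",
    "2, Book Title 2,  In particular Mr FitzGeorge, a forgotten acquaintance "
    "from India who has ever since been in love with her, introduces himself "
    "and they form a quiet but playful and understanding friendship. "
    "It cost 3,4234 to travel.",
]

converted = swap_leading_commas(raw)
text = "\n".join(converted)
print(text)
```

The resulting text has exactly two `@` separators per line, so it could then be handed to pandas with something like pd.read_csv(io.StringIO(text), sep="@", skipinitialspace=True).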

1 Answer

You can try reading your file, splitting each line with str.split(",", 2) so that only the first two commas act as separators, and then converting the result to a DataFrame.

Ex:

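import pandas as pd

# filename is the path to your CSV file
with open(filename, "r") as infile:
    # The first line holds the column names
    header = infile.readline().strip().split(",")
    # Split each row on the first two commas only, so commas inside the content stay intact
    content = [i.strip().split(",", 2) for i in infile.readlines()]

df = pd.DataFrame(content, columns=header)
print(df)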

Output:

  book_id          title                                            content
0       1   book title 1   All Passion Spent is written in three parts, ...
1       2   Book Title 2    In particular Mr FitzGeorge, a forgotten acq...

3 Comments

I like it, except you could use content = [i.strip().split(",", 2) for i in infile] to reduce the memory used by the intermediate data list.
@tdelaney Thanks
@Rakesh this was just an example; I actually have more columns (20). In that case, how will split work?
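For the case with more columns, one possible generalization is to derive the split count from the header instead of hard-coding 2. The sketch below assumes exactly one column holds free text with commas and that every row is otherwise well formed; if a second column such as the title can also contain commas, this approach breaks down. `parse_rows` and `text_index` are hypothetical names, and unlike the answer above this version also strips whitespace around each field.

```python
import io


def parse_rows(lines, text_index=None):
    """Split rows where only the column at text_index may contain commas.

    text_index defaults to the last column named in the header.
    """
    lines = iter(lines)
    header = [h.strip() for h in next(lines).strip().split(",")]
    n_cols = len(header)
    if text_index is None:
        text_index = n_cols - 1
    n_after = n_cols - text_index - 1  # columns to the right of the text column

    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split off the columns before the free text, keeping the rest whole
        parts = line.split(",", text_index)
        head, rest = parts[:text_index], parts[text_index]
        # Split off the columns after the free text, working from the right
        tail = rest.rsplit(",", n_after)
        rows.append([p.strip() for p in head + tail])
    return header, rows


# Sample rows copied from the question.
sample = io.StringIO(
    "book_id, title, content\n"
    "1, book title 1, All Passion Spent is written in three parts, "
    "primarily from the view of an intimate observer.\n"
)
header, rows = parse_rows(sample)
print(header)
print(rows)
```

The result can be passed to pd.DataFrame(rows, columns=header) as in the answer. With 20 columns where the text column is last, the default applies; if the text column sits in the middle, pass its zero-based position as text_index.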
