I have a CSV file like this:

12012;My Name is Mike. What is your's?;3;0 
1522;In my opinion: It's cool; or at least not bad;4;0
21427;Hello. I like this feature!;5;1

I want to get this data into a pandas.DataFrame, but read_csv(sep=";") throws an exception because of the semicolon in the user-generated message column on line 2 (In my opinion: It's cool; or at least not bad). All the remaining columns always have numeric dtypes.

What is the most convenient way to handle this?

  • Can you explain more about your problem? What's your expected output? Commented Jun 17, 2015 at 18:03
  • My intention is to parse this CSV data into a DataFrame. But it throws an exception because there is a semicolon in one column and pandas thinks it should split it into two columns. Commented Jun 17, 2015 at 18:52
  • Who is generating these ambiguous files and is there any way to move heaven and earth to get them sane? Commented Jun 17, 2015 at 19:25

1 Answer

Dealing with unquoted delimiters is always a nuisance. In this case, since it looks like the broken text is known to be surrounded by three correctly-encoded columns, we can recover. TBH, I'd just use the standard Python reader and build a DataFrame once from that:

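import csv
import pandas as pd

with open("semi.dat", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=";")
    # Keep the first field and the last two fields as they are; rejoin
    # everything in between as the message, restoring its semicolons.
    rows = [x[:1] + [';'.join(x[1:-2])] + x[-2:] for x in reader]

df = pd.DataFrame(rows)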

which produces

       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1

Then we can immediately save it and get something quoted correctly:

In [67]: df.to_csv("fixedsemi.dat", sep=";", header=None, index=False)

In [68]: more fixedsemi.dat
12012;My Name is Mike. What is your's?;3;0
1522;"In my opinion: It's cool; or at least not bad";4;0
21427;Hello. I like this feature!;5;1

In [69]: df2 = pd.read_csv("fixedsemi.dat", sep=";", header=None)

In [70]: df2
Out[70]: 
       0                                              1  2  3
0  12012               My Name is Mike. What is your's?  3  0
1   1522  In my opinion: It's cool; or at least not bad  4  0
2  21427                    Hello. I like this feature!  5  1
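Note that a DataFrame built directly from csv.reader rows holds every field as a string; it is the reload through read_csv that infers the numeric dtypes. If the round trip through a file is unwanted, the columns can be converted in place. A minimal sketch, assuming the rows look like the ones above; the column names are made up for illustration, since the file has no header:

```python
import pandas as pd

# Rows as the csv.reader step would produce them: all fields are strings.
rows = [
    ["12012", "My Name is Mike. What is your's?", "3", "0"],
    ["1522", "In my opinion: It's cool; or at least not bad", "4", "0"],
    ["21427", "Hello. I like this feature!", "5", "1"],
]

# Hypothetical column names (assumption, not from the original file).
df = pd.DataFrame(rows, columns=["user_id", "message", "rating", "flag"])

# Convert every column except the free-text message to a numeric dtype.
for col in ["user_id", "rating", "flag"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

pd.to_numeric raises on a value it cannot parse, which is a useful check that the rejoining step really left only numbers in those columns.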

2 Comments

Works fine. This is a nice workaround. Thanks! Anyway, is there a way to hook into the pandas parser and do the splitting and joining "on the fly"?
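One possible hook, as a sketch only: newer pandas versions (1.4 and later, as far as I know) accept a callable for on_bad_lines when engine="python" is used. The callable receives the already-split fields of a line that has too many of them and returns the repaired list. The helper name merge_message and the in-memory sample are illustrative assumptions.

```python
import io
import pandas as pd

sample = (
    "12012;My Name is Mike. What is your's?;3;0\n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)

def merge_message(fields):
    # fields is the over-long line, already split on ";".
    # Keep the first and the last two fields, rejoin the middle ones.
    return fields[:1] + [";".join(fields[1:-2])] + fields[-2:]

df = pd.read_csv(
    io.StringIO(sample),
    sep=";",
    header=None,
    engine="python",              # callables are only supported by this engine
    on_bad_lines=merge_message,
)
print(df)
```

With header=None the expected number of fields is taken from the first line, so this assumes the first line has no stray semicolon in its message.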
Is there any better solution for large CSV files? This takes too much time.
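For larger files, one thing that may be worth timing is plain string splitting instead of the csv module: split once from the left for the first column and twice from the right for the last two. This is a sketch under the assumption that the file contains no quoting at all and that every line has at least four fields; split_line and read_semi are hypothetical helper names, and I have not measured whether it is actually faster.

```python
import io
import pandas as pd

def split_line(line):
    """Split one raw line into [first, message, second_to_last, last]."""
    first, rest = line.rstrip().split(";", 1)   # cut off the first column
    message, a, b = rest.rsplit(";", 2)         # cut off the last two columns
    return [first, message, a, b]

def read_semi(fp):
    # Skip blank lines; every other line must have at least four fields.
    rows = [split_line(line) for line in fp if line.strip()]
    return pd.DataFrame(rows)

sample = (
    "12012;My Name is Mike. What is your's?;3;0\n"
    "1522;In my opinion: It's cool; or at least not bad;4;0\n"
    "21427;Hello. I like this feature!;5;1\n"
)
df = read_semi(io.StringIO(sample))
print(df)
```

To use it on a real file, pass an open file object, for example read_semi(open("semi.dat")). As with the csv.reader version, all columns come back as strings.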
