Handle comment lines when reading CSV using pandas

Question

Here is a simple example:

import pandas as pd
from io import StringIO
s = """a   b   c
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""
df = pd.read_csv(StringIO(s), sep=r'\s+', comment='-')
df

    a    b    c
0  A1  1.0  2.0
1   A  NaN  NaN
2   B  NaN  NaN

For lines that contain the comment character but do not start with it, pandas treats everything from the first - onward as a comment, so A-2 is truncated to A.

My question: how can I make pandas skip only the lines that start with -?

Not important, but out of curiosity: can pandas handle two different types of comment lines, starting with either # or -?

import pandas as pd
from io import StringIO
s = """a   b   c
# comment line
------------
A1   1    2
A2  -NA-  3
------------
B1   2   -NA-
------------
"""
df = pd.read_csv(StringIO(s), sep=r'\s+', comment='#-')
df

This raises ValueError: Only length-1 comment characters supported

Andrej Kesely · Accepted Answer · 2021-06-03 22:38:34Z

3

Another solution: You can "preprocess" the file before .read_csv. For example:

import re
import pandas as pd
from io import StringIO


s = """a   b   c
# comment line
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""

df = pd.read_csv(
    StringIO(re.sub(r"^-{2,}", "", s, flags=re.M)), sep=r"\s+", comment="#"
)
print(df)

Prints:

     a     b     c
0   A1     1     2
1  A-2  -NA-     3
2  B-1     2  -NA-
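
The same substitution also covers the first example from the question, where only the hyphen rules need to go. The following is a minimal sketch under that assumption; the variable name cleaned is introduced here for illustration. Because the regex only blanks runs of two or more hyphens at the start of a line, values such as A-2 and -NA- are left alone, and no comment argument is needed since read_csv skips blank lines by default.

```python
import re
from io import StringIO

import pandas as pd

s = """a   b   c
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""

# Blank out any run of two or more hyphens at the start of a line.
cleaned = re.sub(r"^-{2,}", "", s, flags=re.M)

# The rule lines are now empty, and read_csv skips blank lines by default,
# so the comment parameter is not used at all.
df = pd.read_csv(StringIO(cleaned), sep=r"\s+")
print(df)
```

Note that -NA- is still read as a plain string here, as in the output above.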


Prune · Accepted Answer · 2021-06-03 22:32:14Z

1

pandas' read_csv supports only a single comment character. Choose one, and then remove the other kind of line yourself. For instance:

df = pd.read_csv(StringIO(s), sep=r'\s+', comment='-')

This gives you

    a        b     c
0   #  comment  line
1  A1        1     2
2  A2      NaN   NaN
3  B1        2   NaN

Now drop any row whose value in column a starts with #.
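
One possible way to do that drop is sketched below, using the second example from the question. It assumes the # line happens to split into the same number of fields as the data, which is the caveat raised in the comments below. The final numeric conversion is an extra step not in the original answer: the stray # row forces columns b and c to be read as text.

```python
from io import StringIO

import pandas as pd

s = """a   b   c
# comment line
------------
A1   1    2
A2  -NA-  3
------------
B1   2   -NA-
------------
"""

df = pd.read_csv(StringIO(s), sep=r"\s+", comment="-")

# Keep only the rows whose first column does not start with "#".
df = df[~df["a"].str.startswith("#")].reset_index(drop=True)

# The "#" row made b and c text columns, so convert them back to numbers.
df[["b", "c"]] = df[["b", "c"]].apply(pd.to_numeric)
print(df)
```

Everything after the first - on a line is still discarded, so the -NA- cells and anything following them on the same line come back as NaN.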


2 Comments

wsdzbm Over a year ago

It's not a good method. If the comment line # comment line were # comment a b c d (a different number of columns from the data), the shape of the DataFrame would change.

Prune Over a year ago

In that case, we check the specs to see whether hyphen lines are singular. Use a comment marker of # and drop the hyphen lines later. If neither, then the method above won't work.

Spectric · Accepted Answer · 2021-06-04 01:24:23Z

0

If I understand your problem correctly, I think the best way to solve it is to do some preprocessing before you pass your CSV file to pandas.

For example, you could use a regex to create a temporary CSV file with the comment lines removed and pass that to pandas.
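
A rough sketch of that idea follows. It filters the lines in memory instead of writing a temporary file. The helper name read_table_skipping_comments and the rule for what counts as a comment line (a line starting with #, or a line made up only of hyphens) are assumptions chosen for this illustration, not part of pandas.

```python
import io
import re

import pandas as pd

# Assumed rule: a comment line starts with "#" or consists only of hyphens.
COMMENT_LINE = re.compile(r"^\s*(#.*|-+)\s*$")


def read_table_skipping_comments(lines):
    """Parse whitespace-separated data, ignoring whole comment lines.

    `lines` is any iterable of text lines that keep their line endings,
    such as an open file handle or an io.StringIO object.
    """
    kept = [line for line in lines if not COMMENT_LINE.match(line)]
    return pd.read_csv(io.StringIO("".join(kept)), sep=r"\s+")


s = """a   b   c
# comment line
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""

df = read_table_skipping_comments(io.StringIO(s))
print(df)
```

For a file on disk, the same helper can be given an open file handle, for example inside a with open(...) block. Because only whole lines are matched, hyphens inside values such as A-2 or -NA- are not touched.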

answered Jun 3, 2021 at 22:36 by Vahid Shahrivari; edited Jun 4, 2021 at 1:24 by Spectric

2 Comments

wsdzbm Over a year ago

My real question is the one shown in the first example. For the second one, I preprocessed the file as you said, using grep to remove the lines starting with '#' beforehand. I'm just curious whether pandas has a simple way to deal with it.

wsdzbm Over a year ago

Seems the second example distracted you guys.
