Here is a simple example:

import pandas as pd
from io import StringIO
s = """a   b   c
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""
df = pd.read_csv(StringIO(s), sep=r'\s+', comment='-')
df

    a    b    c
0  A1  1.0  2.0
1   A  NaN  NaN
2   B  NaN  NaN

For lines that contain the comment character but do not start with it, pandas treats everything from the first - onward as a comment, so A-2 is truncated to A and B-1 to B.


My question is as above.

Not important, just out of curiosity: can pandas handle two different types of comment lines, those starting with # and those starting with -?

import pandas as pd
from io import StringIO
s = """a   b   c
# comment line
------------
A1   1    2
A2  -NA-  3
------------
B1   2   -NA-
------------
"""
df = pd.read_csv(StringIO(s), sep=r'\s+', comment='#-')
df

This raises ValueError: Only length-1 comment characters supported

3 Answers

3

Another solution: You can "preprocess" the file before .read_csv. For example:

import re
import pandas as pd
from io import StringIO


s = """a   b   c
# comment line
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""

df = pd.read_csv(
    StringIO(re.sub(r"^-{2,}", "", s, flags=re.M)), sep=r"\s+", comment="#"
)
print(df)

Prints:

     a     b     c
0   A1     1     2
1  A-2  -NA-     3
2  B-1     2  -NA-

1

read_csv supports only a single comment character. Choose one, and then remove the other kind of line afterwards. For instance:

df = pd.read_csv(StringIO(s), sep=r'\s+', comment='-')

This gives you

    a        b     c
0   #  comment  line
1  A1        1     2
2  A2      NaN   NaN
3  B1        2   NaN

Now drop every row whose value in column a starts with #.
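The drop step could look roughly like the sketch below. It reuses the data from the question and assumes that no real data row has a value in column a beginning with # and that the # lines have no more fields than the data rows; the boolean-mask filter is just one way to do it, not necessarily what the answer had in mind.

```python
import pandas as pd
from io import StringIO

s = """a   b   c
# comment line
------------
A1   1    2
A2  -NA-  3
------------
B1   2   -NA-
------------
"""

# Let pandas handle the '-' lines; the '#' line is still parsed as a data row.
df = pd.read_csv(StringIO(s), sep=r"\s+", comment="-")

# Assumption: genuine values in column 'a' never start with '#'.
is_hash_row = df["a"].astype(str).str.startswith("#")
df = df[~is_hash_row].reset_index(drop=True)
print(df)
```

Because the # row was parsed together with the data, column b stays a string column after the drop; convert it afterwards (for example with pd.to_numeric) if numeric values are needed.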

2 Comments

It's not a good method. If the comment line # comment line were # comment a b c d (a different number of columns from the data), the dataframe shape would change.
In that case, we check the specs to see whether hyphen lines are singular. Use a comment marker of # and drop the hyphen lines later. If neither, then the method above won't work.

0

If I understand your problem correctly, I think the best way to solve it is to do some pre-processing before you pass your CSV file to pandas.

For example, you could use a regex to create a temporary CSV file with the comment lines stripped out and pass that to pandas.
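A minimal sketch of that kind of pre-processing, done in memory rather than through a temporary file and with a plain prefix check instead of a regex. The helper name read_without_comment_lines and the na_values setting are illustrative assumptions, not part of the question; it also assumes that no real data line begins with # or -.

```python
import pandas as pd
from io import StringIO


def read_without_comment_lines(text, prefixes=("#", "-")):
    """Drop whole lines that start with a comment prefix, then parse the rest.

    Hypothetical helper: only the start of each line is checked, so a '-'
    inside a value such as 'A-2' is left alone.
    """
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith(prefixes)
    ]
    # Assumption: '-NA-' marks a missing value in this data.
    return pd.read_csv(StringIO("\n".join(kept)), sep=r"\s+", na_values=["-NA-"])


s = """a   b   c
# comment line
------------
A1    1    2
A-2  -NA-  3
------------
B-1   2   -NA-
------------
"""

df = read_without_comment_lines(s)
print(df)
```

For a file on disk, the same filter can be applied to the lines read from the file before handing the joined text to read_csv.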

2 Comments

My real question is the one shown in the first example. For the second one I preprocessed the file as you suggested, using grep to remove lines starting with '#' beforehand. I'm just curious whether pandas has a simple solution for it.
It seems the second example distracted you guys.
