0

I have a csv file similar to this but with about 155,000 rows with years from 1910-2010 and 83 different station id's:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       0TT    0     0   0   0   0 
216565       2008      8    SNOW        0     0T    0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

and I want to replace any value that has a pattern of a number and then one letter or a number and then two letter with NaN.

My desired output then is:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       NaN    0     0   0   0   0 
216565       2008      8    SNOW        0     NaN   0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

I have tried to use:

replace=df.replace([r'[0-9] [A-Z]'], ['NA']) replace2=replace.replace([r'[0-9][A-Z][A-Z]'], ['NA'])

I was hoping by using the pattern of [0-9] [A-Z] would take care of a number and just one letter and then [0-9][A-Z][A-Z] would replace any cells with 2 letters but the file stays the exact same even though no errors are returned.

Any help would be much appreciated.

4
  • Why do you feel that you need to do this? You could just interpret them as NaN instead. Commented Feb 28, 2015 at 3:04
  • I needed it because I didn't know interpreting them as NaN was an option. I should say that the real csv isn't filled in with all zeros like the example above, there are real values besides zero I still need to keep. Commented Feb 28, 2015 at 3:06
  • So you're working with a CSV file? Could you include that in your question? Commented Feb 28, 2015 at 3:07
  • correct, and edited. Commented Feb 28, 2015 at 3:10

3 Answers 3

2

You can use the pandas method convert_objects to do this. You'll set convert_numeric to True

convert_numeric : if True attempt to coerce to numbers (including strings), non-convertibles get NaN

>>> df
   station_id  year  month element    1   2  3  4  5  6
0      216565  2008      7    SNOW  0TT   0  0  0  0  0
1      216565  2008      8    SNOW    0  0T  0  0  0  0
2      216565  2008      9    SNOW    0   0  0  0  0  0
>>> df.convert_objects(convert_numeric=True)
   station_id  year  month element   1   2  3  4  5  6
0      216565  2008      7    SNOW NaN   0  0  0  0  0
1      216565  2008      8    SNOW   0 NaN  0  0  0  0
2      216565  2008      9    SNOW   0   0  0  0  0  0

If you wish to go the route of using replace, you need to modify your call.

>>> df
   station_id  year  month element    1   2  3  4  5  6
0      216565  2008      7    SNOW  0TT   0  0  0  0  0
1      216565  2008      8    SNOW    0  0T  0  0  0  0
2      216565  2008      9    SNOW    0   0  0  0  0  0
>>> df1.replace(value=np.nan, regex=r'[0-9][A-Z]+')
   station_id  year  month element    1    2  3  4  5  6
0      216565  2008      7    SNOW  NaN    0  0  0  0  0
1      216565  2008      8    SNOW    0  NaN  0  0  0  0
2      216565  2008      9    SNOW    0    0  0  0  0  0

This also requires that you import numpy (import numpy as np)

Sign up to request clarification or add additional context in comments.

1 Comment

this is great and so much easier than what I was trying! Thank you!
1

str.replace doesn't do regexes. Use the re module instead (assuming df is a string):

import re
re.sub(r'[0-9][A-Z]+', 'NaN', df)

returns:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       NaN    0     0   0   0   0 
216565       2008      8    SNOW        0     NaN    0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0

However, you would be better off letting e.g. Pandas or np.genfromtxt handle the invalid values automatically.

5 Comments

It expects a string or buffer with this and I have a DataFrame, would I need to convert to string first then with this method?
That would probably work, but it would be better to use pandas or numpy to handle the invalid values when you create the DataFrame or ndarray from a file. Is this possible?
You will have to excuse my illiteracy with this I have only been working with python and pandas for about 2 weeks now so I am not sure if it is possible. The response from Andy seemed to have worked though but I do appreciate your help.
@spotter I think @xnx is interpreting your data as a string output. pandas does have a replace method. I've updated my answer to also include how it would work with replace as well.
I see how I got off track trying to use replace now, both methods are very helpful though, and again, many thanks.
-1
from re import sub

string = "station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW 0TT 0 0 0 0 0 216565 2008 8 SNOW 0 0T 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0"

string = sub(r'\d{1}[A-Za-z]{1,2}', 'NaN', string)

print string

# station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW NaN 0 0 0 0 0 216565 2008 8 SNOW 0 NaN 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0

1 Comment

What happened to the dataframe?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.