Replace values of specific pattern with 'NaN' in csv file

Question

I have a csv file similar to this but with about 155,000 rows with years from 1910-2010 and 83 different station id's:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       0TT    0     0   0   0   0 
216565       2008      8    SNOW        0     0T    0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

and I want to replace any value that has a pattern of a number and then one letter or a number and then two letter with NaN.

My desired output then is:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       NaN    0     0   0   0   0 
216565       2008      8    SNOW        0     NaN   0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0   0

I have tried to use:

replace=df.replace([r'[0-9] [A-Z]'], ['NA']) replace2=replace.replace([r'[0-9][A-Z][A-Z]'], ['NA'])

I was hoping by using the pattern of [0-9] [A-Z] would take care of a number and just one letter and then [0-9][A-Z][A-Z] would replace any cells with 2 letters but the file stays the exact same even though no errors are returned.

Any help would be much appreciated.

Why do you feel that you need to do this? You could just interpret them as NaN instead. — Makoto
– Makoto, Commented Feb 28, 2015 at 3:04
I needed it because I didn't know interpreting them as NaN was an option. I should say that the real csv isn't filled in with all zeros like the example above, there are real values besides zero I still need to keep. — spotter
– spotter, Commented Feb 28, 2015 at 3:06
So you're working with a CSV file? Could you include that in your question? — Makoto
– Makoto, Commented Feb 28, 2015 at 3:07

Andy · Accepted Answer · 2015-02-28 03:29:17Z

2

You can use the pandas method convert_objects to do this. You'll set convert_numeric to True

convert_numeric : if True attempt to coerce to numbers (including strings), non-convertibles get NaN

>>> df
   station_id  year  month element    1   2  3  4  5  6
0      216565  2008      7    SNOW  0TT   0  0  0  0  0
1      216565  2008      8    SNOW    0  0T  0  0  0  0
2      216565  2008      9    SNOW    0   0  0  0  0  0
>>> df.convert_objects(convert_numeric=True)
   station_id  year  month element   1   2  3  4  5  6
0      216565  2008      7    SNOW NaN   0  0  0  0  0
1      216565  2008      8    SNOW   0 NaN  0  0  0  0
2      216565  2008      9    SNOW   0   0  0  0  0  0

If you wish to go the route of using replace, you need to modify your call.

>>> df
   station_id  year  month element    1   2  3  4  5  6
0      216565  2008      7    SNOW  0TT   0  0  0  0  0
1      216565  2008      8    SNOW    0  0T  0  0  0  0
2      216565  2008      9    SNOW    0   0  0  0  0  0
>>> df1.replace(value=np.nan, regex=r'[0-9][A-Z]+')
   station_id  year  month element    1    2  3  4  5  6
0      216565  2008      7    SNOW  NaN    0  0  0  0  0
1      216565  2008      8    SNOW    0  NaN  0  0  0  0
2      216565  2008      9    SNOW    0    0  0  0  0  0

This also requires that you import numpy (import numpy as np)

edited Feb 28, 2015 at 3:29

answered Feb 28, 2015 at 3:13

Andy♦

50.8k62 gold badges181 silver badges240 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

spotter Over a year ago

this is great and so much easier than what I was trying! Thank you!

xnx · Accepted Answer · 2015-02-28 03:06:42Z

1

str.replace doesn't do regexes. Use the re module instead (assuming df is a string):

import re
re.sub(r'[0-9][A-Z]+', 'NaN', df)

returns:

station_id  year    month   element    1     2     3   4   5    6
216565       2008      7    SNOW       NaN    0     0   0   0   0 
216565       2008      8    SNOW        0     NaN    0   0   0   0 
216565       2008      9    SNOW        0     0     0   0   0

However, you would be better off letting e.g. Pandas or np.genfromtxt handle the invalid values automatically.

answered Feb 28, 2015 at 3:06

xnx

25.7k11 gold badges75 silver badges117 bronze badges

5 Comments

spotter Over a year ago

It expects a string or buffer with this and I have a DataFrame, would I need to convert to string first then with this method?

xnx Over a year ago

That would probably work, but it would be better to use pandas or numpy to handle the invalid values when you create the DataFrame or ndarray from a file. Is this possible?

spotter Over a year ago

You will have to excuse my illiteracy with this I have only been working with python and pandas for about 2 weeks now so I am not sure if it is possible. The response from Andy seemed to have worked though but I do appreciate your help.

Andy Over a year ago

@spotter I think @xnx is interpreting your data as a string output. pandas does have a replace method. I've updated my answer to also include how it would work with replace as well.

spotter Over a year ago

I see how I got off track trying to use replace now, both methods are very helpful though, and again, many thanks.

Katpoes · Accepted Answer · 2015-02-28 03:06:48Z

-1

from re import sub

string = "station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW 0TT 0 0 0 0 0 216565 2008 8 SNOW 0 0T 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0"

string = sub(r'\d{1}[A-Za-z]{1,2}', 'NaN', string)

print string

# station_id year month element 1 2 3 4 5 6 216565 2008 7 SNOW NaN 0 0 0 0 0 216565 2008 8 SNOW 0 NaN 0 0 0 0 216565 2008 9 SNOW 0 0 0 0 0 0

answered Feb 28, 2015 at 3:06

Katpoes

2091 silver badge11 bronze badges

1 Comment

Andy Over a year ago

What happened to the dataframe?

Collectives™ on Stack Overflow

Replace values of specific pattern with 'NaN' in csv file

3 Answers 3

1 Comment

5 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

5 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related