Iterating over dataframe and using replace method based on conditions

Question

I am attempting to iterate over a specific column in my dataframe.

The column is:

df['column'] = ['1.4million', '1,235,000','100million',NaN, '14million', '2.5mill']

I am trying to clean this column and eventually convert it all to integers for further work. I am stuck on the step that cleans out "million". I would like to replace "million" with five zeros when there is a decimal (i.e., 1.4million becomes 1.400000) and with six zeros when there is no decimal (i.e., 100million becomes 100000000).

To simplify, the first step I'm trying is to pick out just the values with a decimal and replace "million" in those with five zeros. I attempted to use np.where for this, but I cannot use the replace method with numpy.

I also attempted to use pd.DataFrame.where, but am getting an error:

for i,row in df.iterrows():
    df.at[i,'column'] = pd.DataFrame.where('.' in df.at[i,'column'],df.at[i,'column'].replace('million',''),df.at[i,'column'])

AttributeError: 'numpy.ndarray' object has no attribute 'replace'

I'm sure there is something I'm missing here. (I'm also sure that I'll be told that I don't need to use iterrows here, so I am open to suggestions on that as well.)

@AMC it's my natural instinct when looking to iterate over a df with conditions, though I'm seeing with the answers below that a for loop is unnecessary and time-consuming.
– AdamA, Commented Jan 27, 2020 at 2:03
@AdamA "it's my natural instinct when looking to iterate over a df with conditions" Gotta work on those instincts, then! You should take a look at the Pandas docs, I find them quite good.
– AMC, Commented Jan 27, 2020 at 2:17

Jon Clements · Accepted Answer · 2020-01-27 02:34:15Z

2

Given your sample data, it looks like you can strip out the commas, then capture everything (digits and . characters) up to the string mill or the end of the string, and split the two parts out, e.g.:

x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')

This'll give you:

         0        1
0      1.4  million
1  1235000      NaN
2      100  million
3      NaN      NaN
4       14  million
5      2.5     mill

Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, e.g.:

res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)

That'll give you:

0      1400000.0
1      1235000.0
2    100000000.0
3            NaN
4     14000000.0
5      2500000.0
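
For reference, here is a minimal end-to-end sketch of the two steps above, run on the sample data from the question. The final conversion to pandas' nullable Int64 dtype and the column_int name are assumptions added for illustration (the question mentions wanting integers eventually); they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})

# Strip thousands separators, then split each value into the number part
# and an optional "mill..." suffix
x = df['column'].str.replace(',', '', regex=False).str.extract(r'(.*?)(mill.*)?$')

# Multiply by one million where a suffix was captured, otherwise by 1
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)

# Assumption: whole numbers are wanted. Round away any float error, then
# use the nullable Int64 dtype so the missing value survives as <NA>.
df['column_int'] = res.round().astype('Int64')
print(df)
```

A plain astype(int) would raise on the missing value, which is why the nullable dtype is used in this sketch.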

edited Jan 27, 2020 at 2:34

answered Jan 26, 2020 at 17:44

Jon Clements

Shubham Shaswat · Answer · 2020-01-26 17:41:10Z

0

Try this:

df['column'].apply(lambda x : x.replace('million','00000'))

Make sure the column's dtype is string before applying this; a NaN value is a float and has no replace method, so the lambda will fail on it.
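
The question asks for five zeros when there is a decimal and six when there is not, which a single replace cannot do. Below is a rough vectorized sketch of that string-only idea using Series.str methods and Series.where instead of a loop. The variable names are made up for illustration, and it assumes at most one digit follows the decimal point (as in the sample data); with more decimal digits the zero count would be wrong, so multiplying numerically is the safer route.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1.4million', '1,235,000', '100million',
               np.nan, '14million', '2.5mill'])

# Normalise: drop commas and treat any trailing "mill..." as "million"
clean = (s.str.replace(',', '', regex=False)
          .str.replace(r'mill\w*$', 'million', regex=True))

# True where the value contains a decimal point (NaN counts as False)
has_decimal = clean.str.contains('.', regex=False, na=False)

# Assumption: only ONE digit follows the decimal point, so dropping the
# point and appending five zeros is the same as multiplying by a million
five = (clean.str.replace('.', '', regex=False)
             .str.replace('million', '00000', regex=False))
six = clean.str.replace('million', '000000', regex=False)

# Series.where keeps `five` where the condition holds and takes `six` elsewhere
result = five.where(has_decimal, six)
print(result)
```

The result is still a column of strings (with NaN preserved), so a pd.to_numeric call would be needed afterwards.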

answered Jan 26, 2020 at 17:41

Shubham Shaswat


2 Comments

AMC Over a year ago

Why would you not use the operations provided by Pandas?

AdamA Over a year ago

Thanks for your answer, however the issue had been that some of the cells with "million" needed 5 zeros and some needed 6 zeros.

ggaurav · Answer · 2020-01-26 17:51:07Z

0

For the given data:

df['column'].apply(lambda x: float(str(x).split('m')[0])*10**6
                   if 'million' in str(x) or 'mill' in str(x) else x)

If the column may contain many different forms of "million", use a regex search instead.
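
As a sketch of what such a regex search could look like: the pattern below and the to_number helper are hypothetical, and the spellings it accepts (m, mil, mill, million, millions, in any letter case) are an assumption rather than something taken from the asker's data.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical pattern: a number followed by m / mil / mill / million / millions
MILLION = re.compile(r'^\s*([\d.]+)\s*m(?:ill?(?:ion)?s?)?\s*$', re.IGNORECASE)

def to_number(value):
    """Convert one cell to a float; missing values stay NaN."""
    if pd.isna(value):
        return np.nan
    text = str(value).replace(',', '')   # drop thousands separators
    match = MILLION.match(text)
    if match:
        return float(match.group(1)) * 10**6
    return float(text)                   # plain number without a suffix

s = pd.Series(['1.4million', '1,235,000', '100million', np.nan,
               '14million', '2.5mill', '3 Mil', '7m'])
numbers = s.apply(to_number)
print(numbers)
```

Unlike the lambda above, this version also converts the comma-separated value to a number, and it raises a ValueError on text it does not recognise instead of passing it through unchanged.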

answered Jan 26, 2020 at 17:51

ggaurav
