I am attempting to iterate over a specific column in my dataframe.

The column is:

df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']

I am trying to clean this column and eventually convert it all to integers for further work. I am stuck on the step that cleans out "million". I would like to replace "million" with five zeros when there is a decimal (i.e. 1.4million becomes 1.400000) and with six zeros when there is no decimal (i.e. 100million becomes 100000000).

To simplify, the first step I'm trying is to filter out just the values with a decimal and replace "million" in those with five zeros. I have attempted to use np.where for this, but I cannot use the replace method with numpy.

I also attempted to use pd.DataFrame.where, but am getting an error:

for i,row in df.iterrows():
    df.at[i,'column'] = pd.DataFrame.where('.' in df.at[i,'column'],df.at[i,'column'].replace('million',''),df.at[i,'column'])

AttributeError: 'numpy.ndarray' object has no attribute 'replace'

I'm sure there is something I'm missing here. (I'm also sure that I'll be told that I don't need to use iterrows here, so I am open to suggestions on that as well.)

  • what about 'mill' then? Commented Jan 26, 2020 at 17:43
  • Why are you using a for loop for this? Commented Jan 26, 2020 at 18:26
  • @AMC it's my natural instinct when looking to iterate over a df with conditions, though I'm seeing with the answers below that a for loop is unnecessary and time-consuming. Commented Jan 27, 2020 at 2:03
  • @AdamA "it's my natural instinct when looking to iterate over a df with conditions" Gotta work on those instincts, then! You should take a look at the Pandas docs, I find them quite good. Commented Jan 27, 2020 at 2:17

3 Answers

Given your sample data, it looks like you can strip out the commas, then capture everything (digits and . characters) up to the string mill or the end of the string, and split the two parts out, e.g.:

x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')

This'll give you:

         0        1
0      1.4  million
1  1235000      NaN
2      100  million
3      NaN      NaN
4       14  million
5      2.5     mill

Then take the number part and multiply it by a million where there's something in column 1, otherwise multiply it by 1, e.g.:

res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)

That'll give you:

0      1400000.0
1      1235000.0
2    100000000.0
3            NaN
4     14000000.0
5      2500000.0
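
Since the goal in the question is integers, the two steps above can be put together roughly as follows. This is a minimal sketch, not a definitive implementation: the column name `column_int` is made up for illustration, and the cast assumes a pandas version that supports the nullable `Int64` dtype, so that the missing value can stay missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})

# Strip thousands separators, then split each value into a numeric part
# and an optional "mill..." suffix.
parts = df['column'].str.replace(',', '', regex=False).str.extract(r'(.*?)(mill.*)?$')

# Scale by one million only where a suffix was captured.
res = pd.to_numeric(parts[0]) * np.where(parts[1].notna(), 1_000_000, 1)

# round() guards against floating-point error before the cast;
# 'column_int' is a hypothetical name, and Int64 keeps the NaN row as <NA>.
df['column_int'] = res.round().astype('Int64')
```

The plain `int64` dtype cannot hold missing values, which is why the sketch uses the nullable `Int64` instead of `astype(int)`.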

Try this:

df['column'].apply(lambda x : x.replace('million','00000'))

Make sure the column's dtype is string before applying this, since a NaN (which is a float) has no replace method.
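
If some values need five zeros and others six, as the question describes, a vectorized variant might look like the sketch below. It assumes every decimal value has exactly one digit after the point (as in the sample data); a value like 1.45million would come out wrong, so this is a fragile illustration of the string approach rather than a recommendation. The names `has_decimal`, `five`, `six`, and `cleaned` are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill'])

# True where the value contains a literal '.', False for NaN.
has_decimal = s.str.contains('.', regex=False, na=False)

# Assumption: exactly one digit follows the decimal point, so five zeros
# plus dropping the '.' is equivalent to multiplying by a million.
five = s.str.replace(r'mill(?:ion)?$', '00000', regex=True).str.replace('.', '', regex=False)
six = s.str.replace(r'mill(?:ion)?$', '000000', regex=True)

# Keep the five-zero version where there was a decimal, else the six-zero one.
cleaned = five.where(has_decimal, six).str.replace(',', '', regex=False)
numbers = pd.to_numeric(cleaned)
```

The `.str` accessor skips NaN values instead of raising, which is what the plain `apply` with `x.replace` cannot do.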

2 Comments

Why would you not use the operations provided by Pandas?
Thanks for your answer, however the issue had been that some of the cells with "million" needed 5 zeros and some needed 6 zeros.

For the given data:

# 'mill' also matches 'million', so one substring check covers both forms
df['column'].apply(lambda x: float(str(x).split('m')[0]) * 10**6
                   if 'mill' in str(x) else x)

If the column may contain many different forms of "million", use a regex search instead.
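
A regex version could look something like this. It is only a sketch: the set of suffixes it accepts (m, mil, mill, million, in any case, with optional whitespace before them) is an assumption for illustration and is not taken from the question's data, and the sample series below is likewise made up to exercise those forms.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical sample with several spellings of "million".
s = pd.Series(['1.4million', '1,235,000', '100 Million', np.nan, '14mil', '2.5m'])

# Assumed pattern: a number, optional whitespace, then an optional
# suffix that is one of m / mil / mill / million.
pattern = r'^(?P<num>[\d.]+)\s*(?P<suffix>m(?:il(?:l(?:ion)?)?)?)?$'

parts = s.str.replace(',', '', regex=False).str.extract(pattern, flags=re.IGNORECASE)

# Multiply by a million only where a suffix was found.
result = pd.to_numeric(parts['num']) * np.where(parts['suffix'].notna(), 10**6, 1)
```

Values that do not fit the pattern at all end up as NaN in both groups, so they show up as NaN in the result and can be inspected separately.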
