I have a CSV file that I am turning into a pandas DataFrame. One of the columns, a, is mostly filled with nonzero numbers but occasionally contains zeros. I would like to make a new column e that is the value in column a plus 500, EXCEPT where column a is 0 in that row; in those cases e should just stay 0. Alternatively, it would also work to set e = a + 500 and then turn every 500 in e back into a zero. Any help would be great; I'm new to using pandas/Python.
Hi! Welcome to Stack Overflow! Your question is too broad as is. Please read How to Ask and start trying yourself. Once you get stuck, ask here providing a minimal reproducible example! Thank you! – jkalden, Jan 22, 2016 at 13:38
I disagree with @jkalden, as the question is rather specific, but the title is very misleading. Please edit your title to something more specific, like "adding value to a column in a df in pandas python depending on another column's value". I agree you should add a minimal example that at least shows input and expected output. – Fabian Rost, Jan 22, 2016 at 13:48
3 Answers
Example data frame
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [100, 0, 200], 'b': [200, 500, 0]})
>>> df
     a    b
0  100  200
1    0  500
2  200    0
Solution
You can use where(), which keeps the values where the condition is True and fills in the second argument elsewhere, to build the new column efficiently:
>>> df['e'] = df['a'].where(df['a'] == 0, df['a'] + 500)
>>> df
     a    b    e
0  100  200  600
1    0  500    0
2  200    0  700
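For comparison, numpy.where produces the same column; this is just a sketch of an equivalent approach and assumes numpy is available:
>>> import numpy as np
>>> # np.where(cond, x, y) picks x where cond is True and y elsewhere
>>> df['e'] = np.where(df['a'] == 0, 0, df['a'] + 500)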
Performance
For a data frame with three million rows:
n = int(1e6)
df = pd.DataFrame({'a': [100, 0, 200] * n, 'b': [200, 500, 0] * n})
using map with a lambda (as suggested in another answer here) is pretty slow:
%timeit df['new_a'] = df['a'].astype('int').map(lambda x: x+500 if x != 0 else 0)
1 loops, best of 3: 2.5 s per loop
compared to using where():
%timeit df['e'] = df['a'].where(df['a'] == 0, df['a'] + 500)
10 loops, best of 3: 90.9 ms per loop
It is about 28 times faster.
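The question also mentions the alternative of computing a + 500 first and then turning the affected rows back into zeros. A boolean mask with .loc does that; this is a minimal sketch that masks on the original column a (rather than on e == 500) so the intent stays explicit:
>>> df['e'] = df['a'] + 500
>>> # rows where a was 0 get reset to 0
>>> df.loc[df['a'] == 0, 'e'] = 0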