Create new dataframe column based on values in multiple columns

Question

FYI, performance/speed is not important for this question.

I have an existing pandas dataframe named cost_table ...

+----------+---------+------+-------------------------+-----------------+
| material | percent | qty  | price_control_indicator | acct_assign_cat |
+----------+---------+------+-------------------------+-----------------+
| abc111   | 1.00    |   50 | v                       | #               |
| abc222   | 0.25    | 2000 | s                       | #               |
| xyz789   | 0.45    |    0 | v                       | m               |
| def456   | 0.9     |    0 | v                       | #               |
| 123xyz   | 0.2     |    0 | v                       | m               |
| lmo888   | 0.6     |    0 | v                       | m               |
+----------+---------+------+-------------------------+-----------------+

I need to add a field cost_source based on values in multiple fields.

Most answers that come up on google involve a list comprehension or a ternary operator but those only include logic based on a value in one column. For example,

cost_table['cost_source'] = ['map' if qty > 0 else None for qty in cost_table['qty']]

This works based on a value in one column, but I don't know how to expand this to include logic in multiple columns (or if it's even possible?). It also doesn't seem like a very readable/maintainable solution.

I tried using a for in loop with an if elif statement but the value in cost_table['cost_source'] remains unchanged and is None for all rows. But if I print each individual row within my loop then row['cost_source'] has the desired value.

d = {
  'material': ['abc111', 'abc222', 'xyz789', 'def456', '123xyz', 'lmo888'],
  'percent': [1, .25, .45, .9, .2, .6],
  'qty': [50, 2000, 0, 0, 0, 0],
  'price_control_indicator': ['v', 's','v', 'v', 'v', 'v'],
  'acct_assign_cat': ['#', '#', 'm', '#', 'm', 'm']
}

cost_table = pd.DataFrame(data=d)

cost_table['cost_source'] = None

for index, row in cost_table.iterrows():
  if (row['qty'] > 0) or (row['price_control_indicator'] == "s") or (row['acct_assign_cat'] == "#"):
    row['cost_source'] = "map"
  elif (row['percent'] >= 40) and (row['acct_assign_cat'] == "m"):
    row['cost_source'] = "vendor"
  else:
    row['cost_source'] = None

  print(row['cost_source']) # outputs map, vendor, or None as expected

print(cost_table)

Which outputs ...

+----------+---------+------+-------------------------+-----------------+-------------+
| material | percent | qty  | price_control_indicator | acct_assign_cat | cost_source |
+----------+---------+------+-------------------------+-----------------+-------------+
| abc111   | 1.00    |   50 | v                       | #               | None        |
| abc222   | 0.25    | 2000 | s                       | #               | None        |
| xyz789   | 0.45    |    0 | v                       | m               | None        |
| def456   | 0.9     |    0 | v                       | #               | None        |
| 123xyz   | 0.2     |    0 | v                       | m               | None        |
| lmo888   | 0.6     |    0 | v                       | m               | None        |
+----------+---------+------+-------------------------+-----------------+-------------+

And this is my desired result ...

+----------+---------+------+-------------------------+-----------------+-------------+
| material | percent | qty  | price_control_indicator | acct_assign_cat | cost_source |
+----------+---------+------+-------------------------+-----------------+-------------+
| abc111   | 1.00    |   50 | v                       | #               | map         |
| abc222   | 0.25    | 2000 | s                       | #               | map         |
| xyz789   | 0.45    |    0 | v                       | m               | vendor      |
| def456   | 0.9     |    0 | v                       | #               | map         |
| 123xyz   | 0.2     |    0 | v                       | m               | None        |
| lmo888   | 0.6     |    0 | v                       | m               | vendor      |
+----------+---------+------+-------------------------+-----------------+-------------+

This might help you: stackoverflow.com/questions/26886653/… As there are multiple columns to be passed in the function, you can pass them like df.apply(lambda x: fun(x[0], x[1], x[2])). The index depends on which all columns you want. — bazinga
– bazinga, Commented Feb 15, 2019 at 15:35

Ronald Luc · Accepted Answer · 2019-02-15 16:15:51Z

5

As @bazinga stated, use df.apply(lambda x: fun(x), but with parameter axis=1, so the lambda function is applied to row by row (default is column by column).

d = {
  'material': ['abc111', 'abc222', 'xyz789', 'def456', '123xyz', 'lmo888'],
  'percent': [100, 25, 45, 90, 20, 60],
  'qty': [50, 2000, 0, 0, 0, 0],
  'price_control_indicator': ['v', 's','v', 'v', 'v', 'v'],
  'acct_assign_cat': ['#', '#', 'm', '#', 'm', 'm']
}

cost_table = pd.DataFrame(data=d)

def process_row(row):
    if (row['qty'] > 0) or (row['price_control_indicator'] == "s") or (row['acct_assign_cat'] == "#"):
        return "map"
    elif (row['percent'] >= 40) and (row['acct_assign_cat'] == "m"):
        return "vendor"
    else:
        return None

cost_table['cost_source'] = cost_table.apply(lambda row: process_row(row), axis=1)

print(cost_table)

(I also corrected an inconsistency: in the data procents should be probably multiplied by 100)

edited Feb 15, 2019 at 16:15

answered Feb 15, 2019 at 16:09

Ronald Luc

1,2088 silver badges18 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Serge Ballesta Over a year ago

I think that this is rather idiomatic. But it should be noted that apply here will break what is called vectorization in pandas, because it will require to switch to the Python interpretor in each row to execute the function. But anyway it is still supposed to be much more efficient than iterrows.

Angelo Bovino Over a year ago

@Ronald Luc Thanks for your help. And good catch on the percents column!

Ronald Luc Over a year ago

@AngeloBovino Glad to help. np.select is more suited for this purpose, but DataFrame.apply is more general and clear for explanation.

ALollz · Accepted Answer · 2019-02-15 16:33:26Z

If you wish to use np.select

cond1 = cost_table.qty.gt(0) | cost_table.price_control_indicator.eq('s') | cost_table.acct_assign_cat.eq('#')
cond2 = cost_table.percent.ge(0.4) & cost_table.acct_assign_cat.eq('m')
cost_table['cost_source'] = np.select([cond1, cond2], ['map', 'vendor'], default='None')
print(cost_table)

  material  percent   qty price_control_indicator acct_assign_cat cost_source
0   abc111     1.00    50                       v               #         map
1   abc222     0.25  2000                       s               #         map
2   xyz789     0.45     0                       v               m      vendor
3   def456     0.90     0                       v               #         map
4   123xyz     0.20     0                       v               m        None
5   lmo888     0.60     0                       v               m      vendor

Collectives™ on Stack Overflow

Create new dataframe column based on values in multiple columns

2 Answers 2

3 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related