New columns not assigned in custom function (Python)

Question

My goal is to define a function that overwrites whatever input is given to it. It should add columns to the object and then merge it with a data frame defined within the function itself. I noticed that the columns I manually declare are being written on the object, but the columns that result from a merge are not being added.

This is what my data, df, looks like:

  col1                  col2
0    Q       V V V V V V V V
1    Q             V V V V V
2    Q       V V V V V V V V
3    Q   V V-- V V V V V V V
4    Q   V V V V V V V V V V

In this dummy example, I would like to write a custom function that adds a column full of ones to the input and then merges it with another data frame. Note that the function is not returning another object, but rather, it is overwriting the object that was fed to it.

def f(data):
    from pandas import DataFrame, merge  
    data['ones'] = 1
    temp = DataFrame({'col1':['C','Q','M'], 'col3':[14,15,30]})
    data = merge(data, temp, on='col1')

f(df)
  col1                  col2  ones
0    Q       V V V V V V V V     1
1    Q             V V V V V     1
2    Q       V V V V V V V V     1
3    Q   V V-- V V V V V V V     1
4    Q   V V V V V V V V V V     1

Why is the result of the merge not written back to df, while df['ones'] is?

Cameron Riddell · Accepted Answer · 2020-09-28 20:40:59Z

Item assignment in pandas happens in place, much like item assignment on a dictionary:

my_dict = {}
my_dict["ones"] = 1 # modifies the dictionary in place

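However, most pandas functions don't operate in place: they create a copy and return it. This is generally true even for methods that accept an inplace keyword argument. Setting inplace=True only mimics an in-place change, by first creating a modified copy and then replacing the original object's contents with it, rather than updating a subset of the data.

That is why your function behaves the way it does. The line data = merge(data, temp, on='col1') binds the local name data to the new, merged frame. The df outside the function still refers to the original object, which only received the ones column.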
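The same distinction can be seen without pandas at all. The following is a minimal sketch using a plain dictionary; the function names mutate and rebind are made up for illustration.

```python
def mutate(d):
    # Item assignment changes the object that the caller also holds.
    d["ones"] = 1


def rebind(d):
    # Builds a new dict and binds it to the local name only;
    # the caller's dict is left untouched.
    d = {**d, "col3": 15}


my_dict = {"col1": "Q"}
mutate(my_dict)
rebind(my_dict)
print(my_dict)  # {'col1': 'Q', 'ones': 1}
```

Only the key added by item assignment survives the call, which mirrors what happens to df['ones'] versus the merged col3.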

You can get the behavior you want by writing the merged columns back with item assignment, which does modify the original object. Change your function to read:

def inplace_merge(df1, df2, on):
    # Modifies df1 in place by assigning df2's columns one at a time.
    # Probably not as efficient as an actual merge.
    # Assumes the values in df2[on] are unique; rows of df1 with no
    # match in df2 get NaN (left-join behavior, unlike the default merge).
    df2 = df2.set_index(on).reindex(df1[on])
    for col in df2:
        df1[col] = df2[col].values


def f(data):
    from pandas import DataFrame
    data['ones'] = 1

    temp = DataFrame({'col1':['C','Q','M'], 'col3':[14,15,30]})
    inplace_merge(data, temp, on="col1")


f(df)

print(df)
  col1                 col2  ones  col3
0    Q      V V V V V V V V     1    15
1    Q            V V V V V     1    15
2    Q      V V V V V V V V     1    15
3    Q  V V-- V V V V V V V     1    15
4    Q  V V V V V V V V V V     1    15

However, I would strongly recommend against writing lots of functions that modify a single DataFrame in place. Pass around copies instead: pandas is designed for ease of use, not for low memory consumption. Other libraries, such as vaex, can work with DataFrame-like objects using zero-copy operations.
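As a rough sketch of the copy-passing style, the function below returns a new frame and leaves its input alone. The name add_columns and the small sample frame are made up for illustration, and how="left" is an assumption chosen so that every input row is kept; the question's original merge used the default inner join.

```python
import pandas as pd


def add_columns(data):
    # Hypothetical helper: returns a new DataFrame, never modifies `data`.
    temp = pd.DataFrame({"col1": ["C", "Q", "M"], "col3": [14, 15, 30]})
    # assign() returns a copy with the extra column; merge() returns
    # another new frame with col3 looked up by col1.
    return data.assign(ones=1).merge(temp, on="col1", how="left")


df = pd.DataFrame({"col1": ["Q", "Q"], "col2": ["V V", "V V V"]})
result = add_columns(df)

print(list(df.columns))      # the input keeps its original columns
print(list(result.columns))  # the returned frame carries the new ones
```

The caller decides whether to keep both frames or to overwrite the old name with df = add_columns(df).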

Tunahan A. · Accepted Answer · 2020-09-28 20:30:57Z

I think you didn't return the result from the function. Return the merged frame and assign it back to df when you call f:

def f(data):
    from pandas import DataFrame, merge
    data['ones'] = 1
    temp = DataFrame({'col1':['C','Q','M'], 'col3':[14,15,30]})
    data = merge(data, temp, on='col1')
    return data

df = f(df)
