My goal is to define a function that overwrites whatever input is given to it. It should add columns to the object and then merge it with a data frame defined within the function itself. I noticed that the columns I manually declare are being written on the object, but the columns that result from a merge are not being added.

This is what my data, df, looks like:

  col1                  col2
0    Q       V V V V V V V V
1    Q             V V V V V
2    Q       V V V V V V V V
3    Q   V V-- V V V V V V V
4    Q   V V V V V V V V V V

In this dummy example, I would like to write a custom function that adds a column full of ones to the input and then merges it with another data frame. Note that the function is not returning another object, but rather, it is overwriting the object that was fed to it.

def f(data):
    from pandas import DataFrame, merge  
    data['ones'] = 1
    temp = DataFrame({'col1':['C','Q','M'], 'col3':[14,15,30]})
    data = merge(data, temp, on='col1')
f(df)
  col1                  col2  ones
0    Q       V V V V V V V V     1
1    Q             V V V V V     1
2    Q       V V V V V V V V     1
3    Q   V V-- V V V V V V V     1
4    Q   V V V V V V V V V V     1

Why is the result from the merge not being written over df while the df['ones'] is?

2 Answers

Item assignment in pandas happens in place, just as it does for a dictionary:

my_dict = {}
my_dict["ones"] = 1  # modifies the dictionary in place

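However, most pandas functions do not operate in place: they build a new object and return it. merge is one of them, so data = merge(data, temp, on='col1') only rebinds the local name data to the new data frame. The df you passed in still refers to the original object, which is why it ends up with the ones column but not col3. The same is largely true of the functions that carry an inplace keyword argument. Setting inplace=True typically only mimics an actual "in place" change, by first creating a modified copy of the object and then replacing the original object with it, rather than updating a subset of the data.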
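The difference between changing an object and re-pointing a local name can be seen with plain dictionaries. This is only a minimal sketch; the function names mutate and rebind are made up for illustration and do not come from pandas.

```python
def mutate(d):
    # Item assignment changes the very object the caller passed in.
    d["ones"] = 1


def rebind(d):
    # This builds a new dict and points the local name `d` at it.
    # The caller's object is not touched, and the new dict is discarded
    # when the function returns.
    d = {**d, "col3": 15}


record = {"col1": "Q"}
mutate(record)
rebind(record)
print(record)  # {'col1': 'Q', 'ones': 1}
```

The question's function behaves the same way: data['ones'] = 1 is the mutate case, and data = merge(...) is the rebind case.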

You can get the result you want by writing the looked-up columns back onto the original object with item assignment, so that the object itself is modified:

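def inplace_merge(df1, df2, on):
    # Modifies df1 in place. Probably slower than an actual merge.
    # Unlike merge's default inner join, every row of df1 is kept;
    # keys missing from df2 get NaN. Assumes df2[on] has no duplicates.

    # Line df2's rows up with df1's key column, one row per row of df1
    df2 = df2.set_index(on).reindex(df1[on])
    for col in df2:
        # .values drops the index, so the assignment is positional
        df1[col] = df2[col].values


def f(data):
    from pandas import DataFrame
    data['ones'] = 1

    temp = DataFrame({'col1':['C','Q','M'], 'col3':[14,15,30]})
    inplace_merge(data, temp, on="col1")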


f(df)

print(df)
  col1                 col2  ones  col3
0    Q      V V V V V V V V     1    15
1    Q            V V V V V     1    15
2    Q      V V V V V V V V     1    15
3    Q  V V-- V V V V V V V     1    15
4    Q  V V V V V V V V V V     1    15

However, I would strongly recommend against writing a lot of functions that modify a single data frame in place. Pass around copies instead: pandas is designed for ease of use, not for a small memory footprint. Other libraries, such as vaex, can work with DataFrame-like objects using zero-copy functions.
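As a rough sketch of the "pass around copies" style, the function can build and return a new data frame and leave its input alone. The names add_columns and LOOKUP are hypothetical, the two-row df is made-up sample data, and how="left" is my own choice here (the question used merge's default inner join).

```python
import pandas as pd

# Lookup table from the question (the name LOOKUP is an assumption).
LOOKUP = pd.DataFrame({"col1": ["C", "Q", "M"], "col3": [14, 15, 30]})


def add_columns(data: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame; the caller's frame is left untouched."""
    # assign() returns a new DataFrame with the extra column.
    out = data.assign(ones=1)
    # how="left" keeps every row of `out`, even if col1 has no match.
    return out.merge(LOOKUP, on="col1", how="left")


# Made-up sample data in the shape of the question's df.
df = pd.DataFrame({"col1": ["Q", "Q"], "col2": ["V V", "V V V"]})
result = add_columns(df)

print(list(df.columns))      # ['col1', 'col2']
print(list(result.columns))  # ['col1', 'col2', 'ones', 'col3']
```

If you do want df itself to change, reassign at the call site with df = add_columns(df).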

I think the problem is that your function doesn't return the merged data frame. Return it and assign the result back to df:

def f(data):
    from pandas import DataFrame, merge
    data['ones'] = 1
    temp = DataFrame({'col1':['C','Q','M'], 'col3':[14,15,30]})
    data = merge(data, temp, on='col1')
    return data

df = f(df)
