0

I have a table (df1) with a list of values (neig_list, which is a python list) in each row

ID neig_list
1 a, b, d
2 b, e, f, g, h
3 b, a, j, k

And a table (df2) with entries for those values

neig samples samples_indicator
'a' 3 0.5
'a' 5 0.1
'b' 1 0.2
'c' 15 0.5
'd' 12 0.3
'a' 2 1
'e' 5 0.6
'f' 6 0
'h' 6 0.5

I need to add a column to df1 getting, for each row, the result for the sum of samples x samples_indicator for all neigs that are contained in the neig_list for that row.

For example, for the first row, we would have:

3*0.5 + 5*0.1 + 1*0.2 + 12*0.3 + 2*1 = 7.8

ID neig_list new_column
1 a, b, d 7.8
2 b, e, f, g, h value
3 b, a, j, k value

Actually, the function is more complicated than that (involves more columns), so ideally I'd like to have a separate function and then apply it to df1, based on df2.

3 Answers 3

1

Calculate first your math in df2:

map_ = df2.assign(neig = df2['neig'].str.strip("'"), 
                  calculated = lambda df: df['samples'] * df['samples_indicator'])\
          .groupby('neig')['calculated'].sum()

Then, explode your first df, and map the values above for 'a', 'b' etc with the calculated formula. Finally, groupby and sum:

df['new_column'] = df['neig_list'].str.split(', ').explode()\
                                  .map(map_)\
                                  .groupby(level=0)\  
                                  .sum()

   ID      neig_list  new_column
0   1        a, b, d         7.8
1   2  b, e, f, g, h         6.2
2   3     b, a, j, k         4.2
Sign up to request clarification or add additional context in comments.

Comments

1

You can just define a function that performs the calculations for a given list of neigs using df2 and then just apply it to neig_list in df1:

def result(row):
    return sum([df2['samples'][item]*df2['samples_indicator'][item] for item in row])

df1['new_column'] = df1['neig_list'].apply(result)

Note that this requires neig to be the index in df2. If it's not, you can do df2.set_index('neig', inplace=True) or, if you don't want to modify d2:

def result(row):
    return sum([df2.set_index('neig')['samples'][item]*df2.set_index('neig')['samples_indicator'][item] for item in row])

and apply it the same way as before.

Comments

1

Here's a way to do what your question asks:

def foo(df1, df2):
    return (df1
        .join(df1.assign(neig=df1.neig_list).explode('neig')
            .join(
                df2.assign(new_column=df2.samples * df2.samples_indicator)[['neig','new_column']].groupby('neig').sum(), 
                on='neig')
                .drop(columns=['neig','neig_list']).groupby('ID').sum(), 
            on='ID')
        )
print(foo(df1, df2))

Output:

   ID        neig_list  new_column
0   1        [a, b, d]         7.8
1   2  [b, e, f, g, h]         6.2
2   3     [b, a, j, k]         4.2

Explanation:

  • use assign() to add new_column as a column to df2 which, using groupby() and sum(), gets populated with the dot-product of samples and samples_indicator for the rows in each neig group
  • use assign() to clone the neig_list column of df1 as neig and explode() to expand each row to one row per item in the neig column
  • use join() on the above two DataFrame objects to put sample results from new_column into each row based on its neig value
  • use join() again with the above DataFrame object (after dropping the neig and neig_list columns) to add the desired column to the original df1.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.