How to merge for loop output dataframes into one with python?

Question

I have two dataframes:

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

I want to generate a new dataframe that keeps column B from dfb and adds one column per element of A, holding the Hamming distance between each element of B and that element of A (e.g. the Hamming distance from AC to AA is 1), like this:

   B    disB  disB disB
0  AC    1    2    1
1  BC    2    1    1
2  CC    2    2    0

The code I have tried (courtesy of other posts):

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

df_summary = dfb.copy()

for seq1 in dfa.A:
    df__ = []
    for seq2 in dfb.B:
        hd = sum(c1 != c2 for c1, c2 in zip(seq1, seq2))
        df__.append(hd)

    column = 'B'  # not defined in the original snippet; 'B' matches the output shown below
    df_summary['dis_{}'.format(column)] = pd.DataFrame({'dis_' + column: df__}).values
    print(df_summary)

This prints three separate outputs:

    B  dis_B
0  AC      1
1  BC      2
2  CC      2
    B  dis_B
0  AC      2
1  BC      1
2  CC      2
    B  dis_B
0  AC      1
1  BC      1
2  CC      0

but I need to combine them into one, like:

   B    disB  disB disB
0  AC    1    2    1
1  BC    2    1    1
2  CC    2    2    0

Thanks for your help!
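
The loop in the question overwrites the same column on every pass because the column name never changes. A minimal sketch of one way to keep the loop and still end up with a single dataframe, assuming it is acceptable to name each new column after the string from A (the `dis_AA` style names are an illustration, not something the question specifies):

```python
import pandas as pd

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

df_summary = dfb.copy()

for seq1 in dfa.A:
    # Hamming distance from seq1 to every string in column B
    distances = [sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) for seq2 in dfb.B]
    # A distinct name per pass adds a new column instead of overwriting one
    df_summary['dis_{}'.format(seq1)] = distances

print(df_summary)
```

Printing once after the loop, rather than inside it, gives one combined table.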

user17242583 · Accepted Answer · 2022-03-15 00:52:46Z

1

A vectorized (read "much faster") solution:

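import numpy as np

# Split each string into characters: one row per string, one column per position
a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
b = np.array(dfb['B'].str.split('').str[1:-1].tolist())

# Broadcast to shape (len(B), len(A), string length) and count mismatches per pair
dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)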

Output:

>>> dfb
    B  disB_1  disB_2  disB_3
0  AC       1       2       1
1  BC       2       1       1
2  CC       2       2       0
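
To make the broadcasting step easier to follow, here is a sketch that spells out the intermediate shapes. It assumes all strings have the same length and swaps the `str.split('')` trick for a plain `list(s)` per string, which yields the same character arrays; the names `mismatch` and `dist` are only for illustration.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

# One row per string, one column per character (assumes equal-length strings)
a = np.array([list(s) for s in dfa['A']])   # shape (3, 2)
b = np.array([list(s) for s in dfb['B']])   # shape (3, 2)

# b[:, None] has shape (3, 1, 2); comparing it with a broadcasts to (3, 3, 2),
# where mismatch[i, j, k] is True when b[i] and a[j] differ at position k
mismatch = a != b[:, None]

# Summing over the character axis counts the differing positions per pair
dist = mismatch.sum(axis=2)                 # shape (3, 3): rows follow B, columns follow A
print(dist)
```

Row i of `dist` belongs to the i-th string of B, which is why it can be assigned directly to new columns of dfb.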

answered Mar 15, 2022 at 0:52

user17242583


5 Comments

Makunata Over a year ago

Really fast! One more question: if I have 1000 columns (disB_1, disB_2, ..., disB_1000), how do I name them automatically?

user17242583 Over a year ago

@Makunata This should do it: dfb = pd.concat([dfb, pd.DataFrame((a != b[:, None]).sum(axis=2)).add_prefix('disB_')], axis=1)
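
A sketch of the automatic naming from the comment above, written out in full. One detail worth noting: `add_prefix` keeps the default integer labels, so the generated names start at `disB_0` rather than `disB_1`. The second variant builds 1-based names explicitly. Both assume dfb has its default 0..n-1 index so that `pd.concat` aligns the rows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

a = np.array([list(s) for s in dfa['A']])
b = np.array([list(s) for s in dfb['B']])
dist = (a != b[:, None]).sum(axis=2)

# add_prefix prepends to the default labels 0, 1, 2, giving disB_0, disB_1, disB_2
zero_based = pd.concat([dfb, pd.DataFrame(dist).add_prefix('disB_')], axis=1)

# Explicit names give the 1-based labels used in the answer, for any number of columns
names = [f'disB_{i}' for i in range(1, dist.shape[1] + 1)]
one_based = pd.concat([dfb, pd.DataFrame(dist, columns=names)], axis=1)

print(zero_based)
print(one_based)
```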

constantstranger Over a year ago

You can also simply label the columns with the strings used to calculate the Hamming distances. That may be a more natural naming convention, and it is almost certainly more intuitive, since the columns are then self-documenting, just like the rows, whose index holds the strings from 'B'.

Makunata Over a year ago

Naming is not essential here. I created them to explain the meaning of each element.

constantstranger Over a year ago

The suggestion in my comment was meant to respond to the question in your comment; it was not specific to the answer by @richardec.

constantstranger · Accepted Answer · 2022-03-15 01:17:45Z

1

Here's an answer that gives the result in a slightly different form than the question asks for: it uses the values of 'A' as the columns and the values of 'B' as the index of the resulting dataframe, which may be more descriptive:

import pandas as pd

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
# Rows follow 'B', columns follow 'A'; each cell is the Hamming distance of that pair
rows = [[sum(c != d for c, d in zip(b, a)) for a in lists['A']] for b in lists['B']]
df = pd.DataFrame(data=rows, index=lists['B'], columns=lists['A'])
print(df)

Output:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

Here is a performance comparison between the approach above, which builds a general matrix, and the numpy solution from another answer, which uses hardcoded column names:

import pandas as pd
import numpy as np

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)


dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

def foo(dfa, dfb):
    df = pd.DataFrame(data=[[sum(c != d for c, d in zip(dfb['B'][i], dfa['A'][j])) for j in range(len(dfa['A']))] for i in range(len(dfb['B']))], index=dfb['B'], columns=dfa['A'])
    return df
    


def bar(dfa, dfb):
    a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
    b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
    dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
    return dfb

import timeit

print("\nGeneral matrix approach:")
t = timeit.timeit(lambda: foo(dfa, dfb), number = 100)
print(f"timeit: {t}")

print("\nHardcoded columns approach:")
t = timeit.timeit(lambda: bar(dfa, dfb), number = 100)
print(f"timeit: {t}")

Output and performance via timeit:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

General matrix approach:
timeit: 0.023536499997135252

Hardcoded columns approach:
timeit: 0.03922149998834357

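This seems to show that, on this 3x3 input, the numpy approach takes about 1.7x as long as the general matrix approach in this answer (roughly 0.039 s vs. 0.024 s for 100 runs). With so few strings, the timing mostly reflects fixed per-call overhead, so it does not indicate how either approach scales to larger inputs.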
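
Since the question mentions a large dfb, it may be worth repeating the measurement on bigger inputs. The harness below is a sketch under stated assumptions: the function names, the random test strings, and the sizes `n` and `k` are all hypothetical, and the numpy variant is generalized to take labels from the inputs instead of hardcoded column names. No timings are quoted here, because they depend on the machine and on the input size.

```python
import random
import timeit

import numpy as np
import pandas as pd


def hamming_loops(seqs_a, seqs_b):
    """Pure-Python nested loops, as in the general matrix approach."""
    data = [[sum(c != d for c, d in zip(sb, sa)) for sa in seqs_a] for sb in seqs_b]
    return pd.DataFrame(data, index=seqs_b, columns=seqs_a)


def hamming_numpy(seqs_a, seqs_b):
    """Broadcast comparison, labeled by the input strings (equal lengths assumed)."""
    a = np.array([list(s) for s in seqs_a])
    b = np.array([list(s) for s in seqs_b])
    return pd.DataFrame((a != b[:, None]).sum(axis=2), index=seqs_b, columns=seqs_a)


# Hypothetical test data: n random strings of length k over the alphabet "ABC"
random.seed(0)
n, k = 200, 10
seqs_a = [''.join(random.choices('ABC', k=k)) for _ in range(n)]
seqs_b = [''.join(random.choices('ABC', k=k)) for _ in range(n)]

# Check that both functions agree before comparing their timings
assert (hamming_loops(seqs_a, seqs_b).to_numpy() == hamming_numpy(seqs_a, seqs_b).to_numpy()).all()

for fn in (hamming_loops, hamming_numpy):
    t = timeit.timeit(lambda: fn(seqs_a, seqs_b), number=5)
    print(f"{fn.__name__}: {t:.4f} s for 5 runs")
```

Note that the broadcast comparison materializes a boolean array of len(B) x len(A) x k elements, so for very large inputs memory use may become the limiting factor and processing dfb in chunks could be necessary.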

edited Mar 15, 2022 at 1:17

answered Mar 15, 2022 at 1:04

constantstranger


2 Comments

Makunata Over a year ago

Actually, my dfb is a huge file. I was wondering whether there is a way to build the 'lists' you define at the start automatically, without specifying the names by hand.

constantstranger Over a year ago

@Makunata Can you please clarify a bit? It seems like you have multiple dfb "lists" in a file. Is there also an actual dfa "list" somewhere, and is there just one of these? I was picturing one axis of your results being a single dfb list and the other axis being a single dfa list. Is this correct, or is the shape of your data different?
