I have two dataframes, as follows:

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

I want to generate a new dataframe that contains column B from dfb plus one column per element of A, holding the distance from each element of B to that element of A (for example, the Hamming distance from AC to AA is 1), like this:

   B    disB  disB disB
0  AC    1    2    1
1  BC    2    1    1
2  CC    2    2    0

This is the code I have tried (courtesy of other posts):

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

df_summary = dfb.copy()

for seq1 in dfa.A:
    df__ = []
    for seq2 in dfb.B:
        hd = sum(c1 != c2 for c1, c2 in zip(seq1, seq2))
        df__.append(hd)

    # The column name is the same on every pass, so each pass overwrites 'dis_B'
    df_summary['dis_B'] = df__
    print(df_summary)

This prints three separate outputs:

    B  dis_B
0  AC      1
1  BC      2
2  CC      2
    B  dis_B
0  AC      2
1  BC      1
2  CC      2
    B  dis_B
0  AC      1
1  BC      1
2  CC      0

but I need to combine them into one, like:

   B    disB  disB disB
0  AC    1    2    1
1  BC    2    1    1
2  CC    2    2    0

Thanks for your help!

2 Answers

A vectorized (read "much faster") solution:

a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
b = np.array(dfb['B'].str.split('').str[1:-1].tolist())

dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)

Output:

>>> dfb
    B  disB_1  disB_2  disB_3
0  AC       1       2       1
1  BC       2       1       1
2  CC       2       2       0
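
The broadcasting step above can be unpacked as follows. This is only a sketch, assuming every string has the same length. It builds the character arrays with `list(s)` instead of `str.split('')`, and it derives the column names from the number of rows in `dfa` rather than hardcoding them. The helper name `hamming_columns` and the `prefix` parameter are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))


def hamming_columns(dfa, dfb, prefix='disB_'):
    """Return a copy of dfb with one Hamming-distance column per row of dfa.

    Hypothetical helper; assumes every string has the same length.
    """
    # One row per string, one column per character:
    # a has shape (n_a, L) and b has shape (n_b, L).
    a = np.array([list(s) for s in dfa['A']])
    b = np.array([list(s) for s in dfb['B']])

    # b[:, None] has shape (n_b, 1, L). Comparing it with a broadcasts to
    # (n_b, n_a, L); summing over the last axis counts mismatched positions.
    dist = (a != b[:, None]).sum(axis=2)

    # Generate as many column names as there are rows in dfa.
    names = [f'{prefix}{i + 1}' for i in range(dist.shape[1])]
    distances = pd.DataFrame(dist, columns=names, index=dfb.index)
    return pd.concat([dfb, distances], axis=1)


result = hamming_columns(dfa, dfb)
print(result)
```

Returning a new dataframe instead of assigning into `dfb` is a design choice made here so that the inputs stay unchanged between calls.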

5 Comments

Really fast! One more question: if I have 1000 columns (disB_1, disB_2, ..., disB_1000), how do I name them automatically?
@Makunata This should do it: dfb = pd.concat([dfb, pd.DataFrame((a != b[:, None]).sum(axis=2)).add_prefix('disB_')], axis=1)
You can also label the columns with the strings used to calculate the Hamming distances. That may be a more natural naming convention, and almost certainly a more intuitive one, since the columns become self-documenting, just like the rows, which have the strings from 'B' as their index.
Naming is not essential here. I created them to explain the meaning of each element.
The suggestion in my comment was meant as a response to the question in your comment and was not specific to the answer by @richardec.

Here's an answer that gives a result in a slightly different form than the question frames things, but uses the values of 'A' and 'B' as the index and columns of the dataframe result, which may be more descriptive of the ultimate result:

import pandas as pd

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)

Output:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0
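
The one-line comprehension above can also be split into two small helpers, which may be easier to read and reuse. This is a sketch, not the answer's original code: the names `hamming` and `distance_matrix` are hypothetical, and because `zip` stops at the shorter string, equal-length strings are assumed.

```python
import pandas as pd


def hamming(s, t):
    """Count the positions at which two equal-length strings differ."""
    return sum(c != d for c, d in zip(s, t))


def distance_matrix(rows, cols):
    """Build a DataFrame of Hamming distances labeled by the strings themselves."""
    data = [[hamming(r, c) for c in cols] for r in rows]
    return pd.DataFrame(data, index=rows, columns=cols)


df = distance_matrix(['AC', 'BC', 'CC'], ['AA', 'BB', 'CC'])
print(df)

# Labels make individual distances easy to look up.
print(df.loc['AC', 'BB'])  # 2
```

Label-based lookup like `df.loc['AC', 'BB']` returns a single value only when the strings on each axis are unique.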

Here is a performance comparison between the above approach, which creates a general matrix, and the numpy solution shown in another answer, which uses hardcoded column names:

import pandas as pd
import numpy as np

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)


dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

def foo(dfa, dfb):
    df = pd.DataFrame(data=[[sum(c != d for c, d in zip(dfb['B'][i], dfa['A'][j])) for j in range(len(dfa['A']))] for i in range(len(dfb['B']))], index=dfb['B'], columns=dfa['A'])
    return df
    


def bar(dfa, dfb):
    a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
    b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
    dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
    return dfb

import timeit

print("\nGeneral matrix approach:")
t = timeit.timeit(lambda: foo(dfa, dfb), number = 100)
print(f"timeit: {t}")

print("\nHardcoded columns approach:")
t = timeit.timeit(lambda: bar(dfa, dfb), number = 100)
print(f"timeit: {t}")

Output and performance via timeit:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

General matrix approach:
timeit: 0.023536499997135252

Hardcoded columns approach:
timeit: 0.03922149998834357

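On this 3x3 input, the numpy approach takes about 1.7x as long as the general matrix approach in this answer (roughly 0.039 s versus 0.024 s for 100 calls). With so few strings, the timings are likely dominated by fixed overhead, so the comparison may not hold for larger inputs.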
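
Since the timings above come from a 3x3 input, it may be worth rerunning the comparison on larger data before drawing conclusions. The sketch below is one way to do that. The sizes, seeds, alphabet, and helper names (`random_strings`, `loop_matrix`, `numpy_matrix`) are all assumptions made for illustration. It times only the raw distance computation, without building dataframes, and no timings are quoted because they depend on the machine.

```python
import random
import timeit

import numpy as np


def random_strings(n, length, alphabet='ABC', seed=0):
    """Generate n pseudo-random strings of the given length (hypothetical test data)."""
    rng = random.Random(seed)
    return [''.join(rng.choice(alphabet) for _ in range(length)) for _ in range(n)]


def loop_matrix(a_strings, b_strings):
    """Pure-Python Hamming distances: one row per b string, one column per a string."""
    return np.array([[sum(c != d for c, d in zip(b, a)) for a in a_strings]
                     for b in b_strings])


def numpy_matrix(a_strings, b_strings):
    """Broadcasted Hamming distances; assumes all strings have equal length."""
    a = np.array([list(s) for s in a_strings])
    b = np.array([list(s) for s in b_strings])
    return (a != b[:, None]).sum(axis=2)


a_strings = random_strings(200, 10, seed=1)
b_strings = random_strings(200, 10, seed=2)

# Both approaches should agree before their timings are compared.
same = np.array_equal(loop_matrix(a_strings, b_strings),
                      numpy_matrix(a_strings, b_strings))
print("results agree:", same)

for name, fn in [("loop", loop_matrix), ("numpy", numpy_matrix)]:
    t = timeit.timeit(lambda: fn(a_strings, b_strings), number=5)
    print(f"{name}: {t:.4f} s for 5 runs")
```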

2 Comments

Actually, my dfb is a huge file. I was wondering whether there is a way to name things automatically, without spelling out the names as in your initial 'lists' definition.
@Makunata Can you please clarify a bit? It seems like you have multiple dfb "lists" in a file. Is there also an actual dfa "list" somewhere, and is there just one of these? I was picturing one axis of your results being a single dfb list and the other axis being a single dfa list. Is this correct, or is the shape of your data different?
