I have two dataframes (A and B). I want to compare each string in A against the strings in B and find where the A string either matches a string in B or is contained in one. Then I want to count the number of times each row of A was matched or contained in B.

    Dataframe A
 
0   "4012, 4065, 4682"
1   "4712, 2339, 5652, 10007"
2   "4618, 8987"
3   "7447, 4615, 4012"
4   "6515"
5   "4065, 2339, 4012"


    Dataframe B

0   "6515, 4012, 4618, 8987"         <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1   "4065, 5116, 2339, 8757, 4012"   <- matches (DF A, Index 5) (4065, 2339, 4012)
2   "1101"
3   "6515"                           <- matches (DF A, Index 4) (6515)
4   "4012, 4615, 7447"               <- matches (DF A, Index 3) (7447, 4615, 4012)
5   "7447, 6515, 4012, 4615"         <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)


    Desired Output:

    Itemset             Count

2   4618, 8987            1
3   7447, 4615, 4012      2
4   6515                  3
5   4065, 2339, 4012      1

Basically, I want to count when there is a direct match of A in B (with the values in any order) or when all of A's values are contained in a longer B string (again in any order). My goal is to count how many times A is being validated by B. These are all strings, by the way.
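
In other words, treating each string as a set of tokens, a row of A counts against a row of B whenever A's token set is a subset of B's. A minimal check for a single pair of strings (plain Python sets, just to illustrate the rule) looks like this:

a = set("4618, 8987".split(', '))
b = set("6515, 4012, 4618, 8987".split(', '))
print(a <= b)  # True: every token of the A row appears in the B row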

EDIT (need for speed edition): This is a follow-up to my previous question: Compare two dataframe columns for matching strings or are substrings then count in pandas

I have millions of rows in both dfA and dfB to make these comparisons against. In my previous post, the following code got the job done:

import pandas as pd

dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012",],
                    columns=['values'])

dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                    columns=['values'])

# split the comma-separated strings into lists of tokens
dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')

# for each row of A, count the B rows that contain every one of its values
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]
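
For the sample frames above, this reproduces the desired counts (rows 0 and 1 of dfA match nothing in dfB):

print(dfA[['values', 'overlap_A']])
# 0  4012, 4065, 4682          0
# 1  4712, 2339, 5652, 10007   0
# 2  4618, 8987                1
# 3  7447, 4615, 4012          2
# 4  6515                      3
# 5  4065, 2339, 4012          1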

However, with the total number of rows to check, I am running into a performance problem and need a faster way to compute these counts. It seems like NumPy is needed here, but the snippet below is about the extent of my NumPy knowledge, as I work primarily in pandas. Does anyone have suggestions to make this faster?

dfA_array = dfA['values_list'].to_numpy()
dfB_array = dfB['values_list'].to_numpy()
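
One cheap improvement along the same lines, sketched below, is to build each row's set once, so that membership tests are hash lookups instead of list scans:

setsB = [set(tokens) for tokens in dfB['values_list']]   # build each B set once
dfA['overlap_A'] = [sum(rowset <= b for b in setsB)      # subset test per B row
                    for rowset in (set(r) for r in dfA['values_list'])]

This is still quadratic in the number of rows, though, so it only buys a constant factor.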

1 Answer

Give this a try. Your algorithm is O(N²K): the square of the row count times the words per line. The version below should improve that to O(NK).

from collections import defaultdict
from functools import reduce

# inverted index: token -> set of dfB row indices containing that token
d = defaultdict(set)
for i, t in enumerate(dfB['values']):
    for s in t.split(', '):
        d[s].add(i)

# a dfB row contains every token of a dfA row exactly when its index survives
# the intersection of all the tokens' posting sets; a token absent from dfB
# yields an empty set (defaultdict inserts one), so the count is correctly 0
dfA['count'] = dfA['values'].apply(
    lambda x: len(reduce(lambda a, b: a & b,
                         [d[s] for s in x.split(', ')])))
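
On the sample frames, intersecting the posting sets reproduces the desired counts; for example, for row 3 of dfA:

print(d['7447'] & d['4615'] & d['4012'])
# {4, 5} -> dfA row 3 is contained in two dfB rows, so its count is 2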