Find string in a dataframe from a list in another dataframe

Question

I have 2 pandas dataframes in Python which are set up as follows:

Dataframe 1:
ID      Paragraph
1      'A B C D E'
2      'A F G H L'
3      'B J P Q W'
4      'G F D S A'

Where Paragraph is a string of multiple words.

Dataframe 2: 
ID      Name        Words
1      First      ['A', 'F']
2      Second     ['B', 'Z']
3      Third      ['P', 'Q']
4      Fourth     ['H', 'J']

Name is just a string identifying the Words. And Words is a list of strings.

So what I want to do is have an expression that will identify which Paragraphs in Dataframe 1 contain Words from Dataframe 2. And I want to store the Name of the Words in a new column in Dataframe 1. The new column will contain a list of all the Names where a Word from Words occurred in the paragraph. The order does not matter and there must be no duplicates in the list.

For example:

New Dataframe 1:
ID      Paragraph             Names
1      'A B C D E'       [First, Second]
2      'A F G H L'       [First, Fourth]
3      'B J P Q W'   [Second, Third, Fourth]
4      'G F D S A'           [First]

I can only come up with a solution that uses deeply nested for loops and takes a very long time to execute. Is there a solution with a shorter computation time? My thinking is that it might use loc and/or lambda functions.

Any help would be greatly appreciated!

Let me know if there is anything I need to clarify.

English is not my first language, so I can try to explain more if I need to.

Thank you

Here is the code for the dummy dataframes:

import pandas as pd

data_1 = {'Paragraph': ['A B C D E', 'A F G H L', 'B J P Q W', 'G F D S A']}
df_1 = pd.DataFrame(data_1)

data_2 = {'Name': ['First', 'Second', 'Third', 'Fourth'],
          'Words': [['A', 'F'], ['B', 'Z'], ['P', 'Q'], ['H', 'J']]}
df_2 = pd.DataFrame(data_2)

This is not the place where your task will be solved for you.
– Сергей Кох, Commented Oct 19, 2022 at 19:06
@СергейКох I can solve the problem, just not efficiently.
– TheAmazingDerik, Commented Oct 19, 2022 at 19:28

mozway · Accepted Answer · 2022-10-19 19:55:01Z

You can split Paragraph into words and explode it so that each word gets its own row. Then map each word to its Name using a lookup built from the exploded df_2. Finally, aggregate as a set per original row to get unique values:

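# Lookup Series: index = word, value = Name
s = df_2.explode('Words').set_index('Words')['Name']
# Split into words, one row per word, map each word to its Name,
# drop unmatched words, then collect unique Names per original row
df_1['Names'] = (df_1['Paragraph'].str.split()
                 .explode().map(s).dropna()
                 .groupby(level=0).agg(set)
                )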

output:

   Paragraph                    Names
0  A B C D E          {Second, First}
1  A F G H L          {Fourth, First}
2  B J P Q W  {Third, Second, Fourth}
3  G F D S A                  {First}
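
The question asks for lists rather than sets, and a paragraph that matches no word ends up as NaN with the code above. The following is a minimal sketch of one way to handle both, assuming each word appears under only one Name, as in the sample data. The helper name add_names and the extra 'X Y' paragraph are hypothetical and used only for illustration:

```python
import pandas as pd

df_1 = pd.DataFrame({'Paragraph': ['A B C D E', 'A F G H L', 'B J P Q W', 'G F D S A']})
df_2 = pd.DataFrame({'Name': ['First', 'Second', 'Third', 'Fourth'],
                     'Words': [['A', 'F'], ['B', 'Z'], ['P', 'Q'], ['H', 'J']]})


def add_names(paragraphs, lookup):
    """Hypothetical helper: return a copy with a 'Names' column of sorted lists."""
    # Lookup Series: index = word, value = Name (assumes each word is unique)
    word_to_name = lookup.explode('Words').set_index('Words')['Name']
    matched = (paragraphs['Paragraph'].str.split()
               .explode()             # one row per word, original index kept
               .map(word_to_name)     # word -> Name, NaN when there is no match
               .dropna()
               .groupby(level=0).agg(set)
               .map(sorted))          # each set becomes a sorted list
    out = paragraphs.copy()
    # Rows without any matching word are missing from `matched`; use [] for them
    out['Names'] = [matched.get(i, []) for i in out.index]
    return out


result = add_names(df_1, df_2)
# Hypothetical extra input: the second paragraph contains no listed word
mixed = add_names(pd.DataFrame({'Paragraph': ['A B', 'X Y']}), df_2)
```

Sorting is only there to make the result deterministic, since the question says the order does not matter.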
