I have two pandas DataFrames in Python, set up as follows:

Dataframe 1:
ID      Paragraph
1      'A B C D E'
2      'A F G H L'
3      'B J P Q W'
4      'G F D S A'

Where Paragraph is a string of multiple words.

Dataframe 2: 
ID      Name        Words
1      First      ['A', 'F']
2      Second     ['B', 'Z']
3      Third      ['P', 'Q']
4      Fourth     ['H', 'J']

Name is just a string identifying the Words, and Words is a list of strings.

I want an expression that identifies which Paragraphs in Dataframe 1 contain Words from Dataframe 2, and stores the matching Names in a new column in Dataframe 1. The new column should contain a list of all the Names for which at least one Word from Words occurs in the paragraph. The order does not matter, and there must be no duplicates in the list.

For example:

New Dataframe 1:
ID      Paragraph             Names
1      'A B C D E'       [First, Second]
2      'A F G H L'       [First, Fourth]
3      'B J P Q W'   [Second, Third, Fourth]
4      'G F D S A'           [First]

The only solution I have come up with uses deeply nested for loops and takes a very long time to execute. Is there a solution with a shorter computation time? My thinking is that loc and/or lambda functions might help.

Any help would be greatly appreciated!

Let me know if there is anything I need to clarify.

English is not my first language, so I can try to explain more if needed.

Thank you

Here is the code for the dummy dataframes:

import pandas as pd

data_1 = {'Paragraph': ['A B C D E', 'A F G H L', 'B J P Q W', 'G F D S A']}
df_1 = pd.DataFrame(data_1)

data_2 = {'Name': ['First', 'Second', 'Third', 'Fourth'],
          'Words': [['A', 'F'], ['B', 'Z'], ['P', 'Q'], ['H', 'J']]}
df_2 = pd.DataFrame(data_2)
  • Can you provide the DataFrame constructor? Commented Oct 19, 2022 at 18:59
  • This is not the place where your task will be solved for you. Commented Oct 19, 2022 at 19:06
  • I am reading the dataframe from a CSV, so no constructor Commented Oct 19, 2022 at 19:18
  • @СергейКох I can solve the problem just not efficiently Commented Oct 19, 2022 at 19:28
  • @scotscotmcc I added the code for the dummy dataframes Commented Oct 19, 2022 at 19:41

1 Answer

You can split and explode Paragraph, then map each word to its Name using a lookup built from the exploded df_2. Finally, aggregate as a set to keep only unique values:

s = df_2.explode('Words').set_index('Words')['Name']
df_1['Names'] = (df_1['Paragraph'].str.split()
                 .explode().map(s).dropna()
                 .groupby(level=0).agg(set)
                )

output:

   Paragraph                    Names
0  A B C D E          {Second, First}
1  A F G H L          {Fourth, First}
2  B J P Q W  {Third, Second, Fourth}
3  G F D S A                  {First}
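
Two edge cases are worth keeping in mind with the map-based version: a paragraph with no matching word ends up as NaN in Names, and, as far as I know, map expects the lookup Series to have a unique index, so it may fail if the same word is listed under more than one Name. Below is a minimal sketch of a merge-based variant that tolerates both. The extra 'Fifth' entry and the 'X Y' paragraph are hypothetical additions to the dummy data, included only to exercise those cases, and the names lookup, words and names are my own.

```python
import pandas as pd

# Dummy data from the question, plus two hypothetical additions:
# 'X Y' matches nothing, and 'Fifth' reuses the word 'A'.
df_1 = pd.DataFrame({'Paragraph': ['A B C D E', 'A F G H L',
                                   'B J P Q W', 'G F D S A', 'X Y']})
df_2 = pd.DataFrame({'Name': ['First', 'Second', 'Third', 'Fourth', 'Fifth'],
                     'Words': [['A', 'F'], ['B', 'Z'], ['P', 'Q'],
                               ['H', 'J'], ['A']]})

# One row per (Word, Name) pair; repeated words are fine for a merge.
lookup = df_2.explode('Words').rename(columns={'Words': 'Word'})[['Word', 'Name']]

# One row per (paragraph row label, word).
words = (df_1['Paragraph'].str.split().explode()
         .rename('Word').rename_axis('row').reset_index())

# The inner join keeps only words that appear in the lookup;
# sorted(set(...)) removes duplicates and gives a stable order.
names = (words.merge(lookup, on='Word')
         .groupby('row')['Name']
         .agg(lambda x: sorted(set(x))))

# Paragraphs without any match are absent from `names`; use an empty list.
df_1['Names'] = pd.Series([names.get(i, []) for i in df_1.index],
                          index=df_1.index)
```

This stores sorted lists rather than sets, which matches the list output asked for in the question. I have not benchmarked it against the map version, so treat the relative speed as unknown.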