Merge two pandas dataframes based on partial string match

Question

I am having a lot of trouble joining two pandas data frames, because the merge should be based on a partial string match. More specifically:

I have a dataframe called df with about 10,000 rows that look like this:

{
    "writtenAt": "2015-01-01T18:31:01+00:00",
    "content":" India\u2019s banks will ramp up sales of bonds that act as capital buffers in 2015"
}

Now, I have another dataframe called compNames with about 500 rows, which looks like this:

{
    "ticker": "A",
    "name": "Agilent Technologies Inc.",
    "keyword": "Agilent"
}

I am trying to assign a ticker value from compNames to the matching entry of df by the following mechanism:

1. Check whether any item in the column compNames['keyword'] is contained in an entry of df['content'].
2. If there is a match, return the matching word in a separate column of df (e.g. df['matchedName']).
3. If there are multiple matches, store a list of the matching words for the corresponding entry of df['content'].
4. Finally, join df and compNames using df['matchedName'] and compNames['keyword'] as the key variables.

What I have so far is:

# Load select company names
compNames = pd.read_csv("compNameList_LARA.txt")
compList = '|'.join(compNames['keyword'].tolist())
df['compMatch'] = df.content.str.contains(compList)

# drop unmatched articles
df = df[df['compMatch']==True]

# assign firm names
df['matchedName'] = df['content'].apply(lambda x: [x for x in compNames['keyword'].tolist() if x in df['content']])

However, when I do this, I get an empty list in every row of df['matchedName'].

What went wrong?

Please provide a reproducible pandas example including desired output. From what I can tell, these two example datasets have no matches, so the result wouldn't be interesting. Add some more data that includes matches to clarify the question.
– wjandrea, Commented Feb 27 at 22:29

Jin Lee · Accepted Answer · 2016-10-30 17:52:12Z

7

Figured it out. I just needed to do:

df['content'] = df['content'].str.lower().str.split()
df['matchedName'] = df['content'].apply(lambda x: [item for item in x if item in compNames['keyword'].tolist()])
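
For context on why the original attempt returned empty lists: inside the lambda, the comprehension variable x shadows the lambda argument, and `x in df['content']` tests membership against the Series index labels rather than the text of the row. The two lines above avoid that, but they only match whole lowercase tokens, so a keyword such as "Agilent" would presumably also need lowercasing. Below is a minimal sketch of one alternative that keeps the text intact, matches case-insensitively, and then performs the join described in the question. The second headline and the "Apple" row are invented sample data, and the helper names pattern, matched, lookup and result are hypothetical; this is not the asker's method.

```python
import re
import pandas as pd

# Hypothetical sample data: the second headline and the Apple row are
# invented so that at least one match exists.
df = pd.DataFrame({
    "writtenAt": ["2015-01-01T18:31:01+00:00", "2015-01-02T09:00:00+00:00"],
    "content": [
        "India\u2019s banks will ramp up sales of bonds that act as capital buffers in 2015",
        "Agilent and Apple both report quarterly earnings",
    ],
})
compNames = pd.DataFrame({
    "ticker": ["A", "AAPL"],
    "name": ["Agilent Technologies Inc.", "Apple Inc."],
    "keyword": ["Agilent", "Apple"],
})

# One alternation pattern for all keywords; re.escape protects keywords
# that contain regex metacharacters such as "." or "+".
pattern = "|".join(re.escape(k) for k in compNames["keyword"])

# For each row, collect every keyword occurrence as a list (empty if none).
df["matchedName"] = df["content"].str.findall(pattern, flags=re.IGNORECASE)

# Drop unmatched articles, then produce one row per matched keyword.
matched = (
    df[df["matchedName"].str.len() > 0]
    .explode("matchedName")
    .reset_index(drop=True)
)

# Join on a lowercased key so that "agilent" and "Agilent" are treated alike.
matched = matched.assign(key=matched["matchedName"].str.lower())
lookup = compNames.assign(key=compNames["keyword"].str.lower())
result = matched.merge(lookup, on="key", how="left").drop(columns="key")

print(result[["writtenAt", "matchedName", "ticker"]])
```

Note that this is a plain substring match, so a keyword like "Apple" would also hit "Pineapple"; wrapping each escaped keyword in `\b...\b` is one way to restrict it to whole words if that matters for the data.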

