I am having a lot of trouble joining two pandas data frames, because the merge should be based on a partial string match. More specifically:
I have a dataframe called df with about 10,000 rows that look like this:
{
"writtenAt": "2015-01-01T18:31:01+00:00",
"content":" India\u2019s banks will ramp up sales of bonds that act as capital buffers in 2015"
}
Now, I have another dataframe called compNames with about 500 rows, which looks like this:
{
"ticker": "A",
"name": "Agilent Technologies Inc.",
"keyword": "Agilent"
}
I am trying to assign a ticker value from compNames to the matching entry of df by the following mechanism:
1. Check whether any item from the entire column compNames['keyword'] is contained in an entry of df['content'].
2. If there is a match, return the matching word as a separate column of the df dataframe (e.g. df['matchedName']).
3. If there are multiple matches, create a list of matching words for the corresponding entry of df['content'].
4. Finally, join df and compNames using df['matchedName'] and compNames['keyword'] as the key variables.
What I have so far is:
import pandas as pd

# Load select company names
compNames = pd.read_csv("compNameList_LARA.txt")
compList = '|'.join(compNames['keyword'].tolist())
df['compMatch'] = df.content.str.contains(compList)
# drop unmatched articles
df = df[df['compMatch']==True]
# assign firm names
df['matchedName'] = df['content'].apply(lambda x: [x for x in compNames['keyword'].tolist() if x in df['content']])
However, when I do this, I get an empty list for every entry of df['matchedName'].
What went wrong?
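One likely cause, sketched under assumptions: inside the lambda, the comprehension variable x shadows the lambda argument x, so the test becomes `keyword in df['content']`. On a pandas Series, `in` checks the index labels rather than the values, so it is False for every keyword and each list comes out empty. The sketch below uses small made-up stand-in frames (the "Apple" rows and the extra articles are hypothetical, not from the real data) and assumes pandas 0.25 or later for `explode`.

```python
import pandas as pd

# Hypothetical stand-in data; the real frames come from the asker's files.
df = pd.DataFrame({
    "writtenAt": [
        "2015-01-01T18:31:01+00:00",
        "2015-01-02T09:00:00+00:00",
        "2015-01-03T09:00:00+00:00",
    ],
    "content": [
        "India\u2019s banks will ramp up sales of bonds in 2015",
        "Agilent and Apple both reported earnings",
        "Apple unveils a new phone",
    ],
})
compNames = pd.DataFrame({
    "ticker": ["A", "AAPL"],
    "name": ["Agilent Technologies Inc.", "Apple Inc."],
    "keyword": ["Agilent", "Apple"],
})

keywords = compNames["keyword"].tolist()

# `in` on a Series tests the index labels (0, 1, 2 here), not the values.
series_has_keyword = "Agilent" in df["content"]   # False
series_has_label = 0 in df["content"]             # True

# Compare each keyword against the row's own text instead of the whole column,
# and use distinct names so the lambda argument is not shadowed.
df["matchedName"] = df["content"].apply(
    lambda text: [kw for kw in keywords if kw in text]
)

# Drop articles without any match.
matched = df[df["matchedName"].map(len) > 0]

# One row per (article, keyword) pair, then join on the keyword.
result = matched.explode("matchedName").merge(
    compNames, left_on="matchedName", right_on="keyword", how="left"
)
print(result[["content", "matchedName", "ticker"]])
```

Exploding the list column before the merge is a design choice: an article that mentions two companies ends up as two rows, one per ticker. If one row per article is required, the list column can be kept as-is and the tickers looked up per list instead. Note also that the '|'.join(...) pattern passed to str.contains is treated as a regular expression, so keywords containing characters such as "." or "(" may need re.escape.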