
I have two DataFrames: df_mentions, which holds URLs, and media, which holds information about some journals. I need to keep updating df_mentions with the information contained in media.

import pandas as pd

Mentions=['https://www.lemonde.fr/football/article/2019/07/08/coupe-du-monde-feminine-2109-au-sein-de-chaque-equipe-j-ai-vu-de-grandes-joueuses_5486741_1616938.html','https://www.telegraph.co.uk/world-cup/2019/06/12/womens-world-cup-2019-groups-complete-guide-teams-players-rankings/','https://www.washingtonpost.com/sports/dcunited/us-womens-world-cup-champs-arrive-home-ahead-of-parade/2019/07/08/48df1a84-a1e3-11e9-a767-d7ab84aef3e9_story.html?utm_term=.8f474bba8a1a']
Date=['08/07/2019','08/07/2019','08/07/2019']
Publication=['','','']
Country=['','','']
Foundation=['','','']
Is_in_media=['','','']
df_mentions=pd.DataFrame()
df_mentions['Mentions']=Mentions
df_mentions['Date']=Date
df_mentions['Publication']=Publication
df_mentions['Country']=Country
df_mentions['Foundation']=Foundation
df_mentions['Is_in_media']=Is_in_media

Source=['New York times','Lemonde','Washington Post']
Link=['https://www.nytimes.com/','https://www.lemonde.fr/','https://www.washingtonpost.com/']
Country=['USA','France','USA']
Foundation=['1851','1944','1877']
media=pd.DataFrame()
media['Source']=Source
media['Link']=Link
media['Country']=Country
media['Foundation']=Foundation
media

They look like this (but with nearly 1000 rows daily):

[image: df_mentions]

[image: media]

I need to check whether the source of each link is contained in media and, if it is, extract the data from media to fill df_mentions and obtain the following result:

Expected: [image: expected result]

And what I have done is:

for index in range(0,len(media)):
    for index2 in range(0,len(df_mentions)):
        if str(media['Link'][index])in str(df_mentions['Mentions'][index2]):
            df_mentions['Publication'][index2]=media['Source'][index]
            df_mentions['Country'][index2]=media['Country'][index]
            df_mentions['Foundation'][index2]=media['Foundation'][index]
            df_mentions['Is_in_media'][index2]='Yes'
        else:
            df_mentions['Is_in_media'][index2]='No'
df_mentions

But it works in my notebook only once; if I close the notebook and reopen it, it gives me errors. I'm using pandas 0.24.0. Is there a better way to do this that is guaranteed to work every time?

Thanks in advance! All help will be greatly appreciated!

1 Answer

One thing you can do is extract the base URL from each link in df_mentions and use it as the key for a merge.

Starting data (removed the empty columns in df_mentions):

print(df_mentions)
                                            Mentions        Date
0  https://www.lemonde.fr/football/article/2019/0...  08/07/2019
1  https://www.telegraph.co.uk/world-cup/2019/06/...  08/07/2019
2  https://www.washingtonpost.com/sports/dcunited...  08/07/2019

print(media)
            Source                             Link Country Foundation
0   New York times         https://www.nytimes.com/     USA       1851
1          Lemonde          https://www.lemonde.fr/  France       1944
2  Washington Post  https://www.washingtonpost.com/     USA       1877

Create a new column containing the base URL (scheme and host, up to the first slash after the host):

df_mentions['url'] = df_mentions['Mentions'].str.extract(r'(http[s]?:\/\/.+?\/)')

   Mentions                                   Date        url
0  https://www.lemonde.fr/football/articl...  08/07/2019  https://www.lemonde.fr/
1  https://www.telegraph.co.uk/world-cup/...  08/07/2019  https://www.telegraph.co.uk/
2  https://www.washingtonpost.com/sports/...  08/07/2019  https://www.washingtonpost.com/
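The regular expression above only matches when a slash follows the host name, so a mention such as https://www.lemonde.fr with no trailing slash would get NaN as its key. If that case can occur in your data, one possible alternative is to build the key with the standard library's urllib.parse. This is only a sketch: the helper name base_url is made up for illustration, and it assumes the Link values in media always have the form scheme://host/.

```python
from urllib.parse import urlparse

import pandas as pd


def base_url(url):
    """Return 'scheme://host/' for a URL, or None if it has no scheme or host.

    Hypothetical helper, not part of pandas or of the original answer.
    """
    parts = urlparse(url)
    if not parts.scheme or not parts.netloc:
        return None
    return "{}://{}/".format(parts.scheme, parts.netloc)


df_mentions = pd.DataFrame({
    "Mentions": [
        "https://www.lemonde.fr/football/article/2019/07/08/coupe-du-monde-feminine-2109-au-sein-de-chaque-equipe-j-ai-vu-de-grandes-joueuses_5486741_1616938.html",
        "https://www.telegraph.co.uk/world-cup/2019/06/12/womens-world-cup-2019-groups-complete-guide-teams-players-rankings/",
        "https://www.washingtonpost.com/sports/dcunited/us-womens-world-cup-champs-arrive-home-ahead-of-parade/2019/07/08/48df1a84-a1e3-11e9-a767-d7ab84aef3e9_story.html?utm_term=.8f474bba8a1a",
    ],
    "Date": ["08/07/2019", "08/07/2019", "08/07/2019"],
})

# Apply the helper to every mention to build the merge key.
df_mentions["url"] = df_mentions["Mentions"].map(base_url)
print(df_mentions["url"].tolist())
```

The resulting url column can be used as the merge key in exactly the same way as the one produced by str.extract.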

Use that new column as a key when merging:

df_mentions.merge(media,
                  left_on='url',
                  right_on='Link',
                  how='left').drop(columns=['url', 'Link'])

   Mentions                                Date        Source           Country Foundation
0  https://www.lemonde.fr/football/art...  08/07/2019  Lemonde          France  1944     
1  https://www.telegraph.co.uk/world-c...  08/07/2019  NaN              NaN     NaN      
2  https://www.washingtonpost.com/spor...  08/07/2019  Washington Post  USA     1877 
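The question also asked for an Is_in_media column, which the merge above does not produce. A minimal end-to-end sketch, assuming the same sample data as in the question and that each Link appears only once in media, is to pass indicator=True to merge and turn the resulting _merge column into Yes/No. The variable name result is only illustrative.

```python
import pandas as pd

df_mentions = pd.DataFrame({
    "Mentions": [
        "https://www.lemonde.fr/football/article/2019/07/08/coupe-du-monde-feminine-2109-au-sein-de-chaque-equipe-j-ai-vu-de-grandes-joueuses_5486741_1616938.html",
        "https://www.telegraph.co.uk/world-cup/2019/06/12/womens-world-cup-2019-groups-complete-guide-teams-players-rankings/",
        "https://www.washingtonpost.com/sports/dcunited/us-womens-world-cup-champs-arrive-home-ahead-of-parade/2019/07/08/48df1a84-a1e3-11e9-a767-d7ab84aef3e9_story.html?utm_term=.8f474bba8a1a",
    ],
    "Date": ["08/07/2019", "08/07/2019", "08/07/2019"],
})

media = pd.DataFrame({
    "Source": ["New York times", "Lemonde", "Washington Post"],
    "Link": ["https://www.nytimes.com/", "https://www.lemonde.fr/", "https://www.washingtonpost.com/"],
    "Country": ["USA", "France", "USA"],
    "Foundation": ["1851", "1944", "1877"],
})

# Base URL (scheme + host + first slash) used as the merge key.
df_mentions["url"] = df_mentions["Mentions"].str.extract(r"(https?://.+?/)", expand=False)

# indicator=True adds a "_merge" column: "both" when the key was found in media,
# "left_only" when it was not.
result = df_mentions.merge(media, left_on="url", right_on="Link", how="left", indicator=True)

# Translate the indicator into the Yes/No flag from the question.
result["Is_in_media"] = (result["_merge"] == "both").map({True: "Yes", False: "No"})

# Drop the helper columns that are no longer needed.
result = result.drop(columns=["url", "Link", "_merge"])
print(result)
```

Because this builds the result in one vectorized merge instead of assigning cell by cell inside nested loops, it does not depend on any variables left over from an earlier notebook session and avoids chained-assignment warnings.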

1 Comment

Thanks a lot! Sorry for the delay @Simon
