
I need to merge two DataFrames using the URL as the key. However, the URLs contain extra parts: in df1 I have https://www.mcdonalds.com/us/en-us.html, whereas in df2 I have https://www.mcdonalds.com.

I need to remove the /us/en-us.html after the .com, as well as the https://, so that I can merge the two DataFrames on the URL. Below is a simplified example. How can I do this?

df1 = {'url': ['https://www.mcdonalds.com/us/en-us.html', 'https://www.cemexusa.com/find-your-location']}
df2 = {'url': ['https://www.mcdonalds.com', 'www.cemexusa.com']}

df1['url']==df2['url']
Out[7]: False

Thanks.

3 Answers

3

URLs are not trivial to parse. Take a look at the urllib module in the standard library.

Here's how you could remove the path after the domain:

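import urllib.parse

def remove_path(url):
    # Assumes the URL includes a scheme such as https://
    parsed = urllib.parse.urlparse(url)
    # ParseResult is a namedtuple, so _replace returns a copy with the path cleared
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

# Requires df1 to be a pandas DataFrame
df1['url'] = df1['url'].apply(remove_path)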

6 Comments

Thanks Angus. When I run the code, I get the same error as above: AttributeError: 'list' object has no attribute 'apply'. Can you share some insight on this? Thanks.
In your question you mentioned Pandas, so I assumed that df1 was a Pandas DataFrame. If you want to keep it as a dict of lists, you can do df1['url'] = list(map(remove_path, df1['url']))
Or, if you prefer, df1['url'] = [remove_path(url) for url in df1['url']]
Oh yes, you are right. In my real example they are DataFrames, and your code works well!
But another problem is that after I apply this, the 'www.cemexusa.com' in df2 is removed entirely...
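
On the last comment: the entry disappears because www.cemexusa.com has no scheme, so urlparse puts the whole string in path rather than netloc, and clearing the path leaves an empty string. Below is a minimal sketch of one possible workaround, assuming a missing scheme can safely be treated as https. The name remove_path_safe and the default_scheme parameter are hypothetical and not part of the answer above.

```python
import urllib.parse

def remove_path_safe(url, default_scheme="https"):
    """Drop the path, adding a scheme first if the URL has none."""
    if "://" not in url:
        # Assumption: scheme-less entries are plain web addresses.
        url = f"{default_scheme}://{url}"
    parsed = urllib.parse.urlparse(url)
    return urllib.parse.urlunparse(parsed._replace(path=""))

# Without a scheme, the whole string is parsed as the path.
bare = urllib.parse.urlparse("www.cemexusa.com")
print(repr(bare.netloc), repr(bare.path))

print(remove_path_safe("www.cemexusa.com"))
print(remove_path_safe("https://www.mcdonalds.com/us/en-us.html"))
```

Note that the result still carries the scheme, so rows only match if both frames end up with the same one; extracting just the hostname avoids that.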
1

You can use urlparse as suggested by others, or you could use urlsplit. However, neither will handle www.cemexusa.com, because it has no scheme. So if you do not need the scheme in your key, you could use something like this:

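from urllib.parse import urlsplit

def to_key(url):
    # Add a scheme if it is missing so that urlsplit can find the hostname
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)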

Here is a full working example:

import pandas as pd
import io

from urllib.parse import urlsplit

df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")

df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")

df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)

def to_key(url):
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
    
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)

joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))

# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)

The output of print(joined) would be:

  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020

There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:

urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com

urlsplit("https://www.realpython.com").hostname  # also a valid URL
# www.realpython.com
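
One way to handle the omitted www is to strip a leading "www." from the hostname, so both spellings map to the same key. This is only a sketch: the function name to_key_no_www is hypothetical, and it assumes that www.example.com and example.com always refer to the same site in your data, which is not guaranteed in general.

```python
from urllib.parse import urlsplit

def to_key_no_www(url):
    """Return the hostname without a leading 'www.' to use as a merge key."""
    if "://" not in url:
        # Add a scheme so that urlsplit can locate the hostname.
        url = f"https://{url}"
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        # Assumption: the www and non-www hosts are the same site.
        host = host[len("www."):]
    return host

print(to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"))
print(to_key_no_www("https://www.realpython.com"))
print(to_key_no_www("www.cemexusa.com"))
```

Removing the prefix rather than adding it avoids having to decide which URLs are missing it.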

What is the difference between urlparse and urlsplit?

It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.

[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
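
To illustrate the difference, here is a small comparison on a made-up URL (www.example.com and the ;lang=en params are invented for the demonstration). The hostname is the same either way; only the treatment of the params in the path differs.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL containing path params (the part after ';').
url = "https://www.example.com/menu;lang=en?item=fries#top"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse separates the params from the path.
print(parsed.path, parsed.params)

# urlsplit leaves them attached to the path.
print(split.path)

# Both give the same hostname, which is all the merge key needs.
print(parsed.hostname == split.hostname)
```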

3 Comments

Thanks Thomas for sharing the full working example along with the solution for special cases. Really appreciate it.
You are welcome. I also just added a section about the difference between urlparse and urlsplit.
This is awesome! I also noticed that in my full dataset some URLs don't even have www. in front. I guess I will add another condition to the if statement: if neither 'www' nor 'http://' is in the url, then url = f"www.{url}"
1

Use urlparse and isolate the hostname:

from urllib.parse import urlparse

urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'

3 Comments

Thanks. When I run the code, it throws an error. Can you give me more advice? AttributeError: 'list' object has no attribute 'apply'
Never mind, it's because df1 is not really a DataFrame. If I convert it to a DataFrame, it works! Thanks.
I encountered an issue: after I apply this, the 'www.cemexusa.com' is removed entirely from df2...
