You can use urlparse as suggested by others, or you could also use urlsplit. However, both will not handle www.cemexusa.com. So if you do not need the scheme in your key, you could use something like this:
def to_key(url):
if "://" not in url: # or: not re.match("(?:http|ftp|https)://"", url)
url = f"https://{url}"
return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
Here is a full working example:
import pandas as pd
import io
from urllib.parse import urlsplit
df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")
df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")
df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)
def to_key(url):
if "://" not in url: # or: not re.match("(?:http|ftp|https)://"", url)
url = f"https://{url}"
return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)
joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
The output of print(joined) would be:
Description Key Last Update
0 Junk Food www.mcdonalds.com 2021
1 Cemex www.cemexusa.com 2020
There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:
urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com
urlsplit("https://www.realpython.com").hostname # also a valid URL
# www.realpython.com
What is the difference between urlparse and urlsplit?
It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.
[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit