How to remove the string after .com and "https://" from a URL in Python

Question

I need to merge two DataFrames using the URL as the key. However, some URLs contain extra strings: in df1 I have https://www.mcdonalds.com/us/en-us.html, whereas in df2 I have https://www.mcdonalds.com.

I need to remove the /us/en-us.html after the .com, and the https://, from each URL so that I can merge the two DataFrames on the URL. Below is a simplified example. What would be the solution for this?

df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}

df1['url']==df2['url']
Out[7]: False

Thanks.

Angus L'Herrou · Accepted Answer · 2021-10-14 14:51:22Z

3

URLs are not trivial to parse. Take a look at the urllib.parse module in the standard library.

Here's how you could remove the path after the domain:

import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

df1['url'] = df1['url'].apply(remove_path)
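One caveat worth checking before applying this to real data: remove_path relies on the URL having a scheme. The sketch below reuses the function unchanged and compares a URL with a scheme against one without; the two sample strings come from the question, and the variable names are only illustrative.

```python
import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

# With a scheme, only the path is dropped.
with_scheme = remove_path('https://www.mcdonalds.com/us/en-us.html')

# Without a scheme, urlparse() treats the whole string as the path
# (netloc stays empty), so clearing the path leaves nothing behind.
without_scheme = remove_path('www.cemexusa.com')

print(repr(with_scheme))     # 'https://www.mcdonalds.com'
print(repr(without_scheme))  # ''
```

If some of your URLs lack a scheme, prefixing one (for example "https://") before parsing avoids the empty result.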

answered Oct 14, 2021 at 14:51

Angus L'Herrou


6 Comments

user032020 Over a year ago

Thanks Angus, when I run the code, I have the same error as above which is AttributeError: 'list' object has no attribute 'apply'. Can you share some of your insights on this? Thanks

Angus L'Herrou Over a year ago

In your question you mentioned Pandas so I assumed that df1 was a Pandas DataFrame. If you want to keep it as a dict of lists, you can do df1['url'] = list(map(remove_path, df1['url']))

Angus L'Herrou Over a year ago

Or, if you prefer, df1['url'] = [remove_path(url) for url in df1['url']]

user032020 Over a year ago

Oh yes, you are right. In my real data they are DataFrames, and your code works well!

user032020 Over a year ago

But there is another problem: after I applied this, 'www.cemexusa.com' in df2 was removed entirely...


Thomas · Accepted Answer · 2021-10-14 15:59:26Z

1

You can use urlparse as suggested by others, or you could use urlsplit. However, neither will handle www.cemexusa.com, because it has no scheme and is therefore parsed as a path rather than a host. So if you do not need the scheme in your key, you could use something like this:

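from urllib.parse import urlsplit

def to_key(url):
    # Add a scheme if it is missing so that urlsplit() recognizes the host.
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)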

Here is a full working example:

import pandas as pd
import io

from urllib.parse import urlsplit

df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")

df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")

df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)

def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
    
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)

joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))

# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)

The output of print(joined) would be:

  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020

There may be other special cases not handled in this answer. Depending on your data, you may also need to handle an omitted www:

urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com

urlsplit("https://www.realpython.com").hostname  # also a valid URL
# www.realpython.com
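One possible way to handle an omitted www is to strip the prefix from the key, so that both spellings map to the same value. This is only a sketch: to_key_no_www is a hypothetical name, and it assumes that hosts with and without "www." refer to the same site in your data, which is not guaranteed in general.

```python
from urllib.parse import urlsplit

def to_key_no_www(url):
    """Return the host of url without a leading 'www.' (hypothetical helper)."""
    url = url.strip()
    if "://" not in url:
        # Add a scheme so that urlsplit() recognizes the host.
        url = f"https://{url}"
    # hostname is lowercased by urlsplit(); it is None if no host was found.
    host = urlsplit(url).hostname or ""
    # Assumption: 'www.example.com' and 'example.com' should share a key.
    if host.startswith("www."):
        host = host[len("www."):]
    return host

keys = [
    to_key_no_www("https://www.mcdonalds.com/us/en-us.html"),
    to_key_no_www("www.cemexusa.com"),
    to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"),
    to_key_no_www("https://www.realpython.com"),
]
print(keys)
# ['mcdonalds.com', 'cemexusa.com', 'realpython.com', 'realpython.com']
```

Stripping the prefix rather than adding it means that a URL which already lacks www needs no special case.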

What is the difference between urlparse and urlsplit?

It depends on your use case and what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.

"[urlsplit()] is similar to urlparse(), but does not split the params from the URL."
Source: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
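To make the difference concrete, here is a small comparison on a made-up URL (example.com and the ";params" segment are illustrative only). For building a merge key from the hostname, both functions behave the same.

```python
from urllib.parse import urlparse, urlsplit

url = "https://example.com/path;params?q=1"  # hypothetical URL

parsed = urlparse(url)  # 6 fields: params is split off the last path segment
split = urlsplit(url)   # 5 fields: params stay part of the path

print(parsed.path, parsed.params)         # /path params
print(split.path)                         # /path;params
print(parsed.hostname == split.hostname)  # True
```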

edited Oct 14, 2021 at 15:59

answered Oct 14, 2021 at 15:29

Thomas


3 Comments

user032020 Over a year ago

Thanks, Thomas, for sharing the full working example along with the solution to special cases. I really appreciate it.

Thomas Over a year ago

You are welcome. I also just added a section about the difference between urlparse and urlsplit.

user032020 Over a year ago

This is awesome! I also noticed that in my full dataset some URLs don't even have www. in front. I guess I will add another condition to the if statement: if neither 'www' nor 'http://' is in the URL, then url = f"www.{url}"

user2390182 · Accepted Answer · 2021-10-14 15:22:49Z

1

Use urlparse and isolate the hostname:

from urllib.parse import urlparse

urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'
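A short sketch of how hostname behaves on a couple of other inputs may help when deciding whether it suits your data. The first URL below is made up for illustration; the second is the scheme-less value from the question.

```python
from urllib.parse import urlparse

# Hypothetical URL with mixed case and a port.
parsed = urlparse('https://WWW.Example.com:8080/index.html')
netloc = parsed.netloc      # 'WWW.Example.com:8080' (as written, port included)
hostname = parsed.hostname  # 'www.example.com' (lowercased, port removed)

# Without a scheme, no host is detected and the string ends up in .path.
no_scheme = urlparse('www.cemexusa.com')
missing_host = no_scheme.hostname  # None
raw_path = no_scheme.path          # 'www.cemexusa.com'

print(netloc, hostname, missing_host, raw_path)
```

Since hostname is None for scheme-less input, applying it directly to a column that contains such values yields missing keys for those rows.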

edited Oct 14, 2021 at 15:22

answered Oct 14, 2021 at 14:50

user2390182


3 Comments

user032020 Over a year ago

Thanks, when I run the below code, it throws me an error, can you give me more advice? AttributeError: 'list' object has no attribute 'apply'

user032020 Over a year ago

Never mind, it's because df1 is not really a DataFrame. If I convert it to a DataFrame, it works! Thanks.

user032020 Over a year ago

I encountered an issue: after I applied this, 'www.cemexusa.com' was removed entirely from df2...
