Python: Regex to find associated HTML links

Question

I need some help writing a regex pattern that can find affiliate links on a webpage.

Example code:

import re

import requests
from bs4 import BeautifulSoup

# Fetch the page and collect every <a> tag that has an href attribute.
res = requests.get('https://www.example.com')
soup = BeautifulSoup(res.text, 'lxml')
links = soup.find_all('a', href=True)

# example_of_affiliate_links = ['http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/','https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html']

I want to collect all affiliated links for "mywebsite.com", using the following regex pattern, but it is not capturing any links.

pattern = re.compile(r'([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)')

Is there a better way to do this?

Here is a site I use regularly to build / test regex patterns.
– s3dev, Commented Apr 14, 2020 at 11:40
This one is also pretty cool to have visual representations of your regex.
– Zorzi, Commented Apr 14, 2020 at 11:45

Zorzi · Accepted Answer · 2020-04-14 12:49:34Z

Here's the regex you're looking for:

https?://www.mywebsite.com\S*$
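
A minimal sketch of how this pattern could be applied to the two example links from the question. It assumes `re.search` is used (`re.match` only matches at the start of the string, and each affiliate link starts with a different domain), and it escapes the dots so they match only a literal "."; the variable names `example_links` and `destinations` are illustrative.

```python
import re

# Same pattern as above, with the dots escaped so "." matches only a literal dot.
pattern = re.compile(r'https?://www\.mywebsite\.com\S*$')

# Example affiliate links taken from the question.
example_links = [
    'http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/',
    'https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html',
]

# re.search scans the whole string, so the embedded destination URL is found
# even though the link itself begins with the affiliate network's domain.
destinations = []
for link in example_links:
    m = pattern.search(link)
    if m:
        destinations.append(m.group(0))

print(destinations)
```

Because `\S*` is greedy and the pattern ends with `$`, each match runs from the embedded `mywebsite.com` URL to the end of the link, including any trailing parameters.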

What's wrong with your regex?

([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)

The parentheses around the whole pattern are unnecessary.
[] means "any one of these characters", so [http,https] matches a single character, which may be "h", "t", "p", "s" or ",".
\S matches only one character; you need to add a quantifier (such as *) after it.
The same goes for the [\.html,\.php,\&] part: it is a character class that matches a single character, not one of the strings ".html", ".php" or "&".
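
The points above can be checked with a short sketch. The test URL is taken from the question's example; the names `char_class`, `original` and `fixed` are illustrative, and the `fixed` pattern is just one possible way to write the alternation.

```python
import re

url = 'https://www.mywebsite.com/product/138/sony-ps4.html'

# A character class matches exactly ONE character from the set.
char_class = re.compile(r'[http,https]')
single = char_class.match(url).group(0)

# In the original pattern, \S and the second class each consume a single
# character, so the string would have to end two characters after "com".
original = re.compile(r'([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)')
original_result = original.search(url)

# Alternation expresses "http or https" (equivalent to https?), and the
# * quantifier lets \S consume the rest of the URL.
fixed = re.compile(r'(?:http|https)://www\.mywebsite\.com\S*$')
fixed_result = fixed.search(url)

print(single)           # one character only
print(original_result)  # no match
print(fixed_result.group(0))
```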

edited Apr 14, 2020 at 12:49

answered Apr 14, 2020 at 11:46


2 Comments

hadesfv Over a year ago

this fails on example.com/click/…

user11322373 Over a year ago

This fails if any parameters are placed after .html, e.g. https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html&q=ps4
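
If trailing parameters should be left out of the match, one possible variation is to stop at the first "&" instead of running to the end of the string. This is only a sketch and rests on an assumption not stated in the thread: that the destination URL itself never contains "&". The names `links` and `trimmed` are illustrative.

```python
import re

# Assumption: the destination URL never contains "&", so the match stops
# at the first "&" (or at whitespace / end of string).
pattern = re.compile(r'https?://www\.mywebsite\.com[^\s&]*')

links = [
    'http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/',
    'https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html&q=ps4',
]

# Keep only the destination part of each link that contains one.
trimmed = [m.group(0) for m in map(pattern.search, links) if m]

print(trimmed)
```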
