0

I need some help writing a regex pattern which can find affiliated links from a webpage.

Example code:

import requests,re
from bs4 import BeautifulSoup
res = requests.get('https://www.example.com')
soup = BeautifulSoup(res.text,'lxml')
links = soup.find_all('a', href=True)

# example_of_affiliate_links = ['http://example.com/click/click?p=1&t=url&s=IDHERE&url=https://www.mywebsite.com/920&f=TXL&name=electronic/ps4/','https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html']

I want to collect all affiliated links for "mywebsite.com", using the following regex pattern, but it is not capturing any links.

pattern = re.compile(r'([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)')

Is there a better way to do this?

2
  • Here is a site I use regularly to build / test regex patterns. Commented Apr 14, 2020 at 11:40
  • This one is also pretty cool to have visual representations of your regex Commented Apr 14, 2020 at 11:45

1 Answer 1

1

Here's the regex you're looking for:

https?://www.mywebsite.com\S*$

What's wrong with your regex?

([http,https]://www.mywebsite.com\S[\.html,\.php,\&]$)
  • The braces on each sides are useless
  • [] means any of those characters, so in [http,https], you're looking of one character, which might be "h", "t", "t", "p", "s" or ","
  • \S only captures one character, your need to add a multiplier after it
  • Same thing goes for the [\.html,\.php,\&] part
Sign up to request clarification or add additional context in comments.

2 Comments

this fails on example.com/click/…
this fails if any parameters are placed after .html eg https://example.net/click/camref:IDhere/destination:https://www.mywebsite.com/product/138/sony-ps4.html&q=ps4

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.