When I scrape a website for article URLs, I collect every <a> tag and read its href attribute. The resulting list also contains links that are not articles, such as links to categories or other pages within the same domain, so I need to do the following:

Create a pattern for the URL and match each URL in the list against it, so I can tell whether a given URL is an article URL or not.

An example of the pattern:

link: "http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html"

pattern match: http://www.cnbc.com/(*)/(*)/(*)/(*).html

So the idea is to replace every variable part of the link with (*).

The question is: how do I match a link against such a pattern?

  • Use [^/]+ instead of *, and escape the dot. Commented Mar 13, 2016 at 18:39
  • The first three (*) sections are numbers, so you can use [0-9]+. The last (*) section is a combination of letters and symbols, so you can use .+. Commented Mar 13, 2016 at 18:40
  • I designed this pattern format for users who are not programmers, so they can't convert the URL to a regex themselves. This is just an example; it should work with any site. Commented Mar 13, 2016 at 18:43
  • What code do you have and what have you tried? Commented Mar 13, 2016 at 18:45
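Putting the first comment's advice together with the requirement that non-programmers write the (*) templates, one possible approach is to convert the template to a regex automatically. The sketch below is only an illustration: `template_to_regex` is a hypothetical helper name, and it assumes each (*) stands for exactly one non-empty path segment.

```python
import re

def template_to_regex(template):
    """Turn a '(*)' template into a compiled regex (hypothetical helper)."""
    # Split on the placeholder, escape the literal pieces (this escapes the
    # dots), then rejoin with a group that matches one non-empty path segment.
    literal_parts = [re.escape(part) for part in template.split("(*)")]
    return re.compile("([^/]+)".join(literal_parts))

template = "http://www.cnbc.com/(*)/(*)/(*)/(*).html"
url = "http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html"

article_re = template_to_regex(template)
# fullmatch requires the whole URL to fit the template, not just a prefix.
match = article_re.fullmatch(url)
print(match.groups() if match else "not an article URL")
```

Because the literal parts go through `re.escape`, the person writing the template never has to know which characters are special in a regex.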

1 Answer

Regular Expression (regex) match

You can do this with a regex match.

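import re

# Example url
url = 'http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html'
# Create a regex match pattern: the dots are escaped so they match a literal '.',
# and [^/]+ keeps each group inside a single path segment
pattern = r'http://www\.cnbc\.com/([^/]+)/([^/]+)/([^/]+)/([^/]+)\.html'
# Find match (m is None if the url does not fit the pattern)
m = re.match(pattern, url)
# Get Groups
m.groups()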

('2016',
 '03',
 '13',
 'financial-times-china-rebuts-economy-doomsayers-on-debt-and')

2 Comments

You should consider replacing the * with + because it doesn't really make sense to match nothing within the / dividers.
It worked fine, and I can also use "\d" for the digit groups instead of (.*). Thanks.
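As a rough sketch of the "\d" idea from the comment above, applied to a whole list of scraped links: the fixed widths (four digits for the year, two each for month and day) are an assumption based on the single example URL, and every link except the first is made up for illustration.

```python
import re

# Date parts are digits only; the slug is one path segment.
# The {4}/{2}/{2} widths are an assumption, not something the site guarantees.
ARTICLE_RE = re.compile(
    r"http://www\.cnbc\.com/(\d{4})/(\d{2})/(\d{2})/([^/]+)\.html"
)

# Hypothetical scraped hrefs; only the first one comes from the question.
links = [
    "http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html",
    "http://www.cnbc.com/world-economy/",
    "http://www.cnbc.com/id/abc/def/ghi.html",
]

# Keep only the links that fit the article pattern from start to end.
article_links = [link for link in links if ARTICLE_RE.fullmatch(link)]
print(article_links)
```

The third link has four path segments and ends in .html, so a pattern built only from generic segment groups would accept it; the digit groups are what reject it here.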
