Compare a URL with a customized pattern in Python

Question

When I scrape a website for article URLs, I collect every <a> tag and read its href attribute. The resulting list contains some links that are not articles but point to categories or other pages within the same domain, so I need to do the following:

Create a pattern for the URL and match each URL in the list against it, so I can tell whether a given URL is an article URL or not.

An example of the pattern:

link: "http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html"

pattern match: http://www.cnbc.com/(*)/(*)/(*)/(*).html

The idea is to replace every variable part of the link with (*).

The question is: how do I match a link against such a pattern?

The first three (*) sections are numbers, so you can use [0-9]+. The last (*) section is a combination of letters and symbols, so you can use .+ there.
– Shrey, Commented Mar 13, 2016 at 18:40
I made this pattern for users who are not programmers, so they can't convert the URL to a regex themselves. This is just an example; the pattern format is meant to work with any site.
– Mohamed Yousof, Commented Mar 13, 2016 at 18:43
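Since the (*) templates are meant to be written by non-programmers, one option is to translate a template into a regex automatically. The following is a minimal sketch, not part of the original answer: the helper name `wildcard_to_regex` is hypothetical, and it assumes each (*) should match exactly one path segment (anything except a slash).

```python
import re

def wildcard_to_regex(template, wildcard="(*)"):
    """Compile a template such as 'http://host/(*)/(*).html' into a regex."""
    # Escape each literal piece so characters like '.' match themselves,
    # then join the pieces with a group that matches one path segment.
    literal_parts = [re.escape(part) for part in template.split(wildcard)]
    return re.compile("([^/]+)".join(literal_parts))

pattern = wildcard_to_regex("http://www.cnbc.com/(*)/(*)/(*)/(*).html")
url = "http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html"

# fullmatch requires the whole URL to fit the template
print(pattern.fullmatch(url) is not None)                            # True
print(pattern.fullmatch("http://www.cnbc.com/world/") is not None)   # False
```

Escaping the literal parts with `re.escape` means the user never has to know which characters are special in a regex.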

tmthydvnprt · Accepted Answer · 2016-06-09 11:48:23Z

2

Regular Expression (`regex`) match

You can do this with a regex match.

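import re

# Example URL
url = 'http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html'
# Regex pattern: dots are escaped so they match literally, and [^/]+ keeps
# each group within a single path segment
pattern = r'http://www\.cnbc\.com/([^/]+)/([^/]+)/([^/]+)/([^/]+)\.html'
# re.match returns a match object, or None if the URL does not fit the pattern
m = re.match(pattern, url)
# Get the captured groups (displayed when run in an interactive session)
m.groups()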

('2016',
 '03',
 '13',
 'financial-times-china-rebuts-economy-doomsayers-on-debt-and')
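To use this on the scraped list from the question, keep in mind that a URL that does not fit the pattern yields `None`, so calling `.groups()` on it would raise an error. A rough sketch of the filtering step follows; the function name `article_urls` and the sample links are made up for illustration, and the pattern here escapes the dots and restricts each group to a single path segment.

```python
import re

# Compile once and reuse for every link.
ARTICLE_PATTERN = re.compile(
    r'http://www\.cnbc\.com/([^/]+)/([^/]+)/([^/]+)/([^/]+)\.html'
)

def article_urls(links):
    """Return only the links that match the article pattern in full."""
    return [link for link in links if ARTICLE_PATTERN.fullmatch(link)]

# Hypothetical sample of scraped href values.
links = [
    'http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html',
    'http://www.cnbc.com/world-markets/',
    'http://www.cnbc.com/',
]

print(article_urls(links))
```

Only the first link survives the filter; the category page and the home page are dropped.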

edited Jun 9, 2016 at 11:48

answered Mar 13, 2016 at 18:42


2 Comments

Shrey Over a year ago

You should consider replacing the * with + because it doesn't really make sense to match nothing within the / dividers.

Mohamed Yousof Over a year ago

It worked fine, and I can also use "\d" for the digit sections instead of (.*). Thanks.
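The digit-based variant mentioned in the comments could look roughly like this. The fixed widths (four-digit year, two-digit month and day) and the group names are assumptions inferred from the single example URL, not something the site guarantees.

```python
import re

# Named groups make the captured parts easier to read than positional ones.
DATED_ARTICLE = re.compile(
    r'http://www\.cnbc\.com/'
    r'(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/'
    r'(?P<slug>[^/]+)\.html'
)

url = 'http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html'
m = DATED_ARTICLE.fullmatch(url)
print(m.groupdict())

# Non-numeric date segments are rejected, which (.+) would have accepted.
print(DATED_ARTICLE.fullmatch('http://www.cnbc.com/a/b/c/d.html'))  # None
```

The stricter pattern rejects pages that happen to have four path segments but are not dated articles.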
