Match web URL with Python regex

Question

I am trying to extract web links from web content with a Python regex. Here's my Python script:

webUrlList = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)
print webUrlList

and the matched webUrlList is like:

['/', '.html', '/', '/', '/', '/',...]

Please help me find out why this script yields the above output.

Sample target URL strings:

<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"

<a href="/abcabcdef/coffee/su1/"

I'm having trouble reproducing the output you're citing. When using the regex that you supplied, r"(?<=<a href=\").+(.html|/)(?=\")", I'm only getting ['.html'] and not any forward-slash characters.
– wpcarro, Commented Jul 3, 2016 at 17:53
Just make the capturing group a non-capturing one, and use lazy dot matching.
– Wiktor Stribiżew, Commented Jul 3, 2016 at 18:01
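
To illustrate the comment above: when a pattern contains exactly one capturing group, re.findall returns the text of that group rather than the whole match, which is why only '.html' or '/' comes back. The following is a minimal sketch, assuming the question's two sample strings sit on separate lines; the `content` value is made up for illustration. If all links were on a single line, the greedy `.+` would run to the last candidate on that line, which may explain the single ['.html'] result reported in the first comment.

```python
import re

# Hypothetical content built from the two sample strings in the question.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"\n'
    '<a href="/abcabcdef/coffee/su1/"\n'
)

# Original pattern: one capturing group, so findall returns only that group.
original = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)

# Non-capturing group (?:...) plus lazy .+? returns the whole match instead.
# The dot before "html" is also escaped so it matches a literal period.
fixed = re.findall(r"(?<=<a href=\").+?(?:\.html|/)(?=\")", content)

print(original)
print(fixed)
```

The lazy `.+?` can still run past the closing quote of an href that does not end in `.html` or `/`, so this is a sketch of the suggested fix, not a general-purpose link matcher.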

wpcarro · Accepted Answer · 2016-07-03 19:29:41Z

2

If you're only extracting links and you know the shape of the content you will be parsing, the following regex should do what you're after and is reasonably safe.

import re

regex = re.compile(r'href="([^"]+)')
results = regex.findall(content)  # content: the HTML text to search

href=" consumes but doesn't capture the literal characters href="
([^"]+) consumes and captures any character that isn't a quotation mark

Run a few trials with the content you are scraping and assess whether you need more specificity in the regex or not.
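
One such trial could look like the sketch below. The `content` string is hypothetical: it combines the question's two samples with a `<link>` tag to show that the broad pattern also picks up hrefs outside of `<a>` tags. The tighter `a_pattern` is only one possible way to add specificity and is an assumption, not part of the original answer.

```python
import re

# Hypothetical trial input: the question's two samples plus a stylesheet link.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html">x</a>\n'
    '<a href="/abcabcdef/coffee/su1/">y</a>\n'
    '<link href="style.css">\n'
)

# The pattern from the answer: captures every double-quoted href value.
href_pattern = re.compile(r'href="([^"]+)')
all_hrefs = href_pattern.findall(content)

# One possible tightening: require an <a ...> tag and an .html or / ending.
a_pattern = re.compile(r'<a\s[^>]*?href="([^"]+(?:\.html|/))"')
a_hrefs = a_pattern.findall(content)

print(all_hrefs)
print(a_hrefs)
```

Here `all_hrefs` includes `style.css`, while `a_hrefs` keeps only the two anchor links. Neither pattern handles single-quoted or unquoted attribute values.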

edited Jul 3, 2016 at 19:29

answered Jul 3, 2016 at 18:04

wpcarro


2 Comments

Wiktor Stribiżew Over a year ago

You are using re.findall. r'href="([^"]+)' is enough.

wpcarro Over a year ago

@WiktorStribiżew indeed it is. Good catch. I'll modify the answer above.

Community · Accepted Answer · 2017-05-23 12:08:29Z

1

Use an HTML parser like BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")

print([a["href"] for a in soup.find_all("a", href=True)])

Don't use a regex to parse HTML.
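
If adding a third-party module is a concern, the same parser-based idea can be sketched with the standard library's html.parser module. This assumes Python 3; `LinkCollector` and the sample `content` are hypothetical names and data used only for illustration, and this is a swapped-in alternative rather than the BeautifulSoup approach shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical content: one double-quoted href, one single-quoted, one <a> without href.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html">x</a>'
    "<a href='/abcabcdef/coffee/su1/'>y</a>"
    '<a name="no-link">z</a>'
)

collector = LinkCollector()
collector.feed(content)
print(collector.links)
```

Unlike the quote-based regex, the parser handles single-quoted attributes and skips `<a>` tags that have no href.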

edited May 23, 2017 at 12:08

CommunityBot


answered Jul 3, 2016 at 17:42

Padraic Cunningham


10 Comments

wpcarro Over a year ago

This requires adding an additional module, BeautifulSoup, to the project. I understand that there may be better tools to parse HTML than regular expressions. But this question is asking about extracting web links using regular expressions. So while your answer works and is elegant, it seems to side-step what's being asked.

Padraic Cunningham Over a year ago

@wcarroll, stackoverflow.com/questions/1732348/… You should not use regex to parse HTML. There is no side-stepping of what is being asked; it is the correct approach to what is essentially being asked.

wpcarro Over a year ago

I almost included in my comment "yes I have seen the infamous SO post". I guess I should have been explicit. This doesn't change my comment above. If he is only parsing small strings that contain HTML, regular expressions are fit for the task and I think preferable to including a third-party module and learning its API.

Padraic Cunningham Over a year ago

@wcarroll, where does it say they are parsing small strings that contain HTML? "I am trying to extract web links from web content" seems pretty clear that they are parsing the full content returned. I am not going to encourage anyone to parse HTML with a regex, and anyone who does is leading someone down a bad path.

M. Timtow Over a year ago

But he is not parsing the whole HTML to build a DOM either. Isn't it OK to look for HTTP URIs in it? If not, why exactly?
