Python regex url grab

Question

I am having trouble figuring out how to select part of an HTML link using regex.

Say the link is:

<a href="race?raceid=1234">Mushroom Cup</a>

I have figured out how to get the race id, but I cannot for the life of me figure out how to use a regular expression to find just 'Mushroom Cup'. The best I can do is get 1234>Mushroom Cup.

I'm new to regular expressions and it is just too much for me to comprehend.

How much could the input vary? If you're extracting this data from several places in a large document, it might be worth using an HTML parser instead of regex.
– user1726343, Commented Aug 19, 2013 at 20:59

Joran Beasley · Accepted Answer · 2013-08-19 21:02:59Z

1

Something very much like:

re.findall(r'<a href="race\?raceid=(\d+)">([^<]+)</a>', html_text)
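As a rough illustration of what that call returns, here is a minimal sketch run against the link from the question. The sample string and the names html_text, pattern, and matches are assumptions made for the example. With two capture groups, re.findall returns a list of tuples, one per matching link.

```python
import re

# Sample input taken from the question; html_text is just an illustrative name.
html_text = '<a href="race?raceid=1234">Mushroom Cup</a>'

# Group 1 captures the digits of the race id.
# Group 2 captures the link text, i.e. everything up to the next '<'.
# The raw string (r'...') keeps \? and \d from being treated as string escapes.
pattern = r'<a href="race\?raceid=(\d+)">([^<]+)</a>'

matches = re.findall(pattern, html_text)

for race_id, name in matches:
    print(race_id, name)
```

The link text is the second item of each tuple, so matches[0][1] gives just the name without the id.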

answered Aug 19, 2013 at 21:02

Joran Beasley


4 Comments

amchugh89 Over a year ago

I am having trouble downloading beautiful soup (I have anaconda python package distribution), so thank you for this regex answer

alecxe Over a year ago

please don't use regex for parsing html :)

Joran Beasley Over a year ago

If that's really all he needs, it's pretty easy to get with a regex ... although in general I certainly agree.

alecxe Over a year ago

@JoranBeasley yeah, I'll put you +1 for being nice to the OP and me :D

alecxe · Accepted Answer · 2013-08-19 21:05 (edited by Community 2017-05-23 10:25:27Z)

1

Don't ever use regex for parsing HTML. Instead use HTML parsers like lxml or BeautifulSoup.

Here's an example using BeautifulSoup:

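# Python 3: these functions live in urllib.parse (the Python 2 module was urlparse)
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<html>
<head>
    <title>Python regex url grab - Stack Overflow</title>
</head>
<body>
    <a href="race?raceid=1234">Mushroom Cup</a>
</body>
</html>
""", "html.parser")  # name the parser explicitly to avoid a warning

link = soup.find('a')  # first <a> element in the document
# parse_qs maps each query parameter to a list of its values
par = parse_qs(urlparse(link.attrs['href']).query)
print(par['raceid'][0])   # prints 1234
print(link.text)   # prints Mushroom Cup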

Note that urlparse is used to extract the value of the link's query parameter. See also: Retrieving parameters from a URL.
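To make the query-string step concrete, here is a small sketch that uses only the standard library. It assumes Python 3, where urlparse and parse_qs are imported from urllib.parse. The names href, query, and params are illustrative. parse_qs returns a dict whose values are lists, which is why the example above indexes with [0].

```python
from urllib.parse import urlparse, parse_qs

# href value from the question's link
href = "race?raceid=1234"

# urlparse splits the URL into components; .query is the part after '?'
query = urlparse(href).query

# parse_qs maps each parameter name to a list of values,
# because a parameter may legally appear more than once
params = parse_qs(query)

print(query)
print(params)
print(params["raceid"][0])
```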

Hope that helps.


answered Aug 19, 2013 at 21:05

alecxe


2 Comments

amchugh89 Over a year ago

oh...that seems nicer

Joran Beasley Over a year ago

+1 since I agree in general that parsing HTML with a regex is a bad idea, but it would be nice to demonstrate why this solution may be superior to the simple regex for the OP's question. I know there are several reasons not to use regex (mainly that HTML is a nested language and regex doesn't handle nesting so well, being stateless).
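One possible way to demonstrate the point raised in this comment is to run the exact-match regex and a parser over small variations of the same link. The sketch below is not taken from either answer. It uses the standard library's html.parser as a stand-in for BeautifulSoup or lxml so that it runs without extra installs, and the LinkCollector class and the sample strings are hypothetical.

```python
import re
from html.parser import HTMLParser

PATTERN = r'<a href="race\?raceid=(\d+)">([^<]+)</a>'

# Three links a browser treats alike; only the first has the exact
# shape the regex expects.
samples = [
    '<a href="race?raceid=1234">Mushroom Cup</a>',
    "<a href='race?raceid=1234'>Mushroom Cup</a>",              # single quotes
    '<a class="cup" href="race?raceid=1234">Mushroom Cup</a>',  # extra attribute
]


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a> with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


regex_hits = [re.findall(PATTERN, s) for s in samples]

parser_hits = []
for s in samples:
    collector = LinkCollector()
    collector.feed(s)
    collector.close()
    parser_hits.append(collector.links)

for s, r, p in zip(samples, regex_hits, parser_hits):
    print(s, "| regex:", r, "| parser:", p)
```

In this sketch the regex finds only the first link, while the parser reports the href and text for all three. If the markup is known to be fixed, the regex is still the shorter option.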
