
I am having trouble figuring out how to select part of an HTML link using regex.

Say the link is:

<a href="race?raceid=1234">Mushroom Cup</a>

I have figured out how to get the race id, but I cannot for the life of me figure out how to use a regular expression to find just 'Mushroom Cup'. The best I can do is get 1234>Mushroom Cup.

I'm new to regular expressions and it is just too much for me to comprehend.

How much could the input vary? If you're extracting this data from several places in a large document, it might be worth using an HTML parser instead of regex. Commented Aug 19, 2013 at 20:59

2 Answers


Something very much like this should work:

import re
re.findall(r'<a href="race\?raceid=(\d+)">([^<]+)</a>', html_text)
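As a minimal sketch of how that pattern behaves on the link from the question: the first group captures the race id and the second captures the link text. The variable names `html_text`, `pattern`, `matches`, and `names` are only illustrative, and the pattern assumes the link is written exactly as shown (double quotes, `href` as the only attribute, no nested tags).

```python
import re

# Sample markup taken from the question.
html_text = '<a href="race?raceid=1234">Mushroom Cup</a>'

# Group 1 captures the digits of the race id; group 2 captures the link text
# (every character up to the next "<").
pattern = re.compile(r'<a href="race\?raceid=(\d+)">([^<]+)</a>')

matches = pattern.findall(html_text)
print(matches)  # [('1234', 'Mushroom Cup')]

# To keep only the link text, take the second item of each tuple.
names = [name for _race_id, name in matches]
print(names)  # ['Mushroom Cup']
```

With two capture groups, `re.findall` returns a list of tuples, one per matching link, which is why the name is picked out by position.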

4 Comments

I am having trouble downloading Beautiful Soup (I have the Anaconda Python distribution), so thank you for this regex answer.
Please don't use regex for parsing HTML :)
If that's really all he needs, it's pretty easy to get with a regex ... although in general I certainly agree.
@JoranBeasley yeah, I'll put you +1 for being nice to the OP and me :D

Don't ever use regex for parsing HTML. Instead use HTML parsers like lxml or BeautifulSoup.

Here's an example using BeautifulSoup:

from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<html>
<head>
    <title>Python regex url grab - Stack Overflow</title>
</head>
<body>
    <a href="race?raceid=1234">Mushroom Cup</a>
</body>
</html>
""", "html.parser")

link = soup.find('a')
# Parse the query string of the href into a dict mapping names to lists of values
params = parse_qs(urlparse(link.attrs['href']).query)
print(params['raceid'][0])   # prints 1234
print(link.text)   # prints Mushroom Cup

Note that urlparse is used to get the value of the link's query parameter. See more here: Retrieving parameters from a URL.
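If installing BeautifulSoup is a problem, roughly the same idea can be sketched with the standard library alone. This is only an illustration under that assumption, not part of the approach above: `RaceLinkParser` is a hypothetical helper name, and it simply records the `raceid` and text of every `<a>` tag it sees.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class RaceLinkParser(HTMLParser):
    """Collects (raceid, link text) pairs from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (raceid, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a> with an href is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            query = parse_qs(urlparse(self._href).query)
            race_id = query.get("raceid", [None])[0]
            self.links.append((race_id, "".join(self._text).strip()))
            self._href = None


parser = RaceLinkParser()
parser.feed('<body><a href="race?raceid=1234">Mushroom Cup</a></body>')
print(parser.links)  # [('1234', 'Mushroom Cup')]
```

Because the text is accumulated between the opening and closing `<a>` tags, markup nested inside the link does not prevent the text from being collected.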

Hope that helps.

2 Comments

oh...that seems nicer
+1 since I agree in general that parsing HTML with a regex is a bad idea, but it would be nice to demonstrate why this solution may be superior to the simple regex for the OP's question. I know there are several reasons not to use regex (mainly that HTML is a nested language and regex, being stateless, doesn't handle nesting well).
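One way to illustrate the point is to run the regex from the first answer against a few small variations of the link. The variations below are made up for illustration (they are not from the question); each is still an ordinary link to the same race, yet the pattern only matches the original form.

```python
import re

# The pattern from the regex answer.
pattern = re.compile(r'<a href="race\?raceid=(\d+)">([^<]+)</a>')

# Hypothetical variations of the same link.
variants = {
    "original": '<a href="race?raceid=1234">Mushroom Cup</a>',
    "nested tag": '<a href="race?raceid=1234"><b>Mushroom</b> Cup</a>',
    "extra attribute": '<a class="cup" href="race?raceid=1234">Mushroom Cup</a>',
    "single quotes": "<a href='race?raceid=1234'>Mushroom Cup</a>",
}

results = {label: pattern.findall(markup) for label, markup in variants.items()}
for label, found in results.items():
    print(label, found)
# original [('1234', 'Mushroom Cup')]
# nested tag []
# extra attribute []
# single quotes []
```

An HTML parser treats all four as the same kind of element, so whether the regex is good enough depends on how much the input can vary.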
