Matching a URL in HTML using regex

Question

It's been a while since I've used regex, and I feel like this should be simple to figure out.

I have a web page full of links that look like string_to_match in the code below. I want to grab just the numbers in the links, like the "58" in string_to_match. For the life of me, I can't figure it out.

import re
string_to_match = '<a href="/ncf/teams/roster?teamId=58">Roster</a>'
re.findall('<a href="/ncf/teams/roster?teamId=(/d+)">Roster</a>',string_to_match)

Why, why, why do people keep trying to parse HTML with regular expressions?!? Use an HTML parser. It can find the tags you care about with the expected attributes, pull them out for you, and actually parse the URL to get the GET parameters, which gives you correct and largely self-documenting code. Even if the regex might be faster, unmaintainable and possibly wrong code is not an improvement.
– ShadowRanger, Commented Jan 19, 2017 at 3:50
Possible duplicate of RegEx match open tags except XHTML self-contained tags
– MK., Commented Jan 19, 2017 at 4:15

alecxe · Accepted Answer · 2017-01-19 03:59:54Z

1

Instead of using regular expressions for the whole job, you can combine two steps: HTML parsing (with BeautifulSoup) to locate the desired link and extract its href attribute value, and URL parsing, for which we'll use a regular expression in this case:

import re
from bs4 import BeautifulSoup

data = """
<body>
    <a href="/ncf/teams/roster?teamId=58">Roster</a>
</body>
"""

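soup = BeautifulSoup(data, "html.parser")
# Find the <a> tag whose text is "Roster" and read its href attribute.
# (`string=` is the current name of the older `text=` argument.)
link = soup.find("a", string="Roster")["href"]

# Pull the digits that follow "teamId=" out of the URL.
print(re.search(r"teamId=(\d+)", link).group(1))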

Prints 58.
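
The same two-step idea (parse the HTML, then parse the URL) can also be sketched with only the standard library, using html.parser to find the links and urllib.parse to read the query string instead of a regex. This is a minimal sketch, not part of the original answer: the class name TeamIdCollector and the second link in the sample data are made up for illustration, and it collects every teamId on the page rather than only the first "Roster" link.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class TeamIdCollector(HTMLParser):
    """Collect the teamId query parameter from every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.team_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # parse_qs maps each parameter name to a list of its values.
        values = parse_qs(urlparse(href).query).get("teamId")
        if values:
            self.team_ids.append(values[0])


# Hypothetical page: the second link is made up for illustration.
data = """
<body>
    <a href="/ncf/teams/roster?teamId=58">Roster</a>
    <a href="/ncf/teams/roster?teamId=61">Roster</a>
</body>
"""

collector = TeamIdCollector()
collector.feed(data)
print(collector.team_ids)  # ['58', '61']
```

Because the query string is parsed rather than pattern-matched, this keeps working if other parameters are added before or after teamId.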

edited Jan 19, 2017 at 3:59

answered Jan 19, 2017 at 3:54

alecxe


xvan · Answer · 2017-01-19 04:11:44Z

0

I would recommend using BeautifulSoup or lxml; it's worth the learning curve.

...but if you still want to use a regexp:

re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
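
For reference, the pattern in the question fails for two reasons: `/d` is a literal slash followed by `d` rather than the digit class `\d`, and the unescaped `?` makes the preceding `r` optional instead of matching a literal question mark. Below is a minimal sketch of the question's full-tag pattern with those two fixes applied, run against a two-link sample. The second link and its teamId are made up for illustration.

```python
import re

# Hypothetical sample: the second link is invented for this example.
html = '''
<a href="/ncf/teams/roster?teamId=58">Roster</a>
<a href="/ncf/teams/roster?teamId=61">Roster</a>
'''

# Raw string so the backslashes reach the regex engine unchanged:
# \? matches a literal question mark, \d+ matches one or more digits.
pattern = r'<a href="/ncf/teams/roster\?teamId=(\d+)">Roster</a>'

print(re.findall(pattern, html))  # ['58', '61']
```

This stricter pattern only matches links whose markup is exactly in this form; any extra attribute or different quoting in the tag will make it miss, which is the usual argument for an HTML parser.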

answered Jan 19, 2017 at 4:11

xvan
