0

I have a regexp and i want to add output of regexp to my url

for exmaple

url = 'blabla.com'
r = re.findall(r'<p>(.*?</a>))

r output - /any_string/on/any/server/

but a dont know how to make get-request with regexp output

blabla.com/any_string/on/any/server/
2
  • 1
    You can start by rereading what you asked, and formulate better. Commented Aug 6, 2010 at 12:57
  • <p>(.*?</a>) seems weird. Did you mean <a>(.*?</a>) ?? In all cases, I don't really get your question. Could you reformulate ? Commented Aug 6, 2010 at 13:49

2 Answers 2

2

Don't use regex to parse html. Use a real parser.

I suggest using the lxml.html parser. lxml supports xpath, which is a very powerful way of querying structured documents. There's a ready-to-use make_links_absolute() method that does what you ask. It's also very fast.

As an example, in this question's page HTML source code (the one you're reading now) there's this part:

<li><a id="nav-tags" href="/tags">Tags</a></li>

The xpath expression //a[@id='nav-tags']/@href means: "Get me the href attribute of all <a> tags with id attribute equal to nav-tags". Let's use it:

from lxml import html

url = 'http://stackoverflow.com/questions/3423822/python-url-regexp'

doc = html.parse(url).getroot()
doc.make_links_absolute()
links = doc.xpath("//a[@id='nav-tags']/@href")
print links

The result:

['http://stackoverflow.com/tags']
Sign up to request clarification or add additional context in comments.

Comments

0

Just get beautiful soup:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing+a+Document

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
soup.findAll('p')

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.