I am trying to parse a web page with lxml to extract the href values of links like this one:

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah 

(There are many of these, and I want all of them in some tokenized form.) Unfortunately, the HTML is very large and somewhat complicated, so walking down the tree by hand would take me a while just to sort out the nested elements. Is there an easy way to retrieve just the href values?

Thanks!

  • What is the problem, actually? You can get element attributes with the .attrib attribute, e.g. elem.attrib['href']. Commented Feb 2, 2013 at 15:58
  • If lxml breaks on the sources, try BeautifulSoup. Commented Feb 2, 2013 at 16:32

1 Answer

If you just want the href values of the a tags, use an XPath query:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

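import lxml.html

# Parse the HTML string into an element tree.
tree = lxml.html.fromstring(data)

# //a matches every <a> element at any depth; /@href selects its href attribute.
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']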
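
If you need the elements themselves rather than bare strings, you can also iterate over them and read the attribute, as the first comment suggests. This is only a minimal sketch, assuming lxml is installed; the sample markup and the helper name extract_hrefs are made up for illustration and are not from the question.

```python
import lxml.html


def extract_hrefs(html_text):
    """Return the href value of every <a> element that has one (hypothetical helper)."""
    tree = lxml.html.fromstring(html_text)
    # iter('a') walks the whole tree, so nesting depth does not matter.
    # .get() returns None when the attribute is missing, so those are skipped.
    return [a.get('href') for a in tree.iter('a') if a.get('href') is not None]


# Made-up sample with nested links and one anchor that has no href.
sample = """<div><p>blah <a href="/first" title="x">one</a></p>
<ul><li><a href="/second">two</a></li><li><a name="anchor">no href</a></li></ul></div>"""

hrefs = extract_hrefs(sample)
print(hrefs)
# ['/first', '/second']
```

The result is a plain Python list, so it can be passed on to whatever tokenizing step comes next.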
1 Comment

What does the //a/@href do? In my case, there are two spaces between a and href, not one.
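
On the question in this comment: //a selects every a element anywhere in the document, and /@href then selects the href attribute of each one. The query runs against the parsed tree rather than the raw text, so extra whitespace inside the tag should not matter. A small sketch to check this, assuming lxml is installed (the two sample strings are invented for the comparison):

```python
import lxml.html

# Same markup, differing only in the whitespace between "a" and "href".
one_space = '<p><a href="x.html" title="t">link</a></p>'
two_spaces = '<p><a  href="x.html" title="t">link</a></p>'

# The parser normalizes the tag before XPath ever sees it,
# so both queries operate on equivalent trees.
result_one = lxml.html.fromstring(one_space).xpath('//a/@href')
result_two = lxml.html.fromstring(two_spaces).xpath('//a/@href')

print(result_one == result_two)
# True
```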
