Parse HTML using LXML in Python

Question

I am trying to parse a website for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

There are many of these, and I want all of them in some tokenized form. Unfortunately, the HTML is very large and somewhat complicated, so walking down the tree by hand would take me a while just to sort out the nested elements. Is there an easy way to retrieve just these values?

Thanks!

What is the problem, actually? You can get element attributes with the .attrib attribute, e.g. elem.attrib['href'].
– Martijn Pieters, Commented Feb 2, 2013 at 15:58
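
To illustrate the .attrib approach from the comment above, here is a minimal sketch. It assumes the markup from the question is already in a string (the variable names are made up for illustration) and uses iter('a') so the nesting never has to be walked by hand.

```python
import lxml.html

# Sample markup copied from the question; a real page would be much larger.
html = '''blahblahblah
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah'''

tree = lxml.html.fromstring(html)

# iter('a') visits every <a> element in the tree, however deeply nested.
# The membership check skips anchors that carry no href attribute.
hrefs = [a.attrib['href'] for a in tree.iter('a') if 'href' in a.attrib]
print(hrefs)
```

The a.get('href') method is an alternative that returns None instead of raising KeyError when the attribute is missing.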

Jon Clements · Accepted Answer · 2013-02-02 15:59:17Z


If you just want the href values of the a tags, then use:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

import lxml.html
tree = lxml.html.fromstring(data)
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']


1 Comment

user1922956 Over a year ago

What does the //a/@href do? In my case, there are two spaces between a and href, not one.
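
On the question in this comment: in XPath, //a selects every a element at any depth in the document, and /@href then selects the href attribute of each one. The expression is evaluated against the parsed tree, not the raw text, so the amount of whitespace between a and href in the source should not matter. A small sketch to check this, using made-up sample markup and a hypothetical helper named hrefs:

```python
import lxml.html

# Hypothetical sample markup: the same link written with one and with two spaces.
one_space = '<p>x <a href="/one" title="t">link</a> y</p>'
two_spaces = '<p>x <a  href="/one" title="t">link</a> y</p>'

# A nested link, plus an anchor with no href that the expression skips.
nested = '<div><ul><li><a  href="/deep">x</a></li></ul><a name="top">no href</a></div>'

def hrefs(markup):
    """Return the href value of every <a> element found in the markup."""
    return lxml.html.fromstring(markup).xpath('//a/@href')

print(hrefs(one_space))
print(hrefs(two_spaces))
print(hrefs(nested))
```

Both spellings of the link give the same result, and the nested link is found without any manual traversal.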
