pulling multiple values from python ElementTree with lxml and xpath

Question

I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.

I am web-scraping. The pages I am scraping have the following salient elements:

<div class='parent'>
   <span class='title'>
      <a>THIS IS THE TITLE</a>
   </span>
   <div class='copy'>
      <p>THIS IS THE COPY</p>
   </div>
</div>

My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I would like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY').

Below is my code:

## 'tree' is the ElementTree of the document I've just pulled 
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)

arr = []

for i in filtered_html:

   title_filter = "//span[@class='title']/a/text()"   # xpath for title text
   copy_filter = "//div[@class='copy']/p/text()"      # xpath for copy text

   title = i.getroottree().xpath(title_filter)
   copy = i.getroottree().xpath(copy_filter)
   arr.append((title, copy))

I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.

What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.

I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?

Peter DeGlopper · Accepted Answer · 2013-05-24 16:17:14Z

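Your core problem is that the getroottree call you're making on each parent div resets you to running your xpath over the whole tree. getroottree does exactly what it sounds like: it returns the ElementTree that contains the element you call it on. Leaving that call out is not quite enough on its own, though. An expression that starts with // is absolute, so it still searches from the document root even when you call xpath on a single element. Prefix the expressions with a dot (for example .//span[@class='title']/a/text()) to make them relative to i, and call i.xpath directly.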
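To make the difference concrete, here is a minimal sketch that runs the same expression on one parent div, once as an absolute path and once as a relative one. The sample markup and the variable names (doc, parents, absolute, relative, pairs) are made up for illustration, and it assumes the page is parsed with lxml.html.fromstring, which the question does not show.

```python
from lxml import html

# Hypothetical sample document with two parent blocks (not the real page).
doc = html.fromstring("""
<div>
  <div class='parent'>
    <span class='title'><a>TITLE 1</a></span>
    <div class='copy'><p>COPY 1</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE 2</a></span>
    <div class='copy'><p>COPY 2</p></div>
  </div>
</div>
""")

parents = doc.xpath("//div[@class='parent']")
first = parents[0]

# Absolute path: searches the whole document even though it is called on `first`.
absolute = first.xpath("//span[@class='title']/a/text()")

# Relative path: the leading dot limits the search to descendants of `first`.
relative = first.xpath(".//span[@class='title']/a/text()")

# The loop from the question, with relative paths and no getroottree call.
pairs = [
    (p.xpath(".//span[@class='title']/a/text()"),
     p.xpath(".//div[@class='copy']/p/text()"))
    for p in parents
]

print(absolute)  # both titles
print(relative)  # only the title inside `first`
print(pairs)
```

Each entry of pairs holds two one-item lists here, which matches what the question expected to get per iteration.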

I personally would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to ensure that I receive only one title and one copy.

My (untested!) code would look like this:

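# find/findtext/iterfind use ElementPath, which rejects absolute paths on an
# element, so every path starts with .// to stay relative to where it is called.
parent_div_path = ".//div[@class='parent']"
title_path = ".//span[@class='title']/a"
copy_path = ".//div[@class='copy']/p"
arr = [(i.findtext(title_path), i.findtext(copy_path)) for i in tree.iterfind(parent_div_path)]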
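For a version of the iterfind/findtext approach that can be run as-is, here is a small sketch. The sample markup and the helper name extract_pairs are hypothetical, and it assumes the document is parsed with lxml.html.fromstring. The second parent deliberately has no copy block, to show what findtext gives back in that case.

```python
from lxml import html

# Hypothetical sample markup; the second parent has no copy block.
doc = html.fromstring("""
<div>
  <div class='parent'>
    <span class='title'><a>TITLE 1</a></span>
    <div class='copy'><p>COPY 1</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE 2</a></span>
  </div>
</div>
""")


def extract_pairs(root):
    """Return one (title, copy) tuple per parent div found under root."""
    pairs = []
    for parent in root.iterfind(".//div[@class='parent']"):
        # findtext returns the text of the first match, or None if nothing matches.
        title = parent.findtext(".//span[@class='title']/a")
        copy = parent.findtext(".//div[@class='copy']/p")
        pairs.append((title, copy))
    return pairs


pairs = extract_pairs(doc)
print(pairs)
```

Because findtext stops at the first match, each tuple holds plain strings (or None) rather than lists, so no unwrapping is needed afterwards.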

Alternately, you could skip explicit iteration entirely:

title_filter = "//div[@class='parent']/span[@class='title']/a/text()"
copy_filter = "//div[@class='parent']/div[@class='copy']/p/text()"
arr = list(zip(tree.xpath(title_filter), tree.xpath(copy_filter)))

Note that text() only works with the xpath method. findall uses the more limited ElementPath syntax, which does not support text(), so if you prefer findall, drop text() from the expressions, start them with .// instead of //, and read .text in generator expressions instead:

title_path = ".//div[@class='parent']/span[@class='title']/a"
copy_path = ".//div[@class='parent']/div[@class='copy']/p"
arr = list(zip((title.text for title in tree.findall(title_path)), (copy.text for copy in tree.findall(copy_path))))

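And you might need to tweak that xpath if having more than one title/copy pair in a parent div is a possibility. The two result lists are paired purely by position, so they also drift out of step if any parent div is missing its title or its copy.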
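As an illustration of that caveat, the sketch below compares the flat zip approach with a per-parent query on a document where one parent has no copy. The markup and names (doc, flat, grouped) are invented for the example. Wrapping the relative path in the XPath string() function is one possible way to get a single string per parent, not the only one.

```python
from lxml import html

# Hypothetical sample markup; the middle parent has a title but no copy.
doc = html.fromstring("""
<div>
  <div class='parent'>
    <span class='title'><a>TITLE 1</a></span>
    <div class='copy'><p>COPY 1</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE 2</a></span>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE 3</a></span>
    <div class='copy'><p>COPY 3</p></div>
  </div>
</div>
""")

# Flat approach: two document-wide lists, paired by position only.
titles = doc.xpath("//div[@class='parent']/span[@class='title']/a/text()")
copies = doc.xpath("//div[@class='parent']/div[@class='copy']/p/text()")
flat = list(zip(titles, copies))

# Per-parent approach: string() yields '' when the node is missing,
# so every parent still produces exactly one tuple.
grouped = [
    (p.xpath("string(./span[@class='title']/a)"),
     p.xpath("string(./div[@class='copy']/p)"))
    for p in doc.xpath("//div[@class='parent']")
]

print(flat)     # TITLE 2 ends up paired with COPY 3
print(grouped)  # TITLE 2 is paired with an empty string
```

The flat version silently pairs the wrong items and drops the last title, which is why grouping by parent is the safer default when the markup is not perfectly regular.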
