I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.

I am web-scraping. The pages I am scraping have the following salient elements:

<div class='parent'>
   <span class='title'>
      <a>THIS IS THE TITLE</a>
   </span>
   <div class='copy'>
      <p>THIS IS THE COPY</p>
   </div>
</div>

My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I should like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY')

Below is my code

## 'tree' is the ElementTree of the document I've just pulled 
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)

arr = []

for i in filtered_html:

   title_filter = "//span[@class='title']/a/text()"  # xpath for title text
   copy_filter = "//div[@class='copy']/p/text()"      # xpath for copy text

   title = i.getroottree().xpath(title_filter)
   copy = i.getroottree().xpath(copy_filter)
   arr.append((title, copy))

I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.

What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.

I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?

1 Answer

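Your core problem has two parts. First, the getroottree call you're making on each parent div resets you to running your XPath over the whole tree: getroottree does exactly what it sounds like and returns the ElementTree of the whole document that the element belongs to. Second, an XPath expression that starts with // is absolute, so it searches from the document root even when you call xpath on an individual element. Drop the getroottree call and start each inner expression with .// (for example .//span[@class='title']/a/text()), and you should get one title and one copy per parent.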
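To make the difference between the absolute and the relative path concrete, here is a minimal sketch. It assumes lxml is installed; the two-parent sample document, the names HTML, scrape, pairs and document_wide are all made up for illustration, and the class names follow the markup in the question.

```python
from lxml import etree

# Hypothetical sample document with two parent divs, modelled on the question's markup.
HTML = """
<html><body>
  <div class='parent'>
    <span class='title'><a>TITLE ONE</a></span>
    <div class='copy'><p>COPY ONE</p></div>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE TWO</a></span>
    <div class='copy'><p>COPY TWO</p></div>
  </div>
</body></html>
"""

tree = etree.fromstring(HTML.strip()).getroottree()


def scrape(tree):
    """Return one (title, copy) tuple per div with class 'parent'."""
    pairs = []
    for parent in tree.xpath("//div[@class='parent']"):
        # The leading '.' anchors each search at the current parent div.
        titles = parent.xpath(".//span[@class='title']/a/text()")
        copies = parent.xpath(".//div[@class='copy']/p/text()")
        # Take the first match, or None when the parent has no such child.
        pairs.append((titles[0] if titles else None,
                      copies[0] if copies else None))
    return pairs


pairs = scrape(tree)

# For contrast: an absolute path run from an element still matches document-wide.
first_parent = tree.xpath("//div[@class='parent']")[0]
document_wide = first_parent.xpath("//span[@class='title']/a/text()")
```

With this sample, pairs holds one tuple per parent div, while document_wide contains both titles even though xpath was called on the first parent only.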

I personally would use the iterfind method on the element tree for my main loop, and would probably use the findtext method on the resulting elements to ensure that I receive only one title and one copy.

My (untested!) code would look like this:

# find/iterfind/findtext take relative ElementPath expressions, hence the leading '.'
parent_div_xpath = ".//div[@class='parent']"
title_filter = ".//span[@class='title']/a"
copy_filter = ".//div[@class='copy']/p"
arr = [(i.findtext(title_filter), i.findtext(copy_filter)) for i in tree.iterfind(parent_div_xpath)]

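Alternatively, you could skip explicit iteration entirely:

title_filter = "//div[@class='parent']/span[@class='title']/a/text()"
copy_filter = "//div[@class='parent']/div[@class='copy']/p/text()"
arr = list(zip(tree.xpath(title_filter), tree.xpath(copy_filter)))

Note that this uses xpath rather than findall: findall only understands the limited ElementPath syntax, which has no text() function. (On Python 2, itertools.izip is the lazy equivalent of zip.) If you would rather stay with findall, drop text() from the paths, make them relative, and read .text in a generator expression, something like:

title_path = ".//div[@class='parent']/span[@class='title']/a"
copy_path = ".//div[@class='parent']/div[@class='copy']/p"
arr = list(zip((title.text for title in tree.findall(title_path)), (copy.text for copy in tree.findall(copy_path))))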

You might also need to tweak that XPath if a parent div can contain more than one title/copy pair.
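A related pitfall when pairing two document-wide lists is that they are matched purely by position, so a parent div that lacks a title or a copy shifts every later pair. The sketch below illustrates this under some assumptions: it uses the standard library's xml.etree.ElementTree so that it runs without lxml (lxml's find, findall and findtext accept the same ElementPath expressions), and the sample document and the names DOC, zipped and per_parent are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the first parent div has a title but no copy.
DOC = """
<body>
  <div class='parent'>
    <span class='title'><a>TITLE ONE</a></span>
  </div>
  <div class='parent'>
    <span class='title'><a>TITLE TWO</a></span>
    <div class='copy'><p>COPY TWO</p></div>
  </div>
</body>
"""

root = ET.fromstring(DOC.strip())

title_path = ".//div[@class='parent']/span[@class='title']/a"
copy_path = ".//div[@class='parent']/div[@class='copy']/p"

# Pairing two document-wide lists: purely positional, stops at the shorter list.
zipped = list(zip((t.text for t in root.findall(title_path)),
                  (c.text for c in root.findall(copy_path))))

# Pairing per parent: findtext returns None when the child is missing,
# so each tuple stays tied to its own parent div.
per_parent = [
    (p.findtext("span[@class='title']/a"), p.findtext("div[@class='copy']/p"))
    for p in root.iterfind(".//div[@class='parent']")
]
```

Here zipped ends up pairing the first title with the second parent's copy, whereas per_parent keeps one tuple per parent and marks the missing copy with None. If your pages can have incomplete parents, the per-parent loop is probably the safer choice.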
