I am almost certainly doing this horribly wrong, and the cause of my problem is my own ignorance, but reading python docs and examples isn't helping.
I am web-scraping. The pages I am scraping have the following salient elements:
<div class='parent'>
<span class='title'>
<a>THIS IS THE TITLE</a>
</span>
<div class='copy'>
<p>THIS IS THE COPY</p>
</div>
</div>
My objective is to pull the text nodes from 'title' and 'copy', grouped by their parent div. In the above example, I should like to retrieve a tuple ('THIS IS THE TITLE', 'THIS IS THE COPY')
Below is my code
## 'tree' is the ElementTree of the document I've just pulled
xpath = "//div[@class='parent']"
filtered_html = tree.xpath(xpath)
arr = []
for i in filtered_html:
title_filter = "//span[@class='author']/a/text()" # xpath for title text
copy_filter = "//div[@class='copy']/p/text()" # xpath for copy text
title = i.getroottree().xpath(title_filter)
copy = i.getroottree().xpath(copy_filter)
arr.append((title, copy))
I'm expecting filtered_html to be a list of n elements (which it is). I'm then trying to iterate over that list of elements and for each one, convert it to an ElementTree and retrieve the title and copy text with another xpath expression. So at each iteration, I'm expecting title to be a list of length 1, containing the title text for element i, and copy to be a corresponding list for the copy text.
What I end up with: at every iteration, title is a list of length n containing all elements in the document matching the title_filter xpath expression, and copy is a corresponding list of length n for the copy text.
I'm sure that by now, anyone who knows what they're doing with xpath and etree can recognise I'm doing something horrible and mistaken and stupid. If so, can they please tell me how I should be doing this instead?