I am writing a Scrapy program to extract the data.

This is the URL, and I want to scrape the code value (20111028013117). I took the XPath from the Firefox add-on XPather. This is the path:

/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]

When I try to execute this:

try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"

It returns an empty list. I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; I have handled issues well on other projects, but this one seems to be a bit difficult.

4 Answers

9

Your XPath doesn't work because of tbody. Remove it from the expression and check whether you get the result you want.

You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html

Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.
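
As a rough illustration of the fix, the tbody steps can be stripped from a browser-generated path before it is handed to Scrapy. This is only a sketch in Python 3: `strip_tbody` is a hypothetical helper name, not part of Scrapy, and it assumes the page source really has no <tbody> elements (if it does, stripping them would break the path instead).

```python
import re


def strip_tbody(xpath):
    """Remove every /tbody location step (hypothetical helper, not a Scrapy API)."""
    # Matches "/tbody" or "/tbody[1]" when it is followed by "/" or the end of the path.
    return re.sub(r"/tbody(?:\[\d+\])?(?=/|$)", "", xpath)


# The path copied from XPather in the question.
browser_xpath = (
    "/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]"
    "/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]"
)

cleaned = strip_tbody(browser_xpath)
print(cleaned)
```

The cleaned string can then be passed to the selector call from the question in place of the original path.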

2 Comments

Thanks for posting the excerpt from the Scrapy docs. It might improve the answer to suggest a replacement path which would work better.
Chrome adds <tbody>, too. Just got my code working after I saw this answer. Thanks!
3

I see that the element you are hunting for is inside a <table>.

Firefox adds a tbody tag to every table, even if it does not exist in the source HTML. That might be the reason your XPath query works in the browser but fails in Scrapy.

As suggested, use other anchors in your xpath query.
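
The mismatch can be reproduced without a browser. The sketch below uses Python 3's standard-library `xml.etree.ElementTree` as a stand-in for Scrapy's selector, and the markup is a made-up example, not the page from the question. The point it tries to show is only that a path containing tbody matches nothing when the source has no <tbody>.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source: note that it contains no <tbody> element.
SOURCE = """<html><body><table>
<tr><td>Name</td><td>Example</td></tr>
<tr><td>Code</td><td>20111028013117</td></tr>
</table></body></html>"""

root = ET.fromstring(SOURCE)  # root is the <html> element

# Path as a browser would report it, with the tbody it inserted into its DOM.
with_tbody = root.findall("./body/table/tbody/tr[2]/td[2]")
# Same path with the tbody step removed, matching the raw source.
without_tbody = root.findall("./body/table/tr[2]/td[2]")

print([td.text for td in with_tbody])     # []
print([td.text for td in without_tbody])  # ['20111028013117']
```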

2

You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.

For the data you are matching, this XPath would do a lot better:

//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()

This will match the <font> tag containing "Code", then go to the td tag above it and select the next td -> font, which contains the code you are looking for.
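
The same anchor-on-the-label idea can be sketched in plain Python 3. This is not the XPath above: the standard-library `xml.etree.ElementTree` does not support `contains()` or the `following-sibling` axis, so the lookup is written as a loop instead. The markup and the helper name `value_after_label` are assumptions made for illustration, and the helper only looks at the immediately following cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the layout described above.
SOURCE = """<table>
<tr><td><font>Name</font></td><td><font>Example</font></td></tr>
<tr><td><font>Code</font></td><td><font>20111028013117</font></td></tr>
</table>"""


def value_after_label(root, label):
    """Return the <font> text of the cell right after the cell whose <font> contains label."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        for index, cell in enumerate(cells[:-1]):
            font = cell.find("font")
            if font is not None and label in (font.text or ""):
                next_font = cells[index + 1].find("font")
                if next_font is not None:
                    return next_font.text
    return None  # no labelled cell found


code = value_after_label(ET.fromstring(SOURCE), "Code")
print(code)
```

Because the lookup depends on the label text and not on the position of the table in the page, it keeps working when surrounding layout tables are added or removed.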

1

Have you tried removing a few node tags at the end of the query, and re-running until you get a result? Do this several times until you get something, then add items back in cautiously until the query is rectified.
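
One way to make that trimming systematic is to generate every shorter prefix of the path and try them in order. A minimal Python 3 sketch follows; `xpath_prefixes` is a hypothetical helper, and it assumes a simple absolute path with no `//` steps and no slashes inside predicates.

```python
def xpath_prefixes(xpath):
    """Yield the full path first, then each shorter prefix (hypothetical helper)."""
    steps = [step for step in xpath.split("/") if step]
    for length in range(len(steps), 0, -1):
        yield "/" + "/".join(steps[:length])


for candidate in xpath_prefixes("/html/body/table/tr[2]/td[2]"):
    # In a spider, each candidate would be passed to the selector
    # until one of them returns a non-empty result.
    print(candidate)
```

The first prefix that returns a result marks the last step that matches the real page source; the step after it is the one to fix.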

Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.
