Parsing HTML with XPath, Python and Scrapy

Question

I am writing a Scrapy program to extract the data.

This is the URL, and I want to scrape the code value (20111028013117). I took the XPath from the Firefox add-on XPather. This is the path:

/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]

When I try to execute this:

try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"

It returns an empty list. I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; I have handled issues well in other projects, but this one seems a bit difficult.

matiskay · Accepted Answer · 2012-01-19 13:37:57Z

9

The reason your XPath doesn't work is the tbody steps. Remove them from the expression and check whether you get the result you want.

The Scrapy documentation explains this: http://doc.scrapy.org/en/0.14/topics/firefox.html

Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.
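A minimal sketch of the effect described in that excerpt. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, and the markup is a made-up fragment, not the asker's page. The only point it illustrates is that a path containing a tbody step matches nothing when the served source has no <tbody>.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source as served: the table has no <tbody>.
RAW_HTML = """
<html><body>
  <table>
    <tr><td>Code</td><td>20111028013117</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path as copied from a browser tool, which shows an inserted <tbody>.
with_tbody = root.findall("./body/table/tbody/tr/td")

# Same path with the tbody step removed, matching the raw source.
without_tbody = root.findall("./body/table/tr/td")

print([td.text for td in with_tbody])
print([td.text for td in without_tbody])
```

The first list comes back empty and the second holds the two cell texts. The same reasoning applies to the long path in the question: each /tbody step should be dropped, unless the served HTML really contains one.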

answered Jan 19, 2012 at 13:37

matiskay



2 Comments

Jim DeLaHunt Over a year ago

Thanks for posting the excerpt from the Scrapy docs. It might improve the answer to suggest a replacement path which would work better.

Rajath Over a year ago

Chrome adds <tbody>, too. Just got my code working after I saw this answer. Thanks!

warvariuc · Accepted Answer · 2011-11-01 13:19:48Z

3

I see that the element you are hunting for is inside a <table>.

Firefox adds a tbody tag to every table, even if it does not exist in the source HTML. That might be the reason your XPath query works in the browser but fails in Scrapy.

As suggested, use other anchors in your XPath query.

answered Nov 1, 2011 at 13:19

warvariuc



alecxe · Accepted Answer · 2013-07-24 17:34:46Z

2

You can extract the data more easily by writing a more robust XPath instead of using the output from XPather directly.

For the data you are matching, this XPath would do a lot better:

//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()

This matches the <font> element whose text contains "Code", moves up to its parent <td>, and then selects the <font> inside the following sibling <td>, which holds the code you are looking for.
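To make the navigation in that expression concrete, here is a rough sketch that performs the same steps by hand with the standard library. ElementTree does not support contains() or the following-sibling axis, so this is a spelled-out equivalent, not the XPath itself. The markup and the helper name value_after_label are assumptions for illustration, and the helper returns only the first following cell, whereas the XPath would select every following sibling <td>.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the layout the answer describes.
ROW = """
<table>
  <tr>
    <td><font>Code</font></td>
    <td><font>20111028013117</font></td>
  </tr>
</table>
"""

def value_after_label(root, label):
    """Return the <font> text in the <td> right after the labelled <td>."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        # The last cell has no following sibling, so skip it as a label.
        for i, cell in enumerate(cells[:-1]):
            font = cell.find("font")
            if font is not None and label in (font.text or ""):
                nxt = cells[i + 1].find("font")
                if nxt is not None:
                    return nxt.text
    return None

print(value_after_label(ET.fromstring(ROW), "Code"))
```

Anchoring on the visible label keeps the lookup independent of how many tables wrap the cell, which is why it tends to survive layout changes better than an absolute path.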

edited Jul 24, 2013 at 17:34

alecxe


answered Oct 31, 2011 at 13:30

Sjaak Trekhaak



halfer · Accepted Answer · 2011-10-31 08:58:42Z

1

Have you tried removing a few steps from the end of the query and re-running it? Repeat until you get a result, then add the steps back one at a time until you find the one that breaks the query.

Also, check that your target page validates as XHTML; an invalid page may upset the parser.
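The trimming approach above can be automated. The sketch below uses the standard library's ElementTree on a made-up page, not Scrapy's selector; the helper name longest_matching_prefix is hypothetical. It drops trailing steps until the path matches, so the step right after the returned prefix is the one to inspect.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source without <tbody>.
PAGE = ET.fromstring(
    "<html><body><table>"
    "<tr><td>Code</td><td>20111028013117</td></tr>"
    "</table></body></html>"
)

def longest_matching_prefix(root, path):
    """Drop trailing steps from `path` until it matches something under `root`."""
    steps = path.strip("/").split("/")
    while steps:
        candidate = "./" + "/".join(steps)
        if root.findall(candidate):
            return candidate
        steps.pop()  # remove the last step and try the shorter path
    return None

# root is <html>, so the path is written relative to it.
print(longest_matching_prefix(PAGE, "body/table/tbody/tr/td[2]"))
```

For this fragment the search stops at ./body/table, which points at the tbody step as the first one that fails.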

answered Oct 31, 2011 at 8:58

halfer

