Need help on extracting data using Xpath in my Python code

Question

Question 1

Here is the HTML code.

<div class="preferredContact paragraph">ph:<span preferredcontact="40">(02) 9540 9959</span></div>

I am trying to extract that phone number using xpath.

I have tried

data['phone'] = c.xpath('.//span[@preferredContact="40"]/text()')

and

data['phone'] = c.xpath('.//span[contains(@preferredContact,"40")]/text()')

Both of them return only null. Can someone show me the code to extract that phone number, please?

Question 2

HTML code is

<a rel="nofollow" title="View website for Ruth Newman Architect (in new window)" target="_blank" name="listing_website" id="websiteLink40" alreadysentorpevent="false" class="links ext-no-tooltip orpDuplicateEvent" href="/app/redirect?headingCode=27898&amp;productId=473639214&amp;productVersion=1&amp;listingUrl=%2Fnsw%2Fgymea-bay%2Fruth-newman-architect-12781682-listing.html&amp;webSite=http%3A%2F%2Fwww.ruthnewman.com.au&amp;pt=w&amp;context=businessTypeSearch&amp;referredBy=YOL&amp;eventType=websiteReferral">www.ruthnewman.com.au
</a>

I want to get the site address that follows the string webSite=http%3A%2F%2F in the href attribute's value. In the example above, that is www.ruthnewman.com.au. I do not know how to get it using XPath.

Can someone help out please?

Hey Thanks, that worked! Any help with that second question?
– Bhavani Kannan, Commented Jan 23, 2012 at 20:47
I think I misunderstood the second question, at first. Let me know if my edited answer addresses it.
– Wayne, Commented Jan 23, 2012 at 21:06

Wayne · Accepted Answer · 2012-01-23 21:06:07Z

Attribute names are case-sensitive in XPath. For the first question, use the all-lowercase name:

.//span[@preferredcontact='40']/text()
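To see the case sensitivity in action, here is a minimal sketch. It assumes c is the parsed div from the question and uses the standard library's xml.etree.ElementTree only so that it runs on its own. ElementTree supports a subset of XPath without text(), so the text is read from the element's .text attribute instead. The names HTML, wrong, right and phone are made up for the example.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it is well-formed, so ElementTree can parse it.
HTML = (
    '<div class="preferredContact paragraph">ph:'
    '<span preferredcontact="40">(02) 9540 9959</span></div>'
)

c = ET.fromstring(HTML)  # stands in for the element called `c` in the question

# Mixed-case attribute name: no match, because the attribute is all lowercase.
wrong = c.findall(".//span[@preferredContact='40']")

# Lowercase attribute name: matches the span.
right = c.findall(".//span[@preferredcontact='40']")

# ElementTree has no text() step, so read the text from the element itself.
phone = right[0].text if right else None

print(wrong)  # []
print(phone)  # (02) 9540 9959
```

If c is an lxml element, as the c.xpath(...) calls in the question suggest, the expression above can be passed to c.xpath directly. In that case the result of a text() query is a list of strings, so you would probably want its first item rather than the whole list in data['phone'].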

For the second question use:

substring-before(substring-after(
    .//a[contains(@href, 'webSite=')]/@href, 'webSite=http%3A%2F%2F'), '&')

This first selects everything after 'webSite=http%3A%2F%2F' in the attribute, then, using that as the input to substring-before, extracts everything before the first &, which should contain the target string.
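If you would rather do this step in Python after reading the href with a plain @href query, the same two-stage logic can be sketched with str.partition. This is a swapped-in plain-Python equivalent, not the XPath expression itself; MARKER, HREF and extract_website are names invented for the example, and HREF assumes the parser has already decoded &amp; to &.

```python
# Marker text taken from the question; the site name follows it, up to the next "&".
MARKER = "webSite=http%3A%2F%2F"


def extract_website(href: str) -> str:
    """Mirror substring-before(substring-after(href, MARKER), '&') in plain Python."""
    # substring-after: the text following the first occurrence of MARKER.
    _, found, after = href.partition(MARKER)
    if not found:
        return ""  # XPath 1.0 also yields "" when the marker is absent
    # substring-before: the text preceding the first "&".
    before, amp, _ = after.partition("&")
    return before if amp else ""  # "" when there is no "&", as in XPath 1.0


# The href value from the question, with &amp; already decoded to &.
HREF = (
    "/app/redirect?headingCode=27898&productId=473639214&productVersion=1"
    "&listingUrl=%2Fnsw%2Fgymea-bay%2Fruth-newman-architect-12781682-listing.html"
    "&webSite=http%3A%2F%2Fwww.ruthnewman.com.au&pt=w"
    "&context=businessTypeSearch&referredBy=YOL&eventType=websiteReferral"
)

print(extract_website(HREF))  # www.ruthnewman.com.au
```

Like the XPath version, this returns an empty string when webSite is the last parameter and no & follows it, so you may want to handle that case separately.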

Note that in your given examples the descendant-or-self (//) axis is not really needed. Try to avoid it whenever possible. The flexibility gained comes at the cost of precision and efficiency.
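As a rough illustration of the precision point, the sketch below compares a direct child step with a // search. The surrounding markup in PAGE (the outer listing div and the notes div) is hypothetical and only there to give // something extra to match. The standard library's ElementTree is used so the example runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: the div from the question plus an unrelated
# span that happens to carry the same attribute deeper in the tree.
PAGE = (
    '<div class="listing">'
    '<div class="preferredContact paragraph">ph:'
    '<span preferredcontact="40">(02) 9540 9959</span></div>'
    '<div class="notes"><p><span preferredcontact="40">see above</span></p></div>'
    '</div>'
)

root = ET.fromstring(PAGE)
c = root.find("div")  # first child div, i.e. the one from the question

# Child step from the context node: only spans directly under c.
direct = [s.text for s in c.findall("span[@preferredcontact='40']")]

# Descendant search from the root: every matching span at any depth.
anywhere = [s.text for s in root.findall(".//span[@preferredcontact='40']")]

print(direct)    # ['(02) 9540 9959']
print(anywhere)  # ['(02) 9540 9959', 'see above']
```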

edited Jan 23, 2012 at 21:06

answered Jan 23, 2012 at 20:46


1 Comment

Bhavani Kannan Over a year ago

I don't know why, but substring-before(substring-after( .//a[contains(@href, 'webSite=')]/@href, 'webSite=http%3A%2F%2F'), '&') throws an Invalid Syntax error.
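One possible cause, and this is only a guess since the failing code is not shown: the expression contains single quotes, so pasting it into a single-quoted Python string literal ends the string early and Python itself reports invalid syntax before any XPath is evaluated. The sketch below checks that with the built-in compile(); XPATH, compiles, single_quoted and double_quoted are names made up for the example.

```python
# The expression from the answer, stored safely in double-quoted literals.
XPATH = (
    "substring-before(substring-after("
    ".//a[contains(@href, 'webSite=')]/@href, 'webSite=http%3A%2F%2F'), '&')"
)


def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:
        return False


# Python source text that wraps the expression in single vs. double quotes.
single_quoted = "expr = '" + XPATH + "'"
double_quoted = 'expr = "' + XPATH + '"'

print(compiles(single_quoted))  # False: the inner ' characters end the literal early
print(compiles(double_quoted))  # True
```

If that is what happened, wrapping the expression in double quotes (or a triple-quoted string) should get past the syntax error.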
