Question 1

Here is the HTML code.

<div class="preferredContact paragraph">ph:<span preferredcontact="40">(02) 9540 9959</span></div> 

I am trying to extract that phone number using xpath.

I have tried

data['phone'] = c.xpath('.//span[@preferredContact="40"]/text()')

and

data['phone'] = c.xpath('.//span[contains(@preferredContact,"40")]/text()')

Both of them return only null. Can someone show me the code to extract that phone number, please?

Question 2

HTML code is

<a rel="nofollow" title="View website for Ruth Newman Architect (in new window)" target="_blank" name="listing_website" id="websiteLink40" alreadysentorpevent="false" class="links ext-no-tooltip orpDuplicateEvent" href="/app/redirect?headingCode=27898&amp;productId=473639214&amp;productVersion=1&amp;listingUrl=%2Fnsw%2Fgymea-bay%2Fruth-newman-architect-12781682-listing.html&amp;webSite=http%3A%2F%2Fwww.ruthnewman.com.au&amp;pt=w&amp;context=businessTypeSearch&amp;referredBy=YOL&amp;eventType=websiteReferral">www.ruthnewman.com.au
</a>

I want to get the URL that follows the string webSite=http%3A%2F%2F inside the href attribute's value. In the example above, that is www.ruthnewman.com.au. I do not know how to get that using XPath.

Can someone help out please?

  • Spelling error: "preferredcontact" vs. "preferredContact". Commented Jan 23, 2012 at 20:21
  • Hey Thanks, that worked! Any help with that second question? Commented Jan 23, 2012 at 20:47
  • I think I misunderstood the second question, at first. Let me know if my edited answer addresses it. Commented Jan 23, 2012 at 21:06

1 Answer

Attribute names in XPath are case-sensitive, and the attribute in your HTML is preferredcontact, all lowercase. For the first question, use the lowercase name:

.//span[@preferredcontact='40']/text()
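
As a rough illustration of the case-sensitivity point, here is a minimal sketch using Python's standard-library ElementTree rather than whatever parser the question's c object comes from (that substitution is an assumption made so the example runs without extra packages). ElementTree supports only a subset of XPath and has no /text() step, so the text is read from the matched element instead. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The markup from the question; it happens to be well-formed XML,
# so the stdlib parser can read it directly.
snippet = ('<div class="preferredContact paragraph">ph:'
           '<span preferredcontact="40">(02) 9540 9959</span></div>')
root = ET.fromstring(snippet)

# Lowercase attribute name, exactly as written in the markup: matches.
matches = root.findall(".//span[@preferredcontact='40']")
phone = matches[0].text if matches else None

# Mixed-case attribute name, as in the original attempts: no match.
miss = root.findall(".//span[@preferredContact='40']")

print(phone)
print(miss)
```

The second query finds nothing because the attribute name is compared character for character, including case.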

For the second question use:

substring-before(substring-after(
    .//a[contains(@href, 'webSite=')]/@href, 'webSite=http%3A%2F%2F'), '&')

This first selects everything after 'webSite=http%3A%2F%2F' in the attribute, then, using that as the input to substring-before, extracts everything before the first &, which should contain the target string.
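
The same two-step slicing can be sketched in plain Python, which may help when checking what the XPath expression should return. This is an illustration under assumptions, not the answer's method: str.partition stands in for substring-after and substring-before, and html.unescape stands in for the entity decoding an HTML parser would already have done on the attribute value. The urllib.parse variant at the end is an alternative technique that reads the webSite query parameter directly.

```python
from html import unescape
from urllib.parse import urlsplit, parse_qs

# href exactly as it appears in the question's markup (entities still escaped).
raw_href = (
    "/app/redirect?headingCode=27898&amp;productId=473639214&amp;productVersion=1"
    "&amp;listingUrl=%2Fnsw%2Fgymea-bay%2Fruth-newman-architect-12781682-listing.html"
    "&amp;webSite=http%3A%2F%2Fwww.ruthnewman.com.au&amp;pt=w"
    "&amp;context=businessTypeSearch&amp;referredBy=YOL&amp;eventType=websiteReferral"
)

# A parser returns the attribute value with &amp; already decoded to &.
href = unescape(raw_href)

# Step 1, like substring-after(href, 'webSite=http%3A%2F%2F').
after = href.partition("webSite=http%3A%2F%2F")[2]

# Step 2, like substring-before(after, '&').
# Difference: if no '&' follows, partition keeps the whole string,
# whereas XPath's substring-before would return an empty string.
site = after.partition("&")[0]

# Alternative: parse the query string and read the parameter by name.
# parse_qs also percent-decodes the value, so the scheme is included.
full_url = parse_qs(urlsplit(href).query)["webSite"][0]

print(site)
print(full_url)
```

The partition version returns the bare host name, while the query-string version returns the decoded URL with its http:// prefix, and does not depend on the parameter being followed by another one.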

Note that in your given examples the descendant-or-self (//) axis is not really needed. Try to avoid it whenever possible. The flexibility gained comes at the cost of precision and efficiency.

1 Comment

I don't know why, but substring-before(substring-after( .//a[contains(@href, 'webSite=')]/@href, 'webSite=http%3A%2F%2F'), '&') throws an Invalid Syntax error.