
I am using PhantomJS with Selenium to scrape data from the link given in the snippet. When I extract the data with element.text on a PhantomJS WebElement, some values in between come back blank, whereas with chromedriver I am able to scrape all the data.

I can only use a headless browser, since I am running this on an AWS Linux server.

How can I scrape all the data with PhantomJS without missing any values? Thank you in advance.

Below is the snippet attached

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException

# Start from the default PhantomJS capabilities and override the user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
driver = webdriver.PhantomJS(
    desired_capabilities=dcap,
    service_args=['--ignore-ssl-errors=true', '--load-images=false'])
driver.get("http://www.myntra.com/Dresses/Casual-Collection/Casual-Collection-by-Debenhams-Purple-Floral-Print-Maxi-Dress/348207/buy")
driver.implicitly_wait(5)
try:
    # Open the size chart, then collect every cell of the table.
    driver.find_element_by_class_name("size-buttons-show-size-chart").click()
    driver.implicitly_wait(10)
    div_s = driver.find_elements_by_class_name("size-chart-cell")
    # div_s = driver.find_elements_by_xpath("""//*[@id="mountRoot"]/div/div/div/div[3]/div/div[2]/div[1]/table/tbody/tr""")
    for s in div_s:
        print(str(s.text))
except NoSuchElementException:
    print("NoSuchElementException")

Modified output:

Size XS S M L XL XXL 3XL
Brand Size UK10 UK12 UK14 UK16 UK18 UK20 UK22
Hips (INCHES) 36 38 40 42.5 45.25 48 50.75
31 41.75 # most elements in this row are missing / could not be scraped
Bust (INCHES) 34.25 36.25 38 40 43.75 46.5 49.25

The actual table is: Size Chart

  • Maybe the waiting time is too short. Try driver.implicitly_wait(30). Commented Dec 28, 2016 at 14:11
  • I have already tried that, and it is not what I am asking about. Commented Dec 28, 2016 at 14:13

2 Answers


Interesting problem. Reading the textContent attribute instead of .text actually works in this case:

for s in div_s:
    print(str(s.get_attribute("textContent")))
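
If you would rather keep .text where it works, one option is to fall back to textContent only for cells whose .text comes back blank. The following is a minimal sketch under stated assumptions: FakeCell and cell_text are hypothetical names, and FakeCell only imitates the two WebElement members used here (.text and get_attribute) so that the snippet runs without a browser. Because textContent returns the node's raw text without collapsing whitespace, the helper also normalizes whitespace.

```python
class FakeCell(object):
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        # Only the attribute used in this sketch is emulated.
        return self._text_content if name == "textContent" else None


def cell_text(element):
    """Return element.text, falling back to textContent when it is blank."""
    visible = element.text.strip()
    if visible:
        return visible
    raw = element.get_attribute("textContent") or ""
    # textContent is not whitespace-normalized, so collapse runs of whitespace.
    return " ".join(raw.split())


cells = [
    FakeCell("Hips (INCHES)", "Hips (INCHES)"),
    FakeCell("", "\n  36 \n"),  # .text came back blank, textContent did not
]
print([cell_text(c) for c in cells])
```

With a real driver, the same helper could be applied to each element in div_s, for example print(cell_text(s)) inside the loop.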

Differences between .text, textContent and other properties are perfectly described here:

Note that there is no point in calling implicitly_wait() multiple times. It does not act like time.sleep(), so it does not pause for the given amount of time right away. Instead, it only tells the driver to set the "implicit wait" to the specified number of seconds:

An implicit wait is to tell WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements if they are not immediately available.

A better way to wait in this case would be to use Explicit Waits.
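
To make the idea concrete, here is a rough, library-free sketch of what an explicit wait does: it polls a condition until the condition returns something truthy or a timeout expires. In Selenium itself this role is played by WebDriverWait(driver, timeout).until(condition), usually together with the helpers in expected_conditions. The names wait_until and cells_present below are hypothetical, and the condition is faked so the snippet runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it is truthy or `timeout` seconds have passed."""
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Hypothetical condition: the "cells" only show up on the third poll,
# imitating a table that is rendered a moment after the click.
attempts = {"count": 0}


def cells_present():
    attempts["count"] += 1
    return ["XS", "S", "M"] if attempts["count"] >= 3 else []


cells = wait_until(cells_present, timeout=2.0, poll_interval=0.01)
print(cells)
```

Unlike the implicit wait, which is a driver-wide setting applied to every element lookup, a wait of this kind is applied at one specific point in the script and can check any condition you choose.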

2 Comments

@DineshSingh I am still not sure why .text was not able to retrieve the text of several cells. The table itself looks pretty normal - all the td elements have text nodes and they do not differ from each other. I guess this is pretty much PhantomJS-specific.
I also noticed that earlier. But with chromedriver it scrapes absolutely fine and no cell text is missing. I need to find out what the reason behind this is.

I think I found the answer/reason behind it.

Thanks for your reply @alecxe, I found my answer here:

The textContent property is "inherited" from the Node interface of the DOM Core specification. The text property is "inherited" from the HTML5 HTMLAnchorElement interface and is specified as "must return the same value as the textContent IDL attribute".

The two are probably retained to converge different browser behaviour; the text property for script elements is defined slightly differently.

Note that the DOM specification is a general specification for any kind of document (e.g. HTML, XML, SGML, etc.) whereas HTML5 is specifically for HTML that leverages and extends the DOM Core in many respects (some might say it's a "super set" of a few DOM specs plus HTML plus …).

Note that "inherited" does not mean "prototype inheritance", just the more general meaning of inherited.

Again, thank you for this.

Difference between text and textContent properties
