Scraping web data using PhantomJS and Selenium

Question

I am using PhantomJS with Selenium to scrape data from the link given in the snippet. When I extract the text with element.text (on a PhantomJS web element), some values in the middle come back blank, whereas with chromedriver I can scrape all the data.

I can only use a headless browser, since this runs on an AWS Linux server.

How can I scrape all the data with PhantomJS without missing any values? Thanks in advance.

The snippet is below:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
     "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
driver = webdriver.PhantomJS(desired_capabilities = dcap,service_args=['--ignore-ssl-errors=true', '--load-images=false'])
driver.get("http://www.myntra.com/Dresses/Casual-Collection/Casual-Collection-by-Debenhams-Purple-Floral-Print-Maxi-Dress/348207/buy")
driver.implicitly_wait(5)
try:
    driver.find_element_by_class_name("size-buttons-show-size-chart").click()
    driver.implicitly_wait(10)
    div_s = driver.find_elements_by_class_name("size-chart-cell")
    # div_s = driver.find_elements_by_xpath("""//*[@id="mountRoot"]/div/div/div/div[3]/div/div[2]/div[1]/table/tbody/tr""")
    # Print the text of each size-chart cell as reported by .text
    for s in div_s:
        print(s.text)
except NoSuchElementException:
    print("NoSuchElementException")

Modified output:

Size XS S M L XL XXL 3XL
Brand Size UK10 UK12 UK14 UK16 UK18 UK20 UK22
Hips (INCHES) 36 38 40 42.5 45.25 48 50.75
31 41.75 # most cells in this row are missing; they could not be scraped
Bust (INCHES) 34.25 36.25 38 40 43.75 46.5 49.25

Actual table is :

Maybe the waiting time is too short. Try driver.implicitly_wait(30).
– Guandan Chen, Commented Dec 28, 2016 at 14:11
I have already tried that, and it is not what my question is about.
– Dinu Duke, Commented Dec 28, 2016 at 14:13

Community · Accepted Answer · 2017-05-23 10:30:31Z


Interesting problem. Using textContent instead of .text actually works in this case:

for s in div_s:
    print(str(s.get_attribute("textContent")))
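One way to apply this without giving up .text entirely is a small helper that falls back to textContent only when .text comes back empty. This is a minimal sketch: cell_text and FakeCell are hypothetical names introduced here, and FakeCell is only a stand-in for a Selenium WebElement so the example runs without a browser.

```python
def cell_text(element):
    """Return element.text, or the textContent attribute if .text is empty."""
    visible = (element.text or "").strip()
    if visible:
        return visible
    # textContent is read straight from the DOM node.
    raw = element.get_attribute("textContent") or ""
    # Collapse runs of whitespace left over from the markup.
    return " ".join(raw.split())


class FakeCell(object):
    """Hypothetical stand-in for a WebElement (not part of Selenium)."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


# The second cell mimics the failure in the question: .text is blank.
cells = [FakeCell("Bust (INCHES)", "Bust (INCHES)"), FakeCell("", " 34.25 ")]
print([cell_text(c) for c in cells])
```

With a real driver, the loop from the question would call cell_text(s) in place of s.text.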

Differences between .text, textContent and other properties are perfectly described here:

Note that there is no point in calling implicitly_wait() multiple times. It does not act like time.sleep(): it does not pause for the given amount of time on the spot. It only sets the driver's "implicit wait" to the specified number of seconds:

An implicit wait is to tell WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements if they are not immediately available.
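To make the "poll for a certain amount of time" idea concrete, here is a rough sketch of a polling loop in plain Python. It is an illustration only, not Selenium's actual implementation; poll_until and its parameters are hypothetical names.

```python
import time


def poll_until(condition, timeout=5.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Returns as soon as the condition holds, unlike time.sleep().
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

The point of the sketch is that the timeout is an upper bound: if the condition is already true, the call returns immediately.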

A better way to wait in this case would be to use Explicit Waits.
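A minimal sketch of what that could look like for this page, assuming the class names from the question still match the markup. read_size_chart is a hypothetical helper name; WebDriverWait and expected_conditions are part of Selenium's Python bindings.

```python
def read_size_chart(driver, timeout=10):
    """Open the size chart and return the textContent of each cell."""
    # Imported here so this module loads even where selenium is absent.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait until the button is clickable, then click it.
    # until() raises TimeoutException if this takes longer than timeout.
    wait.until(EC.element_to_be_clickable(
        (By.CLASS_NAME, "size-buttons-show-size-chart"))).click()
    # Wait until at least one chart cell is present in the DOM.
    cells = wait.until(EC.presence_of_all_elements_located(
        (By.CLASS_NAME, "size-chart-cell")))
    return [(cell.get_attribute("textContent") or "").strip() for cell in cells]
```

Called as read_size_chart(driver) after driver.get(...), this would replace both implicitly_wait() calls in the question's snippet.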

edited May 23, 2017 at 10:30


answered Dec 28, 2016 at 14:17

alecxe


2 Comments

alecxe Over a year ago

@DineshSingh I'm still not sure why .text could not retrieve the text of several cells. The table itself looks pretty normal: all the td elements have text nodes and they do not differ from each other. I guess this is pretty much PhantomJS-specific.

Dinu Duke Over a year ago

I noticed that earlier too. With ChromeDriver the scrape works absolutely fine and no cell text is missing. I need to find out what is going on behind the scenes.

Community · Accepted Answer · 2017-05-23 11:55:16Z

I think I found the reason behind it.

Thanks for your reply, @alecxe. I found my answer here:

The textContent property is "inherited" from the Node interface of the DOM Core specification. The text property is "inherited" from the HTML5 HTMLAnchorElement interface and is specified as "must return the same value as the textContent IDL attribute".

The two are probably retained to converge different browser behaviour, the text property for script elements is defined slightly differently.

Note that the DOM specification is a general specification for any kind of document (e.g. HTML, XML, SGML, etc.) whereas HTML5 is specifically for HTML that leverages and extends the DOM Core in many respects (some might say it's a "super set" of a few DOM specs plus HTML plus …).

Note that "inherited" does not mean "prototype inheritance", just the more general meaning of inherited.

Thank you again for this.

Difference between text and textContent properties
