
I'm working on a small web scraping project that uses Selenium to scrape product information from a clothing website (https://www.asos.com/us/search/?q=shirt). After quite some trial and error I've been able to get most of the product information, but I'm having trouble scraping the image src values from the page source, even though the approach should be similar to the other fields I've already extracted (product name, price, product page URL, etc.). The following is the code snippet where I try to scrape the images from the page:

    imgsSrc = set()
    containers = driver.find_elements(By.CLASS_NAME, "productMediaContainer_kmkXR")
    for container in containers:
        image = container.find_element(By.TAG_NAME, 'img')
        print(image.get_attribute('src'))
        imgsSrc.add(image.get_attribute('src'))

This works for roughly the first 8 products, but then it fails. From what I've found about similar situations, this could be due to the site using lazy loading for the img tags. In the page source, products from around the 8th entry onward have a different img class name (a "Lazy"-style image class), and I think this is where it fails to grab the rest of the images, since every product from that point on uses that class.

I'm unsure if it matters, but before scraping anything my program also clicks the page's "load more" button until (if possible) ~216 products are displayed, and sets the product filter to a user-supplied value.

One thing I've tried is having the driver scroll to the end of the page before scraping the images, but I'm unsure whether the images only load into the page source when they're in the viewport.

From my understanding, the div class ("productMediaContainer_kmkXR") I'm pulling from each product isn't lazy loaded, but the img contained within it could be. (It's also possible that instead of an img tag the product has a video tag that still carries an image URL in its "poster" attribute.)

Currently, I'm just trying to figure out how to get ALL the images for the products on the page. I'm unsure whether the problem is that I'm not gradually scrolling the page while scraping, or something else.
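For reference, this is the direction I'm currently experimenting with: scrolling each container into view before reading its img, on the theory that the viewport triggers the lazy load. This is a sketch only; the class name is the one from above and may change, and `"class name"` / `"tag name"` are just the string values behind Selenium's `By.CLASS_NAME` / `By.TAG_NAME` constants.

```python
import time

def collect_image_srcs(driver, pause=0.3):
    """Sketch: scroll each product container into the viewport so its
    lazy-loaded <img> has a chance to load, then read its src."""
    srcs = set()
    # "class name" is the string behind By.CLASS_NAME
    containers = driver.find_elements("class name", "productMediaContainer_kmkXR")
    for container in containers:
        # Bring the container into view to trigger lazy loading
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'center'});", container)
        time.sleep(pause)  # give the image request a moment to finish
        img = container.find_element("tag name", "img")  # "tag name" == By.TAG_NAME
        srcs.add(img.get_attribute("src"))
    return srcs
```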

4 Comments
  • If lazy loading is the problem, the simple solution is to use a long enough sleep before getting the containers. (Usually a StaleElementReferenceException would be thrown if that were the case: the DOM would still be updating while you iterate through the containers, so a call to get_attribute could throw it.) Commented Jan 10, 2024 at 17:31
  • @pcalkins I was testing a bit more and printed the length of the container list I was getting back. I'm definitely getting all of the containers, but after the first 8 it blows up with: NoSuchElementException: Message: no such element: Unable to locate element: {"method":"tag name","selector":"img"}. Given this, I'm guessing that after the 8th container the img tags for subsequent products just aren't available, which is confusing, because when I inspect the page source the img tags are there regardless; it's just the class name that changes from the 8th product on. Commented Jan 10, 2024 at 18:42
  • They might be there when you inspect, but not when your script gets the containers. It is sort of odd that a stale element exception isn't being thrown, though. Commented Jan 10, 2024 at 18:51
  • @pcalkins Did a bit more testing and tried to catch the NoSuchElementException: I WAS able to get more than just the first 8 products. My script moves the page scroll around when it clicks the load more button, but for testing purposes I had it scroll to the bottom before starting any scraping. The significance is that it SEEMS like I CAN get more than just the first 8 images; it just depends on them loading around wherever the viewport is. I'm going to try slowly scrolling the page while scraping again, with a longer sleep. Commented Jan 10, 2024 at 19:00

2 Answers


After quite a bit of testing and Frankensteining together various solutions to similar problems, the following is the solution I landed on for my purposes.

Firstly, my issue was apparently that I needed to scroll the full page with the webdriver slowly enough for each image to load, as I believe they were indeed lazy loaded. The following code snippet is what I used to do so; it's a bit slow, but can likely be tweaked to be faster and still work:

        import time
        from selenium.webdriver.common.by import By
        from selenium.webdriver.common.keys import Keys

        imgsSrc = []
        driver.execute_script("window.scrollTo(0, 0);")  # Go to top of page
        SCROLL_PAUSE_TIME = 2  # How long to wait between scrolls
        while True:
            previous_scrollY = driver.execute_script('return window.scrollY')
            # driver.execute_script('window.scrollBy(0, 400)')  # Alternative scroll, a bit slower but reliable
            html = driver.find_element(By.TAG_NAME, 'html')
            html.send_keys(Keys.PAGE_DOWN)
            html.send_keys(Keys.PAGE_DOWN)
            html.send_keys(Keys.PAGE_DOWN)  # Faster scroll, inelegant but works (could translate to a pixel scroll like above)
            time.sleep(SCROLL_PAUSE_TIME)  # Give the images a bit of time to load

            # Compare the new scroll position with the last one; stop once it no longer changes
            if previous_scrollY == driver.execute_script('return window.scrollY'):
                break

This should scroll to the bottom of the page slowly, allowing the images in the containers to load for scraping. As I initially mentioned, I knew the particular page I was testing could have a video tag rather than an img tag for some products, with an image URL still available in the video's 'poster' attribute. The following is how I scraped the images and handled the video case:

        from selenium.common.exceptions import NoSuchElementException

        missingCount = 0  # How many images we missed (testing purposes)
        containers = driver.find_elements(By.CLASS_NAME, "productMediaContainer_kmkXR")
        print(len(containers))  # Make sure we're getting all the containers
        for container in containers:
            try:
                image = container.find_element(By.TAG_NAME, 'img')
                print(image.get_attribute('src'))
                imgsSrc.append(image.get_attribute('src'))
            except NoSuchElementException:  # Ideally this is a video rather than an image (otherwise we didn't give it time to load)
                print("Whoops - check if video")
                try:
                    video = container.find_element(By.TAG_NAME, 'video')
                    print(video.get_attribute('poster'))
                    imgsSrc.append(video.get_attribute('poster'))
                except NoSuchElementException:  # It wasn't a video - OR we didn't give it enough time to load
                    missingCount += 1
                    print("We're really broken")

        print(missingCount)
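If some images still come back empty, one refinement worth trying: many lazy-loading setups park the real URL in an attribute like data-src (and serve a tiny inline data: placeholder as src) until the image enters the viewport. The attribute names below are common conventions, not confirmed for this particular site, so treat this as a sketch:

```python
def best_image_url(element):
    """Sketch: return the first plausible image URL from a Selenium
    element, checking common lazy-load attribute names (assumptions,
    not confirmed for this site) before giving up."""
    for attr in ("src", "data-src", "data-lazy-src", "poster"):
        value = element.get_attribute(attr)
        # Skip missing values and inline "data:" placeholder images
        if value and not value.startswith("data:"):
            return value
    return None
```

This slots into the loop above in place of the direct get_attribute('src') call, and also covers the video/poster case in one pass.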

Thank you to everyone for their answers, and good luck to future readers who stumble upon this. I hope it's helpful; in my case it took a good deal of troubleshooting and piecing together of similar issues others were having.


1 Comment

I get the point above that you need to scroll slowly for the loading to happen fully, but could you speed up the loading itself?

You can get the product information, which includes the image URLs, from the script part of the page source, then download the images directly using those URLs.

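As a rough illustration of what this looks like: many shops embed the search results as a JSON blob inside a script tag, which you can pull out with a regex and parse. The variable name ("window.searchData") and the JSON shape ("products" / "imageUrl") below are hypothetical placeholders; inspect the actual page source to find the real names.

```python
import json
import re

def image_urls_from_html(html, marker="window.searchData"):
    """Sketch: extract image URLs from a JSON object assigned to a
    script variable. The marker name and the "products"/"imageUrl"
    keys are illustrative guesses, not the site's confirmed schema."""
    match = re.search(re.escape(marker) + r"\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    if not match:
        return []
    data = json.loads(match.group(1))
    return [p["imageUrl"] for p in data.get("products", []) if "imageUrl" in p]
```

This sidesteps lazy loading entirely, since the URLs are in the HTML whether or not the images have rendered.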

