
I'm currently web scraping my university webpage to download unit content. I've figured out how to collect the names/links to each unit, and am now trying to figure out how to collate the names/links to each individual module within a unit.

A rough description of the HTML on the modules page:

<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>

    <li id="" class="clearfix liItem read">
        <img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>
</ul>

So I am trying to grab the link inside the href attribute of the <a> tag within li/div/h3 and the name of the module within the span inside the <a> tag. Here is the relevant code snippet.

    modules = []
   
    driver.get(unit_url)

    module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")    #Grab the ul list

    li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  #Grab each li item

    for item in li_items[1:]:              #Skips first li tag as that is the Overview, not a module

        #These lines are not moving on from the first module for some reason...
        module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href')
        module_name = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a/span").text

        module = {
            "name": module_name,
            "url": module_url
        }

        modules.append(module)

The issue/question:

Edit

I've tried @sushii's and @QHarr's solutions with no luck unfortunately. I should point out that the lines grabbing module_name and module_url within the for loop are returning the same first module's data every LOOP. I've tested it with a different unit where the first couple of <li> tags are non-modules (introduction), and those should be returned, but it is still only returning the same module 1.

/edit

Edit 2

Here is a link to the HTML I am trying to scrape. This isn't the entire page, as that would be way too big.

<html><body><div></div><div></div><div></div><div> This is the DIV that is in the link </div><div></div><div></div></body></html>

I have verified that li_items definitely contains the <li> tags I need so the other HTML shouldn't be important (I think).

If you scroll about a quarter of the way down, the <li> tags I need are bolded and the information I need to scrape is underlined.

/Edit 2

The lines that grab the module_name and module_url within the for loop are only grabbing the info for the first module.

I have verified through debugging that li_items does contain all the li items and is not just grabbing the first one. I'm new to Selenium, so my thinking is that there is something wrong with the XPath I have provided, but it should only be grabbing the tags within the item iterable object. So I am confused as to why it keeps grabbing the first li item's info.

Answer Edit

Using @Sariq Shaikh's answer I've solved the issue. Initially his technique of indexing ([]) the elements to iterate over the <li> tags wasn't working, but altering the XPath used for module_url and module_name to include the <ul> tag, and then indexing the <li> tag, solved my issue.

However, I still do not understand why the original method was not working. Here is the altered code.

    module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")

    ctr = 1    #XPath indices start at 1, not 0

    for _ in module_ul.find_elements_by_tag_name('li'):

        try:
            module_url = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a').get_attribute('href')
            module_name = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a/span').text

        except SelException.NoSuchElementException:
            print("NoSuchElementException\n")
            ctr += 1
            continue

        modules.append({"name": module_name, "url": module_url})    #modules list as defined earlier
        ctr += 1    #advance the index on success as well, otherwise the loop re-reads the same <li>

4 Answers


To grab all the list items iteratively, you can use an XPath with an index, as shown below.

(//div[@class='item clearfix'])[1] #first li item index starts from 1 not 0
(//div[@class='item clearfix'])[2] #second li item
(//div[@class='item clearfix'])[3] #third li item
(//div[@class='item clearfix'])[4] #fourth li item

After getting each li item using its index, you can access its child elements by extending the XPath, as shown below.

(//div[@class='item clearfix'])[1]/h3/a #first li's h3/a tag

Considering this, you can update your code as shown below to use a simple counter and get the list elements by index.

modules = []
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")    #Grab the ul list
li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  #Grab each li item

counter = 1 #use counter to iterate over all the li items based on index
for item in li_items:
    #append counter values as index for list items in xpath
    module_url = item.find_element_by_xpath("(//div[@class='item clearfix'])["+str(counter)+"]/h3/a").get_attribute('href') 
    module_name = item.find_element_by_xpath("(//div[@class='item clearfix'])["+str(counter)+"]/h3/a/span").text

    module = {
        "name": module_name,
        "url": module_url
    }

    modules.append(module)
    counter = counter + 1

#remove the first item from the list as it's not required (it's the Overview)
modules.pop(0)
print(modules)

6 Comments

Hi Sariq, I have attempted the solution and it seems that the XPath returns a NoSuchElementException. I can confirm that li_items does indeed contain each of the <li> elements as a list of WebElement objects. In the loop I can confirm that item does contain the individual WebElement object for each <li> tag. The issue seems to be with the XPath I am using to find the tags I need within each of the <li> WebElement objects.
If you are getting a NoSuchElementException, you can add an explicit wait for the element first, as explained here: allselenium.info/wait-for-elements-python-selenium-webdriver
The thing is, the error occurs when I'm trying to iterate over the list of <li> tags in li_items. So the issue is something to do with the module_name and module_url item.find_element_by_xpath calls, because I can verify through debugging that li_items does contain the <li> tags I'm trying to find, but the XPath used for module_url and module_name just isn't working correctly.
Oh ok, then it's hard to tell without looking at the live page where the HTML is rendered.
Edit number 2 of my question has a link to the HTML, though not the full page. It's just the top-level div child of the html tag, which includes the <li> tags bolded about a quarter of the way down, with the info I need underlined. I've just realised that might not be as useful as the full HTML, because it can't be used with developer tools in a browser.

This is actually very easy with BeautifulSoup. Here is how you do it using BeautifulSoup:

from bs4 import BeautifulSoup
html = """
<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>

    <li id="" class="clearfix liItem read">
        <img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>
</ul>
"""
soup = BeautifulSoup(html,'html.parser')

lis = soup.find_all('li',class_ = 'clearfix liItem read')

for li in lis:
    print(li.div.h3.a['href'])

Output:

Link To Module
Link To Module

Hope that this helps!

EDIT:

Since your website is dynamically loaded using JavaScript, you should first open the URL in Selenium, get the HTML code of the website, and close the browser. Here is how you do it:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source    #the rendered HTML, after JavaScript has run
driver.quit()                #close the browser, as mentioned above

You can then parse this HTML using BeautifulSoup. Hope that this helps!
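
Putting the two steps together, a minimal sketch (assuming the same unit_url from the question and the <li> structure shown above):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(unit_url)          #unit_url as defined in the question
html = driver.page_source     #rendered HTML, after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')

modules = []
for li in soup.find_all('li', class_='clearfix liItem read'):
    a = li.div.h3.a           #<li>/<div>/<h3>/<a> as in the question's HTML
    modules.append({"name": a.span.get_text(strip=True), "url": a['href']})

print(modules)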

7 Comments

You can grab the source with selenium and pass it to BS4. Just replace html with driver.page_source.
Exactly! That's the same thing I mentioned in the edit.
Oh... Could you provide the URL of the actual website that you are trying to scrape? If you can't, or you are not allowed to, then the entire HTML of the website would do.
It would require logging in to view this page. I will make an edit to the question with a link to the dumped HTML.
@Sushil I just realised the HTML I posted isn't the most helpful because it can't be used with developer tools. I thought making it smaller would help. I'll make an edit soon with the full HTML.

You should be able to use css selectors and avoid a loop.

import pandas as pd

results = pd.DataFrame(zip([i.text for i in driver.find_elements_by_css_selector('#content_listContainer span')],
                           [i.get_attribute('href') for i in driver.find_elements_by_css_selector('#content_listContainer a')]),
                       columns=['Name', 'Link'])

print(results)
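
If the two element lists can ever get out of step (for example, stray spans or anchors elsewhere in the container), a variation that reads the name from inside each module's own <a> keeps the pairs aligned. A sketch, assuming the same driver session and with the h3 > a selector inferred from the question's HTML:

import pandas as pd

#read the name from inside each module's own <a>, so name/link pairs can't misalign
anchors = driver.find_elements_by_css_selector('#content_listContainer h3 > a')
rows = [(a.find_element_by_css_selector('span').text, a.get_attribute('href')) for a in anchors]

results = pd.DataFrame(rows, columns=['Name', 'Link'])
print(results)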

3 Comments

I attempted this solution but it was returning individual characters instead, for some reason beyond my knowledge.
That would happen if .text were treated as a list, but that shouldn't have happened, as zip is combining two lists generated by list comprehensions, and the .text and href calls are within those comprehensions.
you could set [i.text for i in driver.find_elements_by_css_selector('#content_listContainer span')] into a variable, and [i.get_attribute('href') for i in driver.find_elements_by_css_selector('#content_listContainer a')] into another variable, and then zip them and call pd.DataFrame on the result.

I've just run into a very similar issue, and while I'm not exactly sure why, I think I've found a solution:

If you replace

module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href')

with

module_url = item.find_element_by_xpath("./div[@class='item clearfix']/h3/a").get_attribute('href')

as in, replace the // with ./ at the start of your XPath (and make the same substitution in the module_name XPath), and then I think it should work. I tried it against the HTML you provided and it seems to work. Again, I'm really not sure why it works; I've tried looking into the XPath docs, but it's all Greek to me honestly.
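
For what it's worth, the likely reason: an XPath beginning with // searches the entire document even when find_element_by_xpath is called on an element, so every iteration matched the first module again, while ./ anchors the search to the current node. A minimal sketch of the question's loop with both XPaths made relative:

for item in li_items[1:]:    #skip the Overview <li>, as in the question
    #'./' makes the search relative to this <li> instead of the document root
    module_url = item.find_element_by_xpath("./div[@class='item clearfix']/h3/a").get_attribute('href')
    module_name = item.find_element_by_xpath("./div[@class='item clearfix']/h3/a/span").text
    modules.append({"name": module_name, "url": module_url})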

Comments
