I'm currently web scraping my university webpage to download unit content. I've figured out how to collect the name/link for each unit, and am now trying to collect the name/link for each individual module within a unit.
A rough outline of the HTML on the modules page:
<ul id="content_listContainer" class="contentList">
  <li id="" class="clearfix liItem read">
    <img>
    <div class="item clearfix">
      <h3>
        <a href="Link To Module">
          <span>Name of Module</span>
        </a>
      </h3>
    </div>
  </li>
  <li id="" class="clearfix liItem read">
    <img>
    <div class="item clearfix">
      <h3>
        <a href="Link To Module">
          <span>Name of Module</span>
        </a>
      </h3>
    </div>
  </li>
</ul>
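For reference, the extraction I'm after can be sketched offline against the sample markup above. This is only a sketch using lxml instead of Selenium (so it runs without a browser), and the sample string below just mirrors the structure shown:

```python
from lxml import html

# Minimal markup mirroring the structure above
snippet = """
<ul id="content_listContainer" class="contentList">
  <li class="clearfix liItem read">
    <div class="item clearfix">
      <h3><a href="Link To Module"><span>Name of Module</span></a></h3>
    </div>
  </li>
</ul>
"""

tree = html.fromstring(snippet)
modules = []
# The leading "." keeps each inner XPath relative to the current <li>
for li in tree.xpath("//ul[@id='content_listContainer']/li"):
    a = li.xpath(".//div[@class='item clearfix']/h3/a")[0]
    modules.append({"name": a.findtext("span"), "url": a.get("href")})

print(modules)  # [{'name': 'Name of Module', 'url': 'Link To Module'}]
```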
So I am trying to grab the link from the href attribute of the <a> tag inside li/div/h3, and the name of the module from the <span> inside that <a> tag. Here is the relevant code snippet:
modules = []
driver.get(unit_url)
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")  # Grab the ul list
li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  # Grab each li item
for item in li_items[1:]:  # Skip the first li tag as that is the Overview, not a module
    module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href')
    # These are not moving on from the first module for some reason...
    module_name = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a/span").text
    module = {
        "name": module_name,
        "url": module_url
    }
    modules.append(module)
The issue/question:
Edit
I've tried @sushii's and @QHarr's solutions with no luck, unfortunately. I should point out that the lines grabbing module_name and module_url within the for loop return the same first module's data on every loop iteration. I've tested it with a different unit where the first couple of <li> tags are non-modules (introduction), and those should be returned instead, but it still only returns the data for module 1.
/edit
Edit 2
Here is a link to the HTML I am trying to scrape. This isn't the entire page, as that would be way too big.
<html><body><div></div><div></div><div></div><div> This is the DIV that is in the link </div><div></div><div></div></body></html>
I have verified that li_items definitely contains the <li> tags I need so the other HTML shouldn't be important (I think).
If you scroll about a quarter of the way down, the <li> tags I need are bolded and the information I need to scrape is underlined.
/Edit 2
The lines that grab module_name and module_url within the for loop only ever grab the info for the first module.
I have verified through debugging that li_items contains all of the <li> items, not just the first one. I'm new to Selenium, so my thinking is that something is wrong with the XPath I have provided, but it should only be matching tags within the item iterable object, so I am confused as to why it keeps grabbing the first <li> item's info.
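The likely culprit is that an XPath beginning with // is absolute: it searches from the document root even when find_element_by_xpath is called on a WebElement, so it needs a leading . to be evaluated relative to that element. The same XPath semantics can be demonstrated with lxml (used here only because it runs without a browser; the markup is a made-up two-item list):

```python
from lxml import html

doc = html.fromstring(
    "<ul>"
    "<li><a href='module1'><span>Module 1</span></a></li>"
    "<li><a href='module2'><span>Module 2</span></a></li>"
    "</ul>"
)
second_li = doc.xpath("//li")[1]

# "//a" is absolute: it matches from the document root, so the first
# <a> in the whole tree is returned even though we start at the second <li>
absolute = second_li.xpath("//a")[0].get("href")

# ".//a" is relative to second_li, so it matches that item's own link
relative = second_li.xpath(".//a")[0].get("href")

print(absolute, relative)  # module1 module2
```

If this is what is happening, then item.find_element_by_xpath("//div[@class='item clearfix']/h3/a") keeps matching the first such <div> in the entire page, and prefixing those XPaths with a dot (".//div[@class='item clearfix']/h3/a") would scope them to each item.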
Answer Edit
Using @Sariq Shaikh's answer I've solved the issue. Initially his technique of indexing ([]) into the <li> tags to iterate over them wasn't working, but altering the XPaths used for module_url and module_name to start from the <ul> tag and then index into the <li> tags has solved my issue.
However, I still do not understand why the original method was not working. Here is the altered code.
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")
ctr = 1
for _ in module_ul.find_elements_by_tag_name('li'):
    try:
        # Index into the li tags from the ul to select each module in turn
        module_url = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a').get_attribute('href')
        module_name = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a/span').text
    except SelException.NoSuchElementException:
        print("NoSuchElementException\n")
        ctr += 1
        continue
    ctr += 1  # Advance to the next li on success as well