I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.
I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.
Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list so that later I can put all these lists in a data frame. I’ve therefore been looking for a way to number or ‘index’ (if that is the right term to use) each list so that I can refer to them later. Below is the code I have at the moment. Following the advice I found by reading existing answers on Stack Overflow, I’ve tried to use enumerate to create an index which I can assign to each list, as follows:
vacancy_headings = resultspage1_soup.body.findAll("a", class_="vacancy-link")
vacancydetails = []
for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index]=[]
    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)
But I get the following error message:
IndexError                                Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>()
      9     vacancypage_html = vacancypage_client.read()
     10     vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11     vacancydetails[index]=[]
     12
     13     for p in vacancypage_soup.select("p"):
IndexError: list assignment index out of range
Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?
Thanks!!
vacancydetails.append([])?
enumerate(), by default, starts from 0, so you don't need to explicitly do start=0.
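To spell out the append suggestion: an empty list has no position 0 yet, so assigning to vacancydetails[index] fails, whereas append creates the new slot at the end. Below is a minimal sketch of that idea, not the original scraper. The `pages` data and the names `details`, `row` and `error_message` are made up for illustration and stand in for the text scraped from each vacancy page.

```python
# Hypothetical stand-in for the scraped pages: each inner list plays the
# role of the <p> texts pulled from one vacancy page.
pages = [
    ["Full-time", "1 High Street", "Leeds"],
    ["Part-time", "2 Station Road", "York"],
]

# Assigning to an index that does not exist yet fails on an empty list.
details = []
try:
    details[0] = []
except IndexError as exc:
    error_message = str(exc)

# Fix: build one inner list per vacancy, then append it to the outer list.
# No index is needed while building; append always adds at the end.
vacancydetails = []
for page in pages:
    row = []
    for text in page:
        row.append(text)
    vacancydetails.append(row)

# Afterwards each vacancy can still be looked up by position.
first_vacancy = vacancydetails[0]
```

With this pattern the enumerate index is only needed if you want the number itself, for example to print progress. The resulting list of lists can then be handed to whatever builds the data frame.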