I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.

I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.

Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list, so that later I can put all these lists in a data frame. I’ve therefore been looking for a way to number or ‘index’ (if that is the right term) each list so that I can refer to them later. Below is the code I have at the moment. Following advice I found in existing answers on Stack Overflow, I’ve tried to use enumerate to create an index which I can assign to each list, as follows:

vacancy_headings = resultspage1_soup.body.findAll("a", class_ ="vacancy-link")

vacancydetails = []

for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk",vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index]=[]

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)

But I get the following error message:

IndexError                                Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>() 
      9     vacancypage_html = vacancypage_client.read()
     10     vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11     vacancydetails[index]=[]
     12 
     13     for p in vacancypage_soup.select("p"):

IndexError: list assignment index out of range

Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?

Thanks!!

  • vacancydetails.append([])? Commented Nov 20, 2018 at 16:03
  • enumerate(), by default, starts from 0; so you don't need to explicitly do start=0. Commented Nov 20, 2018 at 16:04

2 Answers

1

Since vacancydetails is a list, assigning to a position in it that doesn't exist yet raises an IndexError. When you first create it, the list is empty, so it has no position 0 (or any other position) to assign to. Before you can access an element by index, you first need to create that element.

Thus, instead of this:

    vacancydetails[index]=[]

...you want to append a new item to the list (and that new item happens to be an empty list itself), like this:

    vacancydetails.append([])
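
To see the difference in isolation, here is a minimal sketch that uses a throwaway list (the name details and the string "Full-time" are made up for illustration and are not part of the scraper):

```python
# Assigning to an index only works if that position already exists.
details = []

try:
    details[0] = []              # the list is empty, so there is no position 0
except IndexError as err:
    error_message = str(err)     # same message as in the traceback above

# append() grows the list by one element instead of overwriting one.
details.append([])               # details is now [[]]
details[0].append("Full-time")   # position 0 exists now, so indexing works

print(error_message)
print(details)
```

Once an element has been appended, indexing it (as your inner loop does with vacancydetails[index]) works as expected.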

2 Comments

Thanks, that's very helpful! I've got it to work now. :)
My pleasure! If you feel like your problem has been solved by one of the answers, you can help future people who might have the same problem by clicking the checkmark icon next to an answer to mark it as "accepted." (More info here.) If none of the given answers wound up being what fixed your issue, you can add your own answer.
0

The list vacancydetails is empty until you append to it (or assign to it from somewhere else). Because index is counting up from 0, you just want to manipulate the currently-final entry in vacancydetails in the for p loop.

So, rather than vacancydetails[index]=[] you want vacancydetails.append([]). But then the more pythonic thing to do is work with the last entry in vacancydetails, i.e., vacancydetails[-1], in which case you never need the index variable.

for vacancy in vacancy_headings:
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk",vacancy["href"])
    ### ...
    vacancydetails.append([])

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            ### ...
            vacancydetails[-1].append(cells)
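
If you want to try the pattern without hitting the website, here is a rough, self-contained sketch. The sample_pages data is invented for illustration and stands in for the parsed pages: each page is just a list of (itemprop, text) pairs, as if they had already been pulled out of the p tags. The WANTED set is also my own addition; it is one way to shorten the long chain of or comparisons, not something your code requires.

```python
# Hypothetical stand-in for the parsed vacancy pages (not real site data).
sample_pages = [
    [("employmentType", "Apprenticeship"), ("title", "Chef"), ("postalCode", "AB1 2CD")],
    [("streetAddress", "1 High Street"), ("addressLocality", "Leeds")],
]

# A set membership test replaces the chain of `or` comparisons.
WANTED = {"employmentType", "streetAddress", "addressLocality",
          "addressRegion", "postalCode"}

vacancydetails = []
for page in sample_pages:
    vacancydetails.append([])                 # start a new inner list
    for itemprop, text in page:
        if itemprop in WANTED:
            vacancydetails[-1].append(text)   # always the newest inner list

# Equivalent alternative: build each inner list first, then append it.
vacancydetails_alt = []
for page in sample_pages:
    details = [text for itemprop, text in page if itemprop in WANTED]
    vacancydetails_alt.append(details)

print(vacancydetails)
```

Both loops produce the same list of lists, one inner list per vacancy, in the same order as the pages. You can still look up an individual vacancy later with vacancydetails[0], vacancydetails[1], and so on.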

2 Comments

Thanks for your quick response and explanation. The solution you suggest works great!
@SanneVelthuis, great -- please upvote good answers and accept one as correct!
