problem with accessing index from for loop and using it to create a new list

Question

I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.

I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.

Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list, so that later I can put all these lists in a data frame. I’ve therefore been looking for a way to number or ‘index’ (if that is the right term to use) each list so that I can refer to them later. Below is the code I have at the moment. Following the advice I found by reading existing answers on Stack Overflow, I’ve tried to use enumerate to create an index which I can assign to each list, as follows:

vacancy_headings = resultspage1_soup.body.findAll("a", class_ ="vacancy-link")

vacancydetails = []

for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk",vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index]=[]

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)

But I get the following error message:

IndexError                                Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>() 
      9     vacancypage_html = vacancypage_client.read()
     10     vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11     vacancydetails[index]=[]
     12 
     13     for p in vacancypage_soup.select("p"):

IndexError: list assignment index out of range

Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?

Thanks!!

enumerate() starts from 0 by default, so you don't need to pass start=0 explicitly.
– Austin, Commented Nov 20, 2018 at 16:04
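
A quick way to check that default is to compare both calls on a small list. This is only a sketch; the `headings` list is made-up stand-in data, not something from the question.

```python
headings = ["first", "second", "third"]  # hypothetical stand-in data

# enumerate() with no start argument counts from 0.
default_pairs = list(enumerate(headings))
# Passing start=0 explicitly produces the same pairs.
explicit_pairs = list(enumerate(headings, start=0))

print(default_pairs == explicit_pairs)
```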

s3cur3 · Accepted Answer · 2018-11-20 16:13:34Z

1

Since vacancydetails is a list, trying to access a position in the list that doesn't exist is an error. And, when you first create it, the list is empty. So, before accessing any elements from the list, you'll need to first create those elements.

Thus, instead of this:

    vacancydetails[index]=[]

...you want to append a new item to the list (and that new item happens to be an empty list itself), like this:

    vacancydetails.append([])
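
To see the difference in isolation, here is a minimal sketch with no scraping involved. The sample string is invented for illustration; only the name vacancydetails comes from the question.

```python
vacancydetails = []

# Assigning to an index that does not exist yet raises IndexError,
# because an empty list has no position 0 to assign to.
try:
    vacancydetails[0] = []
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)  # list assignment index out of range

# append() grows the list by one element, so index 0 now exists.
vacancydetails.append([])
vacancydetails[0].append("Apprenticeship")  # made-up sample value
print(vacancydetails)  # [['Apprenticeship']]
```

Once the inner list has been appended, indexing it with vacancydetails[index] inside the loop works as the question intended.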

answered Nov 20, 2018 at 16:13


2 Comments

Sanne Velthuis Over a year ago

Thanks, that's very helpful! I've got it to work now. :)

s3cur3 Over a year ago

My pleasure! If you feel like your problem has been solved by one of the answers, you can help future people who might have the same problem by clicking the checkmark icon next to an answer to mark it as "accepted." If none of the given answers wound up being what fixed your issue, you can add your own answer.

Andrew Jaffe · Accepted Answer · 2018-11-20 16:16:45Z

0

The list vacancydetails is empty until you append to it (or assign to it from somewhere else). Because index is counting up from 0, you just want to manipulate the currently-final entry in vacancydetails in the for p loop.

So, rather than vacancydetails[index]=[] you want vacancydetails.append([]). But then the more pythonic thing to do is work with the last entry in vacancydetails, i.e., vacancydetails[-1], in which case you never need the index variable.

for vacancy in vacancy_headings:
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk",vacancy["href"])
    ### ...
    vacancydetails.append([])

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            ### ...
            vacancydetails[-1].append(cells)
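
The snippet above leaves out the download and parsing steps, so here is a self-contained sketch of the same pattern that can be run without a network connection. The `pages` data and the `wanted` set are assumptions made for illustration: each page is faked as a list of (itemprop, text) pairs rather than real BeautifulSoup tags.

```python
# Hypothetical stand-in for the parsed vacancy pages: each page is a list of
# (itemprop, text) pairs instead of real <p> tags from BeautifulSoup.
pages = [
    [("employmentType", "Apprenticeship"),
     ("description", "ignored"),
     ("postalCode", "AB1 2CD")],
    [("addressLocality", "Leeds")],
]

# The itemprop values the question is interested in.
wanted = {"employmentType", "streetAddress", "addressLocality",
          "addressRegion", "postalCode"}

vacancydetails = []
for page in pages:
    vacancydetails.append([])  # one new inner list per vacancy
    for itemprop, text in page:
        if itemprop in wanted:
            # [-1] always refers to the inner list that was just appended.
            vacancydetails[-1].append(text)

print(vacancydetails)  # [['Apprenticeship', 'AB1 2CD'], ['Leeds']]
```

An equivalent variant, if the [-1] indexing feels obscure, is to build a local list inside the outer loop and append it to vacancydetails once the inner loop has finished.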

edited Nov 20, 2018 at 16:16

answered Nov 20, 2018 at 16:11


2 Comments

Sanne Velthuis Over a year ago

Thanks for your quick response and explanation. The solution you suggest works great!

Andrew Jaffe Over a year ago

@SanneVelthuis, great -- please upvote good answers and accept one as correct!
