
I am new to Python and would like to learn web scraping with it. My first project is the German yellow pages (gelbeseiten.de).

When executing my code, I get the following IndexError after scraping 12 pages:

Traceback (most recent call last):
  File "C:/Users/Zorro/PycharmProjects/scraping/venv/Lib/site-packages/pip-19.0.3-py3.6.egg/pip/_vendor/pytoml/test.py", line 25, in <module>
    city = city_container[0].text.strip()
IndexError: list index out of range

Process finished with exit code 1

I would like to know how I can skip this error so that Python does not stop scraping.

I tried to use try/except blocks, but did not succeed.

from bs4 import BeautifulSoup as soup
import requests


page_title = "/Seite-"
page_number = 1

for i in range(25):

    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"

    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")

    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:
        name_container = container.findAll("div", {"class": "h2"})
        name = name_container[0].text.strip()

        street_container = container.findAll("span", {"itemprop": "streetAddress"})
        street = street_container[0].text.strip()

        city_container = container.findAll("span", {"itemprop": "addressLocality"})
        city = city_container[0].text.strip()

        plz_container = container.findAll("span", {"itemprop": "postalCode"})
        plz_name = plz_container[0].text.strip()

        tele_container = container.findAll("li", {"class": "phone"})
        tele = tele_container[0].text.strip()

        print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
        print()

    page_number += 1

1 Answer


The formatting seems to have suffered a little when the code was posted. Two things will help here:

1) When web scraping, it is usually advisable to add some downtime between consecutive requests so you don't get kicked off the server and don't tie up too many resources. I added time.sleep(5) between page requests to wait five seconds before loading the next page.

2) For me, try/except worked just fine once pass is added to the except block. Of course, you can be more sophisticated in handling the exceptions.

from bs4 import BeautifulSoup as soup
import requests
import time


page_title = "/Seite-"
page_number = 1

for i in range(25):
    print(page_number)
    time.sleep(5)
    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"

    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")

    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:

        try:
            name_container = container.findAll("div", {"class": "h2"})
            name = name_container[0].text.strip()

            street_container = container.findAll("span", {"itemprop": "streetAddress"})
            street = street_container[0].text.strip()

            city_container = container.findAll("span", {"itemprop": "addressLocality"})
            city = city_container[0].text.strip()

            plz_container = container.findAll("span", {"itemprop": "postalCode"})
            plz_name = plz_container[0].text.strip()

            tele_container = container.findAll("li", {"class": "phone"})
            tele = tele_container[0].text.strip()

            print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
            print()

        except:
            pass

    page_number += 1
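If you want to be more targeted than a bare except, one option is to build on the fact that BeautifulSoup's find() returns None when nothing matches, so you can supply a fallback value per field instead of skipping the whole listing. Below is a small sketch of that idea; the helper name get_text and the fallback string "n/a" are my own choices, not part of the original code, and the sample HTML just mimics the structure the selectors above expect:

```python
from bs4 import BeautifulSoup

def get_text(container, tag, attrs, fallback="n/a"):
    # find() returns None (not an empty list) when no tag matches,
    # so a missing field never raises IndexError.
    match = container.find(tag, attrs)
    return match.text.strip() if match else fallback

# Minimal sample HTML imitating one result container; the postalCode
# span is deliberately missing to show the fallback in action.
html = """
<div class="table">
  <div class="h2"> Trattoria Example </div>
  <span itemprop="streetAddress">Musterstr. 1</span>
</div>
"""
container = BeautifulSoup(html, "html.parser").find("div", {"class": "table"})

print(get_text(container, "div", {"class": "h2"}))              # Trattoria Example
print(get_text(container, "span", {"itemprop": "postalCode"}))  # n/a
```

This way a listing with a missing phone number or postal code still gets printed with a placeholder, rather than being silently dropped by except: pass.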