
I would like to scrape the following website using Python and export the scraped data to a CSV file:

http://www.swisswine.ch/en/producer?search=&&

This website has 154 pages of results for this search. I need to request every page and scrape its data, but my script doesn't move on to the next pages; it only scrapes data from one page.

Here I set the condition i < 153, but the script only gave me the data of the 154th page (10 records). I need data from the 1st to the 154th page.

How can I scrape the data from all pages in a single run of the script, and how can I export that data as a CSV file?

My script is as follows:

import csv
import requests
from bs4 import BeautifulSoup
i = 0
while i < 153:       
     url = ("http://www.swisswine.ch/en/producer?search=&&&page=" + str(i))
     r = requests.get(url)
     i=+1
     r.content

soup = BeautifulSoup(r.content)
print (soup.prettify())


g_data = soup.find_all("ul", {"class": "contact-information"})
for item in g_data:
      print(item.text)
  • The lines that scrape the data, from soup = ... down, should be inside the loop. Otherwise you finish the loop first and are only getting the data of the last page. Commented Jul 24, 2016 at 14:54
  • @vishnu It is good to use BeautifulSoup, but if you want the whole job to be managed well, you should go for doc.scrapy.org/en/latest/intro/tutorial.html Commented Jul 24, 2016 at 14:58

1 Answer


You should put your HTML parsing code under the loop as well. And you are not incrementing the i variable correctly (thanks @MattDMo):

import csv
import requests
from bs4 import BeautifulSoup

i = 0
while i < 154:  # pages 0 through 153 cover all 154 result pages
    url = "http://www.swisswine.ch/en/producer?search=&&&page=" + str(i)
    r = requests.get(url)
    i += 1

    soup = BeautifulSoup(r.content)
    print(soup.prettify())

    g_data = soup.find_all("ul", {"class": "contact-information"})
    for item in g_data:
        print(item.text)
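The question also asks how to write the scraped data out as CSV. A minimal sketch of that step, assuming each ul.contact-information element becomes one single-column row (the producers.csv filename and the header name are assumptions, not anything the site dictates):

```python
import csv
from bs4 import BeautifulSoup

def parse_page(html):
    """Return one single-column row per <ul class="contact-information"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [[item.get_text(" ", strip=True)]   # join the list items with spaces
            for item in soup.find_all("ul", {"class": "contact-information"})]

def write_csv(rows, path):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["contact_information"])  # header name is an assumption
        writer.writerows(rows)
```

In the loop above you would accumulate rows += parse_page(r.content) for every page, then call write_csv(rows, "producers.csv") once after the loop finishes.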

I would also improve the following:

  • use requests.Session() to maintain a web-scraping session, which will also bring a performance boost:

    if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase

  • be explicit about an underlying parser for BeautifulSoup:

    soup = BeautifulSoup(r.content, "html.parser")  # or "lxml", or "html5lib"
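A short sketch of the Session suggestion — all 154 requests go through one session object, so the underlying TCP connection to the host can be reused (the zero-based page parameter is carried over from the code above):

```python
import requests
from bs4 import BeautifulSoup

BASE = "http://www.swisswine.ch/en/producer?search=&&&page="

with requests.Session() as session:   # connection pool lives for the whole loop
    for i in range(154):              # pages 0 through 153
        r = session.get(BASE + str(i))
        soup = BeautifulSoup(r.content, "html.parser")
        for item in soup.find_all("ul", {"class": "contact-information"}):
            print(item.get_text(" ", strip=True))
```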
    

2 Comments

You missed one small detail - in the while loop, i is incremented as i =+1. It should be i += 1.
@MattDMo ah, I felt something wrong about that but lacking morning coffee. Good catch! Thanks.
