Web scraping with Python [closed]

Question

Closed. This question is seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

Closed 6 years ago.

I'd like to grab daily sunrise/sunset times from a website. Is it possible to scrape web content with Python? Which modules are used? Is there a tutorial available?

Python has several options for web scraping. I enumerated some of the options here in response to a similar question.
– filippo, Commented Jan 17, 2010 at 18:21
Why not just use the built-in HTML parser in the Python standard library? Certainly for a task so simple and infrequent (just once a day), I see little reason to search for any other tools. docs.python.org/2.7/library/htmlparser.html
– ArtOfWarfare, Commented Jul 20, 2015 at 20:31
Hope this post is useful to somebody. It is a good tutorial for a beginner: samranga.blogspot.com/2015/08/web-scraping-beginner-python.html It uses the Beautiful Soup library for web scraping with Python.
– Samitha Chathuranga, Commented Aug 25, 2015 at 17:19
For future readers, you may want to have a look at this answer as well, which provides two different approaches to web scraping, using (1) Selenium and (2) BeautifulSoup with Requests.
– Chris, Commented Feb 6, 2022 at 8:24
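
The standard-library approach mentioned in the comments can be sketched roughly as follows. In Python 3 the module is html.parser. The TableCellParser class and the sample markup are illustrative assumptions and are not taken from any real site.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup; a real page would be downloaded first.
SAMPLE_HTML = """
<table class="spad"><tbody>
  <tr><td>1 Jan</td><td>08:39</td><td>16:08</td></tr>
  <tr><td>2 Jan</td><td>08:39</td><td>16:09</td></tr>
</tbody></table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_HTML)
for day, sunrise, sunset in parser.rows:
    print(day, sunrise, sunset)
```

This keeps the script dependency-free, at the cost of tracking the row and cell state by hand.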

Lesmana · Accepted Answer · 2016-01-22 08:51:37Z

198

Use urllib2 in combination with the brilliant BeautifulSoup library:

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise
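
The snippet above targets Python 2. A rough Python 3 equivalent, assuming BeautifulSoup 4 is installed, might look like the sketch below. The function names fetch_html and sunrise_rows and the sample markup are assumptions made for illustration; the 'spad' table class comes from the answer.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def sunrise_rows(html):
    """Return (date, sunrise) pairs from the first table with class 'spad'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="spad")
    if table is None:
        return []
    pairs = []
    for row in table.find_all("tr"):
        tds = row.find_all("td")
        if len(tds) >= 2:  # skip header or malformed rows
            pairs.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))
    return pairs


# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<table class="spad"><tbody>
  <tr><td>1 Jan</td><td>08:39</td><td>16:08</td></tr>
  <tr><td>2 Jan</td><td>08:39</td><td>16:09</td></tr>
</tbody></table>
"""

for date, sunrise in sunrise_rows(SAMPLE_HTML):
    print(date, sunrise)

# Against a live page: sunrise_rows(fetch_html("http://example.com"))
```

Passing "html.parser" explicitly avoids depending on whichever parser happens to be installed.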

edited Jan 22, 2016 at 8:51

Lesmana

answered Jan 17, 2010 at 18:08

user235064

9 Comments

Chiara Coetzee Over a year ago

Small comment: this can be slightly simplified using the requests package by replacing line 6 with: soup = BeautifulSoup(requests.get('example.com').text)

user235064 Over a year ago

Thanks for the tip. The requests package did not yet exist when I wrote the snippet above ;-)

kmote Over a year ago

@DerrickCoetzee - your simplification raises a MissingSchema error (at least on my installation). This works: soup = BeautifulSoup(requests.get('http://example.com').text)

Chiara Coetzee Over a year ago

@kmote: that was what I typed but I forgot the backticks around the code and it converted it into a link. Thanks!

themefield Over a year ago

Note that urllib2 does not exist in Python 3 (see another post).

Morse · Accepted Answer · 2018-04-20 13:59:00Z

66

I'd really recommend Scrapy.

Quote from a deleted answer:

Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).

Scrapy has better and faster support for parsing (X)HTML on top of libxml2.

Scrapy is a mature framework with full Unicode support; it handles redirections, gzipped responses, odd encodings, an integrated HTTP cache, etc.

Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to CSV or JSON.

edited Apr 20, 2018 at 13:59

Morse

answered Dec 22, 2011 at 11:12

Sjaak Trekhaak

4 Comments

Sjaak Trekhaak Over a year ago

I didn't notice this question was already 2 years old, still feel that Scrapy should be named here in case someone else is having the same question.

user1244215 Over a year ago

Scrapy is a framework, and therefore is horrible and thinks it's more important than your project. It's a framework because of the horrible (unnecessary) limitations of Twisted.

Blender Over a year ago

@user1244215: It's a framework because frameworks are nice. If you don't want to use it as a framework, there's nothing stopping you from jamming all of your code into one file.

user636044 Over a year ago

But it does not support Python 3.x.

Morse · Accepted Answer · 2018-04-19 17:37:25Z

18

I collected scripts from my web scraping work into this Bitbucket library.

Example script for your case:

from webscraping import download, xpath
D = download.Download()

html = D.get('http://example.com')
for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
    cols = xpath.search(row, '/td')
    print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])

Output:

Sunrise: 08:39, Sunset: 16:08
Sunrise: 08:39, Sunset: 16:09
Sunrise: 08:39, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:11
Sunrise: 08:40, Sunset: 16:12
Sunrise: 08:40, Sunset: 16:13

edited Apr 19, 2018 at 17:37

Morse

answered Dec 22, 2011 at 7:46

hoju

Community · Accepted Answer · 2014-04-15 09:20:58Z

11

I would strongly suggest checking out pyquery. It uses jQuery-like (i.e., CSS-like) syntax, which makes things really easy for those coming from that background.

For your case, it would be something like:

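from pyquery import PyQuery

# Fetch the page and select every row of the table with class "spad"
html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')

for tr in trs:
    tds = tr.getchildren()  # lxml elements for the cells in this row
    print(tds[1].text, tds[2].text)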

Output:

5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM

edited Apr 15, 2014 at 9:20

CommunityBot

answered May 21, 2013 at 4:09

scottmrogowski

Shog9 · Accepted Answer · 2014-04-15 22:39:47Z

7

You can use urllib2 to make the HTTP requests, and then you'll have web content.

You can get it like this:

import urllib2
response = urllib2.urlopen('http://example.com')
html = response.read()

Beautiful Soup is a Python HTML parser that is supposed to be good for screen scraping.

In particular, here is their tutorial on parsing an HTML document.

Good luck!

edited Apr 15, 2014 at 22:39

Shog9

answered Jan 17, 2010 at 16:13

danben

1 Comment

andrew pate Over a year ago

It might be an idea to set a maximum on the bytes read, e.g. response.read(100000000), so that URLs pointing at ISO images don't fill up your RAM. Happy mining.
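
One way to apply that byte limit is sketched below. The helper name read_capped and the default limit are assumptions; io.BytesIO stands in for the response object so that the example runs without network access.

```python
import io


def read_capped(response, max_bytes=10_000_000):
    """Read at most max_bytes from a file-like response, or raise ValueError."""
    # Ask for one extra byte so an oversized body can be told apart
    # from one that is exactly max_bytes long.
    data = response.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("response larger than %d bytes" % max_bytes)
    return data


# io.BytesIO plays the role of the object returned by urlopen().
small = read_capped(io.BytesIO(b"<html></html>"), max_bytes=100)
print(len(small))
```

Raising an error rather than silently truncating makes it obvious when a URL points at something much larger than an HTML page.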

Community · Accepted Answer · 2014-04-15 09:20:33Z

I use a combination of Scrapemark (finding URLs, Python 2) and httplib2 (downloading images, Python 2 and 3). scrapemark.py has 500 lines of code but uses regular expressions, so it may not be very fast; I did not test this.

Example for scraping your website:

import sys
from pprint import pprint
from scrapemark import scrape

pprint(scrape("""
    <table class="spad">
        <tbody>
            {*
                <tr>
                    <td>{{[].day}}</td>
                    <td>{{[].sunrise}}</td>
                    <td>{{[].sunset}}</td>
                    {# ... #}
                </tr>
            *}
        </tbody>
    </table>
""", url=sys.argv[1] ))

Usage:

python2 sunscraper.py http://www.example.com/

Result:

[{'day': u'1. Dez 2012', 'sunrise': u'08:18', 'sunset': u'16:10'},
 {'day': u'2. Dez 2012', 'sunrise': u'08:19', 'sunset': u'16:10'},
 {'day': u'3. Dez 2012', 'sunrise': u'08:21', 'sunset': u'16:09'},
 {'day': u'4. Dez 2012', 'sunrise': u'08:22', 'sunset': u'16:09'},
 {'day': u'5. Dez 2012', 'sunrise': u'08:23', 'sunset': u'16:08'},
 {'day': u'6. Dez 2012', 'sunrise': u'08:25', 'sunset': u'16:08'},
 {'day': u'7. Dez 2012', 'sunrise': u'08:26', 'sunset': u'16:07'}]

Umair Ayub · Accepted Answer · 2015-02-08 13:52:29Z

Make your life easier by using CSS selectors

I know I have come late to the party, but I have a nice suggestion for you.

BeautifulSoup has already been suggested; I would rather use CSS selectors to scrape data inside the HTML.

import random
import time

import requests
from bs4 import BeautifulSoup

main_url = "http://www.example.com"

# Assumed placeholder values: the original snippet used `header` and
# `timeout_time` without defining them.
header = [{"User-Agent": "Mozilla/5.0"}]  # pool of request headers to pick from
timeout_time = 10  # seconds


# Fetch a URL; if the request fails, wait 20 seconds and then try again.
# (I use it because my internet connection sometimes disconnects.)
def tryAgain(passed_url):
    while True:
        try:
            return requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        except requests.exceptions.RequestException:
            print("Trying again the URL:")
            print(passed_url)
            time.sleep(20)


main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Scrape all TDs from TRs inside the table
for tr in main_page_soup.select("table.class_of_table"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside the TD
        print(td.select("a")[0].text)
        # Value of the href attribute
        print(td.select("a")[0]["href"])
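
The retry helper above keeps looping for as long as the site is unreachable. A bounded variant with a growing delay is sketched below as one possible alternative. The name fetch_with_retries, its parameters, and the flaky_fetch stand-in are assumptions for illustration; the sleep function is injectable so the demo runs instantly.

```python
import time


def fetch_with_retries(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to the client's error type in real use
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print("attempt %d failed (%s); retrying in %.0fs" % (attempt + 1, error, delay))
            sleep(delay)


# Demo with a hypothetical fetch function that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated dropped connection")
    return "<html>ok</html>"


delays = []  # records the requested delays instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "http://example.com", sleep=delays.append)
print(page)
```

With requests, the fetch argument could be something like lambda u: requests.get(u, timeout=10).text.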

SIM · Accepted Answer · 2017-08-19 16:37:07Z

1

To get the names of items in a specific category, select that category's class name with a CSS selector:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
for link in soup.select('div._2kSfQ4'):
    print(link.text)

Partial output:

Puma, USPA, Adidas & moreUp to 70% OffMen's Shoes
Shirts, T-Shirts...Under ₹599For Men
Nike, UCB, Adidas & moreUnder ₹999Men's Sandals, Slippers
Philips & moreStarting ₹99LED Bulbs & Emergency Lights

edited Aug 19, 2017 at 16:37

answered Apr 30, 2017 at 15:22

SIM

Atul Chavan · Accepted Answer · 2017-03-21 15:01:19Z

0

Here is a simple web crawler. I used BeautifulSoup to search for all the links (anchors) whose class name is _3NFO0d. I used Flipkart.com, an online retail store.

import requests
from bs4 import BeautifulSoup


def crawl_flipkart():
    url = 'https://www.flipkart.com/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    # find_all is the bs4 spelling of findAll; match anchors by class name
    for link in soup.find_all('a', {'class': '_3NFO0d'}):
        href = link.get('href')
        print(href)


crawl_flipkart()

answered Mar 21, 2017 at 15:01

Atul Chavan

Chris D'mello · Accepted Answer · 2018-10-22 02:05:39Z

0

Python has good options for scraping the web. The best framework-based one is Scrapy. It can be a little tricky for beginners, so here is a little help.
1. Install Python 3.5 or later (older versions down to 2.7 will also work).
2. Create an environment in conda (this is what I did).
3. Install Scrapy at a location and run it from there.
4. scrapy shell gives you an interactive interface to test your code.
5. scrapy startproject projectname creates the project skeleton.
6. scrapy genspider spidername, followed by the site's domain, creates a spider. You can create as many spiders as you want. Make sure you are inside the project directory when you do this.

The easier option is to use requests and Beautiful Soup. Before starting, spend an hour going through the documentation; it will resolve most of your doubts. BS4 offers a wide range of parsers to choose from. Set a User-Agent header and sleep between requests to make scraping easier. BS4 queries such as select() return a list of Tag objects, so use variable[0] to get a single one. If the page is built by JavaScript, you won't be able to scrape it using requests and BS4 directly. You could find the API link and parse the JSON to get the information you need, or try Selenium.
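
The JSON route mentioned above can be sketched as follows. The response body, its field names, and the sun_times helper are all hypothetical; a real endpoint would define its own structure, which you would find by inspecting the site's network traffic.

```python
import json

# Hypothetical response body from a JSON endpoint; the field names are assumptions.
SAMPLE_BODY = """
{"location": "Example City",
 "days": [
   {"date": "2012-12-01", "sunrise": "08:18", "sunset": "16:10"},
   {"date": "2012-12-02", "sunrise": "08:19", "sunset": "16:10"}
 ]}
"""


def sun_times(body):
    """Return (date, sunrise, sunset) tuples from the JSON text."""
    payload = json.loads(body)
    return [(day["date"], day["sunrise"], day["sunset"]) for day in payload["days"]]


for date, sunrise, sunset in sun_times(SAMPLE_BODY):
    print(date, sunrise, sunset)
```

With requests, calling .json() on the response returns the parsed object directly, so the json.loads step can be skipped.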

answered Oct 22, 2018 at 2:05

Chris D'mello

1 Comment

tripleee Over a year ago

Whether or not you use Anaconda is completely irrelevant here. Creating a virtual environment is basically always a good idea, but you don't need conda for that.
