I need to crawl several thousand subsites and extract information.

Unfortunately, the information in question is not regular HTML text but an image with the text rendered onto it dynamically.

How can I extract these images so that I can process them further? I'm using Selenium WebDriver with Python.

  • Any reasons for not using mechanize, requests or urllib2? Commented Sep 2, 2013 at 9:20
  • Yes, the site requires a headless browser to be used. Commented Sep 2, 2013 at 11:13

1 Answer

There are very few things that you cannot do with mechanize plus BeautifulSoup. The images can then be processed further with pytesser; however, I have no experience with it. It would be interesting to get advice from someone knowledgeable about OCR in Python.

import mechanize
from bs4 import BeautifulSoup

browser = mechanize.Browser()
html = browser.open("http://www.dreamstime.com/free-photos").read()
soup = BeautifulSoup(html, "html.parser")
for ii, image in enumerate(soup.find_all('img')):
    # Some <img> tags have no src attribute, so fall back to an empty string.
    _src = image.get('src', '')
    # Only fetch absolute URLs that point to JPEG files.
    if _src.startswith('http://') and _src.endswith('.jpg'):
        print('Storing this image:', _src)
        data = browser.open(_src).read()
        fl = 'image' + str(ii) + '.jpg'
        # The with block closes the file automatically.
        with open(fl, 'wb') as f:
            f.write(data)
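
For the OCR step mentioned above, here is a rough sketch only. It assumes pytesseract (a different Tesseract wrapper from pytesser, swapped in here for illustration) together with Pillow and an installed tesseract binary. The helper names `normalize_ocr_text` and `ocr_image` are hypothetical.

```python
def normalize_ocr_text(raw):
    """Collapse the whitespace runs and line breaks that OCR output tends to contain."""
    return " ".join(raw.split())


def ocr_image(path):
    """Run Tesseract on one saved image file and return the cleaned-up text."""
    # Assumption: pytesseract, Pillow and the tesseract binary are installed.
    # Imported here so normalize_ocr_text stays usable without them.
    from PIL import Image
    import pytesseract

    return normalize_ocr_text(pytesseract.image_to_string(Image.open(path)))
```

Something like `ocr_image('image0.jpg')` could then be called on each file the loop stores. Recognition quality will depend heavily on how the text is rendered in the image.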

2 Comments

Unfortunately, requesting the image src doesn't work - the site returns "Image can not be displayed" for every request I make. That's why I need to extract the images AFTER they've been loaded in Selenium. The OCR part is already functional (yay for pytesser!) but I need these images. The OCR doesn't work with full-page screenshots.
Here's a screenshot: i.imgur.com/zLT22Tz.png. I selected the address at the top; I need to get that text somehow, and full-page OCR is almost unusable.
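
One possible way to get at the images after Selenium has rendered them is to take a single screenshot and crop each `<img>` element out of it using the element's reported position and size. The sketch below is untested against the site in question and assumes Pillow is installed. The helper names `crop_box` and `save_rendered_images`, the `prefix` argument and the `scale` factor are all hypothetical. It also assumes the screenshot and the element coordinates share the same origin, which may not hold if the browser captures only the visible viewport and the page is scrolled.

```python
import io


def crop_box(location, size, scale=1.0):
    """Convert Selenium's location/size dicts into a (left, top, right, bottom) box.

    `scale` is an assumption for high-DPI displays, where screenshot pixels
    may not match CSS pixels; leave it at 1.0 otherwise.
    """
    left = int(location["x"] * scale)
    top = int(location["y"] * scale)
    right = int((location["x"] + size["width"]) * scale)
    bottom = int((location["y"] + size["height"]) * scale)
    return (left, top, right, bottom)


def save_rendered_images(driver, prefix="element"):
    """Crop every <img> on the current page out of one screenshot and save it."""
    # Imported here so crop_box stays usable without Pillow or Selenium installed.
    from PIL import Image
    from selenium.webdriver.common.by import By

    page = Image.open(io.BytesIO(driver.get_screenshot_as_png()))
    paths = []
    for i, element in enumerate(driver.find_elements(By.TAG_NAME, "img")):
        if element.size["width"] == 0 or element.size["height"] == 0:
            continue  # skip invisible or zero-sized images
        path = "%s%d.png" % (prefix, i)
        page.crop(crop_box(element.location, element.size)).save(path)
        paths.append(path)
    return paths
```

After `driver.get(...)` has loaded a page, calling `save_rendered_images(driver)` should leave one PNG per visible image, and each file can then be passed to the OCR step instead of the full-page screenshot. Newer Selenium releases also offer `element.screenshot(path)`, which may make the manual cropping unnecessary.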