I want to build a web scraper. I'm currently learning Python, so this is very basic!

Python code:

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://basketball.realgm.com/")

htmltext = htmlfile.read()
title = re.findall('<title>(.*)</title>', htmltext)

print (htmltext)

Error:

  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
2 Answers

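You have to decode your data. In Python 3, read() returns bytes, while your regex pattern is a str, and re refuses to mix the two. The website in question declares

charset=iso-8859-1

so use that encoding. utf-8 won't work in this case.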

htmltext = htmlfile.read().decode('iso-8859-1')

2 Comments

This worked, but I'm still confused about why we had to add decode('iso-8859-1'). Are there sites that wouldn't require that addition?
@Jtwa Check the source code of the site you are trying to scrape for charset=.... For the site in your question, the charset is iso-8859-1. If none is given, your best bet would usually be utf-8.
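
To make the charset lookup concrete, here is a minimal sketch. decode_html and fetch_text are hypothetical helper names, not part of urllib. The sketch assumes the charset shows up either in the Content-Type response header or in a charset=... declaration near the top of the page, and it falls back to utf-8 otherwise.

```python
import re
import urllib.request
from typing import Optional


def decode_html(raw: bytes, header_charset: Optional[str] = None) -> str:
    """Hypothetical helper: decode page bytes using the declared charset."""
    charset = header_charset
    if charset is None:
        # Assumption: the declaration sits in the first 2 KB of the markup.
        match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048])
        if match:
            charset = match.group(1).decode('ascii')
    if charset is None:
        charset = 'utf-8'  # best guess when nothing is declared
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or wrong declaration: decode leniently instead.
        return raw.decode('utf-8', errors='replace')


def fetch_text(url: str) -> str:
    """Hypothetical helper: download a page and return it as str."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the header names no charset.
        header_charset = response.headers.get_content_charset()
        return decode_html(response.read(), header_charset)


# Offline demonstration with made-up sample bytes (no network access needed).
sample = b'<meta charset="iso-8859-1"><title>Caf\xe9</title>'
print(re.findall('<title>(.*)</title>', decode_html(sample)))
```

With a real page you would call fetch_text("http://basketball.realgm.com/") and run the regex on the returned string. Trying the header first and the markup second is only one reasonable order; a page can declare one charset and be served in another, which is why the sketch ends with a lenient fallback.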

Use a bytes literal as the pattern:

title = re.findall(b'<title>(.*)</title>', htmltext)

or decode the retrieved data to a string:

title = re.findall('<title>(.*)</title>', htmltext.decode('utf-8'))

(replace utf-8 with the document's actual encoding)
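
Here is a small offline sketch of both options side by side. The bytes in raw are made-up sample data standing in for what read() returns, and iso-8859-1 is assumed as the encoding to match the site in the question.

```python
import re

# Made-up sample standing in for htmlfile.read(); \xe9 is "é" in iso-8859-1.
raw = b'<html><head><title>Caf\xe9 Scores</title></head></html>'

# Mixing a str pattern with bytes data reproduces the error from the question.
mismatch_raised = False
try:
    re.findall('<title>(.*)</title>', raw)
except TypeError:
    mismatch_raised = True

# Option 1: bytes pattern on bytes data; the matches are bytes objects.
titles_bytes = re.findall(b'<title>(.*)</title>', raw)

# Option 2: decode first, then use a str pattern; the matches are str objects.
titles_text = re.findall('<title>(.*)</title>', raw.decode('iso-8859-1'))

print(mismatch_raised)
print(titles_bytes)
print(titles_text)
```

The bytes version avoids guessing an encoding, but you still have to decode the matches before treating them as text, so decoding once up front is usually the simpler choice.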
