So, I have this code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

But then Python raises this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

  • Which version of Python are you running? Commented Mar 3, 2011 at 17:52

6 Answers

TypeError: can't use a string pattern on a bytes-like object

what did i do wrong??

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object

(ps:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)
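For what it's worth, the "don't use regexps on HTML" advice can also be followed without installing anything. This is only a rough sketch using the standard library's html.parser rather than BeautifulSoup or lxml; the LinkCollector class and the sample bytes are made up for illustration and stand in for whatever m.read() returns.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample standing in for m.read(); real pages come back as bytes too
sample = b'<p><a href="http://example.com/a">A</a> <a class="x" href=\'/b\'>B</a></p>'

collector = LinkCollector()
collector.feed(sample.decode("utf-8"))  # the parser wants str, so decode first
print(collector.links)
```

Unlike the regex, this does not care whether href is the first attribute or which quote style the page uses.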

1 Comment

Will it break with python2?

If you are running Python 2.6 then there isn't any "request" in "urllib". So the third line becomes:

m = urllib.urlopen(url) 

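And in version 3 you could use this:

links = linkregex.findall(str(msg))

This avoids the error because 'msg' is a bytes object and not a string, which is what findall() expects when the pattern is a string. Note that str(msg) gives the repr of the bytes (text of the form b'...' with newlines and non-ASCII bytes shown as escapes), so it is better to decode using the correct encoding. For instance, if "latin1" is the encoding then: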

links = linkregex.findall(msg.decode("latin1"))
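To make the difference between the two calls concrete, here is a small sketch that runs offline. The bytes literal is a made-up stand-in for m.read(), and "latin1" is assumed to be the right encoding, as in the example above.

```python
import re

# Hypothetical stand-in for m.read(): bytes with a newline and a non-ASCII byte
msg = b'<a href="/caf\xe9">\nmenu</a>'
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

as_repr = str(msg)              # the repr of the bytes, starting with b'
as_text = msg.decode("latin1")  # real text, assuming latin1 is the right encoding

print(as_repr)
print(linkregex.findall(as_repr))  # the link contains a literal backslash escape
print(linkregex.findall(as_text))  # the link contains the decoded character
```

Both calls get past the TypeError, but only the decoded version gives back links you can actually use.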

2 Comments

He says in the comments that he's running 3.1.3, so there is a request.
Indeed, saw that afterwards. So I added the solution for version 3 as well.

Well, my version of Python doesn't have a urllib with a request attribute but if I use "urllib.urlopen(url)" I don't get back a string, I get an object. This is the type error.

8 Comments

Here is the link to docs backing this up: docs.python.org/library/urllib.html#urllib.urlopen
Those are docs for 2.7. The OP says in the comments that he's using 3.1.3.
John, read the docs. The API is still the same.
My point is, your version doesn't have the request attribute, but the OP's version does. You are correct on the cause of the type error.
Yeah, the version was mentioned after I put my answer up. ;)

The URL you have for Google didn't work for me, so I substituted http://www.google.com/ig?hl=en, which does.

Try this:

import re
import urllib.request

url = "http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()  # bytes, not str
# str(msg) turns the bytes into their repr so the string pattern accepts it
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

1 Comment

This only works if your system Python default encoding is the same as the web page's encoding.
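One way around guessing the encoding is to read it from the Content-Type header. The sketch below works on a header value passed in as a plain string so it runs without a network; decode_body and its fallback parameter are hypothetical names. With a live response, m.headers.get_content_charset() should give the same information, as far as I know.

```python
from email.message import Message


def decode_body(body, content_type, fallback="utf-8"):
    """Decode body using the charset named in a Content-Type header value."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset
    charset = header.get_content_charset() or fallback
    return body.decode(charset, errors="replace")


print(decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1"))
print(decode_body(b"caf\xc3\xa9", "text/html"))
```

Falling back to UTF-8 is an assumption, not a rule. A page can also declare its encoding in a meta tag, which this sketch ignores.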

The regular expression pattern and string have to be of the same type. If you're matching a regular string, you need a string pattern. If you're matching a byte string, you need a bytes pattern.

In this case m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are unicode strings, and you need the b modifier to specify a byte string literal:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
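The type rule is easy to check without a network connection. A minimal sketch, with a made-up text sample in place of the downloaded page and raw literals (r'...' and rb'...') so the backslashes reach the regex engine untouched:

```python
import re

text = '<a href="/x">x</a>'
data = text.encode("ascii")  # the same content as bytes, like m.read() returns

str_pattern = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
bytes_pattern = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

print(str_pattern.findall(text))    # str pattern on str
print(bytes_pattern.findall(data))  # bytes pattern on bytes; results are bytes

try:
    str_pattern.findall(data)       # mixing the two types raises TypeError
except TypeError as exc:
    error = str(exc)
    print(error)
```

Note that a bytes pattern returns bytes matches, so you still need to decode them if you want strings afterwards.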


This worked for me in Python 3. Hope this helps.

import urllib.request
import re

urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    # str() turns the bytes into their repr so the string pattern can be applied
    titles = re.search(pattern, str(htmltext))
    print(titles)

This also works. Here I added b before the regex to make it a bytes pattern.

import urllib.request
import re

urls = ["https://google.com", "https://nytimes.com", "http://CNN.com"]
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

for url in urls:
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    # bytes pattern on bytes data, so no conversion is needed
    titles = re.search(pattern, htmltext)
    print(titles)
