Python TypeError on regex [duplicate]

Question

So, I have this code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

But then python returns this error:

links = linkregex.findall(msg)
TypeError: can't use a string pattern on a bytes-like object

What did I do wrong?

Which version of Python are you running?

– Morten Kristensen, Commented Mar 3, 2011 at 17:52

Community · Accepted Answer · 2020-06-20 09:12:55Z

70

TypeError: can't use a string pattern on a bytes-like object

what did i do wrong??

You used a string pattern on a bytes object. Use a bytes pattern instead:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object
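
For anyone who wants to try this without hitting the network, here is a minimal sketch. The `msg` value is a made-up stand-in for what `m.read()` returns, and I've written the pattern as a raw bytes literal (`rb'...'`) so that `\s` reaches the regex engine untouched. I also dropped the `|` from the character class, since inside `[...]` it matches a literal pipe rather than meaning "or".

```python
import re

# Hypothetical stand-in for the bytes that m.read() would return.
msg = b'<html><a href="http://example.com/a">A</a> <a href=\'/b\'>B</a></html>'

# Raw bytes literal: the pattern is bytes, so it can be used on bytes data.
linkregex = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

# The matches come back as bytes objects too, not str.
links = linkregex.findall(msg)
print(links)
```

Note that the results are bytes; call `.decode(...)` on them if you need text afterwards.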

(ps:

 >>> from disclaimer include dont_use_regexp_on_html
 "Use BeautifulSoup or lxml instead."

)
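
If installing BeautifulSoup or lxml is not an option, the standard library's `html.parser` can do the same job. This is a rough sketch using that module instead of the two libraries named above; the `LinkCollector` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample: href is not the first attribute, and one tag is uppercase.
html = '<p><a class="x" href="http://example.com/a">A</a> <A HREF=\'/b\'>B</A></p>'

collector = LinkCollector()
collector.feed(html)  # feed() takes str, so decode downloaded bytes first
print(collector.links)
```

Both links are found here, whereas the regex from the question would miss them (it expects `href` right after `<a` and is case-sensitive).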

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Mar 3, 2011 at 19:23

Lennart Regebro

173k45 gold badges230 silver badges254 bronze badges

1 Comment

Dilawar Over a year ago

Will it break with python2?

Morten Kristensen · Accepted Answer · 2011-03-03 18:00:37Z

3

If you are running Python 2.6, "urllib" has no "request" submodule, so the third line becomes:

m = urllib.urlopen(url)

And in version 3 you should use this:

links = linkregex.findall(str(msg))

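Because 'msg' is a bytes object, while a string pattern needs a str. Note that str(msg) on a bytes object gives the repr text (b'...') rather than the decoded page, so it is usually better to decode using the correct encoding. For instance, if "latin1" is the encoding then: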

links = linkregex.findall(msg.decode("latin1"))
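
To see how the two approaches differ in Python 3, here is a small offline sketch. The `msg` value is a made-up stand-in for `m.read()`, and I'm assuming the page is UTF-8. Calling `str()` on a bytes object without an encoding returns its repr, so non-ASCII bytes show up as `\x..` escapes, while `decode()` gives the real text.

```python
import re

# Hypothetical stand-in for m.read(); the page is assumed to be UTF-8 here.
msg = '<a href="/caf\u00e9">menu</a>'.encode("utf-8")

linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

as_repr = str(msg)             # repr of the bytes object, starts with b'
as_text = msg.decode("utf-8")  # the actual text of the page

print(as_repr)
print(linkregex.findall(as_repr))  # escapes leak into the match
print(linkregex.findall(as_text))  # the link as it appears in the page
```

With a real response, `m.headers.get_content_charset()` may tell you the encoding the server declared; it returns None when no charset is given, so a fallback is still needed.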

edited Mar 3, 2011 at 18:00

answered Mar 3, 2011 at 17:55

Morten Kristensen

7,6734 gold badges34 silver badges53 bronze badges

2 Comments

John Over a year ago

He says in the comments that he's running 3.1.3, so there is a request.

Morten Kristensen Over a year ago

Indeed, saw that afterwards. So I added the solution for version 3 as well.

Jeremy Whitlock · Accepted Answer · 2011-03-03 17:54:15Z

1

Well, my version of Python doesn't have a urllib with a request attribute, but if I use "urllib.urlopen(url)" I don't get back a string, I get an object. That is the cause of the type error.

answered Mar 3, 2011 at 17:54

Jeremy Whitlock

3,81828 silver badges16 bronze badges

8 Comments

Jeremy Whitlock Over a year ago

Here is the link to docs backing this up: docs.python.org/library/urllib.html#urllib.urlopen

John Over a year ago

Those are docs for 2.7. The OP says in the comments that he's using 3.1.3.

Jeremy Whitlock Over a year ago

John, read the docs. The API is still the same.

John Over a year ago

My point is, your version doesn't have the request attribute, but the OP's version does. You are correct on the cause of the type error.

Jeremy Whitlock Over a year ago

Yeah, the version was mentioned after I put my answer up. ;)

John · Accepted Answer · 2011-03-03 18:17:13Z

1

The URL you have for Google didn't work for me, so I substituted http://www.google.com/ig?hl=en, which does.

Try this:

import re
import urllib.request

url="http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(str(msg))
print(links)

Hope this helps.

edited Mar 3, 2011 at 18:17

answered Mar 3, 2011 at 18:04

John

16.1k13 gold badges48 silver badges65 bronze badges

1 Comment

Lennart Regebro Over a year ago

This only works if your system Python default encoding is the same as the web page's encoding.

Seppo Enarvi · Accepted Answer · 2013-05-07 14:54:01Z

1

The regular expression pattern and string have to be of the same type. If you're matching a regular string, you need a string pattern. If you're matching a byte string, you need a bytes pattern.

In this case m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are unicode strings, and you need the b modifier to specify a byte string literal:

linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
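
The matching rule can be checked in both directions with a short sketch. The `try_findall` helper and the sample values are made up for illustration; the helper just turns the TypeError into a string so all four combinations can be printed side by side.

```python
import re

text = '<a href="/x">'          # str
data = text.encode("ascii")     # the same content as bytes

str_pattern = re.compile(r'href="(.*?)"')
bytes_pattern = re.compile(rb'href="(.*?)"')


def try_findall(pattern, subject):
    """Return the matches, or the error message if the types do not match."""
    try:
        return pattern.findall(subject)
    except TypeError as exc:
        return "TypeError: " + str(exc)


print(try_findall(str_pattern, text))    # str pattern on str: works
print(try_findall(bytes_pattern, data))  # bytes pattern on bytes: works
print(try_findall(str_pattern, data))    # mixed: TypeError
print(try_findall(bytes_pattern, text))  # mixed: TypeError
```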

answered May 7, 2013 at 14:54

Seppo Enarvi

3,7173 gold badges36 silver badges27 bronze badges

user3022012 · Accepted Answer · 2016-07-16 18:15:40Z

This worked for me in Python 3. Hope this helps.

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, str(htmltext))
    print(titles)
    i+=1

This version also works; here I added b before the regex so that it is a bytes pattern and can be matched against the bytes directly.

import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, htmltext)
    print(titles)
    i+=1
