Regex on unicode string

Question

I am trying to download a few hundred Korean pages like this one:

http://homeplusexpress.com/store/store_view.asp?cd_express=3

For each page, I want to use a regex to extract the "address" field, which in the above page looks like:

＊주소 : 서울시 광진구 구의1동 236-53

So I do this:

>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9\)

I then wrote a script. I set # -*- coding: utf-8 -*- on the .py file and put this:

address = re.search('주소', html)

However, re.search is returning None. I tried with and without the u prefix on the regex string.

Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?

What do you get if you enter '주소' in the shell like you did with html? I get '\xec\xa3\xbc\xec\x86\x8c', which re can use no problem.
– Rob Watts, Commented May 9, 2014 at 20:47
@RobWatts: IDLE gives me: "Unsupported characters in input". PowerShell displays them as boxes and Python evaluates them to '??'.
– RexE, Commented May 9, 2014 at 20:48
The page uses the euc-kr encoding, which is different from utf-8.
– John Smith Optional, Commented May 9, 2014 at 20:48

John Smith Optional · Accepted Answer · 2014-05-09 21:12:00Z

According to the tag in the html document header:

<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

the web page uses the euc-kr encoding.
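
To illustrate why the search in the question came back empty, here is a minimal offline sketch. It assumes the server sends euc-kr bytes and that they get decoded as ISO-8859-1; the sample bytes are built locally rather than fetched, and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
import re

# Bytes for the word as an euc-kr server would send them (sample data, not fetched).
raw = u'주소'.encode('euc-kr')

# Decoding with the wrong codec succeeds silently but yields unrelated Latin-1 characters.
as_latin1 = raw.decode('iso-8859-1')
# Decoding with the codec declared in the page restores the Korean text.
as_euckr = raw.decode('euc-kr')

pattern = u'주소'
found_in_latin1 = re.search(pattern, as_latin1)  # None: the Hangul never appears
found_in_euckr = re.search(pattern, as_euckr)    # a match object

print(found_in_latin1 is None)
print(found_in_euckr is not None)
```

The wrong decode raises no error, which is what makes the problem easy to miss: every byte is a valid ISO-8859-1 character.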

I wrote this code:

# -*- coding: euc-kr -*-

import re

import requests

resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text

address = re.search('주소', html)

print address

Then I saved it in gedit using the euc-kr encoding.

I got a match.

But actually there is an even better solution! You can keep the utf-8 encoding for your files.

# -*- coding: utf-8 -*-

import re

import requests

resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')

resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the 
# requests library couldn't detect it correctly

html = resp.text
# now the html variable holds a unicode instance decoded from the euc-kr bytes

print type(html)

# html is unicode, so we pass re.search a unicode pattern as well
address = re.search(u'주소', html)

print address
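
Since the goal in the question is the address itself and not just the label, a capture group can pull out the text that follows it. This is only a sketch: `extract_address` and `ADDRESS_RE` are hypothetical names, the pattern assumes the label is followed by a colon and the value runs to the next tag or line break, and the sample bytes stand in for `resp.content` (the real page's markup may differ).

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical pattern: label, colon, then everything up to the next tag or line break.
ADDRESS_RE = re.compile(u'주소\\s*:\\s*([^<\\r\\n]+)', re.UNICODE)


def extract_address(raw_bytes, encoding='euc-kr'):
    """Decode raw page bytes and return the text after the address label, or None."""
    html = raw_bytes.decode(encoding, 'replace')
    match = ADDRESS_RE.search(html)
    return match.group(1).strip() if match else None


# Stand-in for resp.content; built locally so the sketch runs without network access.
sample = u'<td>주소 : 서울시 광진구 구의1동 236-53</td>'.encode('euc-kr')
address = extract_address(sample)
print(address is not None)
```

Decoding the raw bytes explicitly sidesteps whatever encoding the HTTP layer guessed, which may be convenient when looping over a few hundred pages.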

GreyZmeem · Accepted Answer · 2014-05-09 21:09:11Z

From the Requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."

If you check the site, you can see that the server response does not declare an encoding. With no charset in the Content-Type header of a text response, Requests falls back to ISO-8859-1, which is why resp.encoding reported that value.

I think the only option in this case is to specify the encoding directly:

# -*- coding: utf-8 -*-

import requests
import re

r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
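
If hard-coding the encoding for every page feels brittle, one possible alternative is to read the charset from the page's own meta tag when the header does not supply one. The sketch below uses only the standard library; `sniff_meta_charset` and `decode_page` are hypothetical helpers, and the regex is a rough heuristic, not a full HTML parser.

```python
# -*- coding: utf-8 -*-
import re

# Rough heuristic: find charset=... inside a <meta> tag near the top of the page.
META_CHARSET_RE = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw_bytes):
    """Return the charset named in a <meta> tag, lowercased, or None if absent."""
    match = META_CHARSET_RE.search(raw_bytes[:2048])
    return match.group(1).decode('ascii').lower() if match else None


def decode_page(raw_bytes, header_charset=None, default='utf-8'):
    """Prefer the header charset, then the meta tag, then a default."""
    encoding = header_charset or sniff_meta_charset(raw_bytes) or default
    # An unknown codec name would raise LookupError here.
    return raw_bytes.decode(encoding, 'replace')


# Sample bytes built locally; with Requests these would come from r.content.
head = b'<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">'
body = u'주소'.encode('euc-kr')
page = decode_page(head + body)
print(sniff_meta_charset(head))
```

With Requests, the same idea could be applied as `r.encoding = sniff_meta_charset(r.content) or r.encoding` before reading `r.text`.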
