I am trying to download a few hundred Korean pages like this one:

http://homeplusexpress.com/store/store_view.asp?cd_express=3

For each page, I want to use a regex to extract the "address" field, which in the above page looks like:

*주소 : 서울시 광진구 구의1동 236-53

So I do this:

>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains a lot of characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9\)

I then wrote a script. I set # -*- coding: utf-8 -*- on the .py file and put this:

address = re.search('주소', html)

However, re.search is returning None. I tried with and without the u prefix on the regex string.

Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?

  • What do you get if you enter '주소' in the shell like you did with html? I get '\xec\xa3\xbc\xec\x86\x8c', which re can use no problem. Commented May 9, 2014 at 20:47
  • @RobWatts: IDLE gives me: "Unsupported characters in input". PowerShell displays them as boxes and Python evaluates them to '??' Commented May 9, 2014 at 20:48
  • The page uses the euc-kr encoding, which is different from utf-8. Commented May 9, 2014 at 20:48

2 Answers

According to the tag in the html document header:

<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

the web page uses the euc-kr encoding.

I wrote this code:

# -*- coding: euc-kr -*-

import re

import requests

resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text

address = re.search('주소', html)

print address

Then I saved it in gedit using the euc-kr encoding.

I got a match.
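A likely explanation for why this matches: when the server sends no charset, resp.text is decoded as ISO-8859-1, which maps every byte to the code point with the same number, so the euc-kr bytes of the pattern line up with the mis-decoded text. The sketch below illustrates that round trip without any network access. It is written for Python 3 (on Python 2 the literals would need the u prefix), and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-

# The label as the server would send it: raw EUC-KR bytes.
raw = '주소'.encode('euc-kr')

# What resp.text holds when the response is decoded as ISO-8859-1:
# one character per byte, none of them Korean.
mis_decoded = raw.decode('iso-8859-1')

# ISO-8859-1 is lossless, so the original bytes can be recovered
# and then decoded with the encoding the page really uses.
repaired = mis_decoded.encode('iso-8859-1').decode('euc-kr')

print(repr(mis_decoded))
print(repaired)
```

This only shows that the text is recoverable; it still depends on the script file and the page sharing one encoding, which is fragile.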

But actually there is an even better solution! You can keep the utf-8 encoding for your files.

# -*- coding: utf-8 -*-

import re

import requests

resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')

resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the 
# requests library couldn't detect it correctly

html = resp.text
# now the html variable holds a unicode object decoded from the euc-kr bytes

print type(html)

# we call re.search with a unicode pattern on the unicode text
address = re.search(u'주소', html)

print address
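
Once the text is decoded correctly, the address itself can be captured rather than just the label. This is a minimal sketch, assuming the address sits on the same line as the label and ends at the next tag or line break; the sample fragment stands in for resp.content and is hypothetical, as is the exact pattern. It uses Python 3 syntax (on Python 2 the literals would need the u prefix).

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical stand-in for resp.content: a fragment encoded the way the page is.
sample_bytes = '<td>*주소 : 서울시 광진구 구의1동 236-53</td>'.encode('euc-kr')

# Equivalent to setting resp.encoding = 'euc-kr' and reading resp.text.
html = sample_bytes.decode('euc-kr')

# Capture everything after "주소 :" up to the next tag or line break.
match = re.search(r'주소\s*:\s*([^<\r\n]+)', html)
address = match.group(1).strip() if match else None

print(address)
```

The real markup around the address may differ, so the character class that ends the capture would need adjusting after inspecting the page.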

From the Requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."

If you inspect the response from this site, you can see that the server does not declare an encoding in its headers.

I think the only option in this case is to specify the encoding directly:

# -*- coding: utf-8 -*-

import requests
import re

r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
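
If hard-coding 'euc-kr' for every page feels brittle, the encoding could instead be read from the page's own meta tag before decoding. The following is a rough sketch, not part of Requests: sniff_meta_charset and META_CHARSET are hypothetical names, the pattern only handles simple meta tags, and the sample bytes stand in for r.content. It uses Python 3 syntax.

```python
import re

# Matches charset=... inside a <meta> tag, quoted or unquoted.
META_CHARSET = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw_html, default='utf-8'):
    """Return the charset named in the first matching <meta> tag, else default."""
    # Only the start of the document is scanned; the tag lives in <head>.
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode('ascii').lower() if match else default


# Hypothetical stand-in for r.content (the undecoded response body).
sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=euc-kr"></head>')
detected = sniff_meta_charset(sample)
print(detected)
```

The detected value could then be assigned to r.encoding before reading r.text, which keeps the rest of the script unchanged.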
