
I'm trying to fetch the content of a web page, parse it, and then save it in a MySQL database.

I already did this for a web page encoded in UTF-8.

But when I tried it with a page encoded in ISO-8859-9, I got an error.

My code to get the content of the page:

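import urllib2
import chardet

def getcontent(url):
    opener = urllib2.build_opener()
    # Set both headers in one list; a second assignment would overwrite the first
    opener.addheaders = [('User-agent', 'Magic Browser'),
                         ('Accept-Charset', 'utf-8')]
    # read() returns a raw byte string in the page's own encoding
    response = opener.open(url).read()
    #print chardet.detect(response).get('encoding')
    opener.close()
    return response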



url     = "http://www.meb.gov.tr/duyurular/index.asp?ID=4"
contentofpage = getcontent(url)
print contentofpage
print chardet.detect(contentofpage)
print contentofpage.encode("utf-8")

Output of the page content: ... E�itim Teknolojileri Genel M�d�rl��� ...

{'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'}


Traceback (most recent call last):
  File "meb.py", line 18, in <module>
    print contentofpage.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

The page is actually Turkish, and its encoding is ISO-8859-9.

When I tried with the default encoding, all I see is ��� instead of some characters. How can I read or convert the content of the page to UTF-8 or Turkish (ISO-8859-9)?

Also, when I use unicode(contentofpage), I get:

Traceback (most recent call last):
  File "meb.py", line 20, in <module>
    print unicode(contentofpage)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

Any help?

1 Answer


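I think you want to decode, not encode: the string returned by read() is already a byte string in the page's encoding. Calling .encode() on a byte string makes Python 2 first decode it implicitly as ASCII, which is what raises the UnicodeDecodeError.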

print contentofpage.decode("iso-8859-9")

yields a sample like:

Eğitim Teknolojileri Genel Müdürlüğü
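
If the goal is to end up with UTF-8 (for example before saving to MySQL), a rough sketch of the round trip could look like this. The byte string `raw` is a hypothetical stand-in for what getcontent(url) returns, and the names `raw`, `text` and `utf8_bytes` are only for illustration; it assumes the page really is ISO-8859-9.

```python
from __future__ import print_function

# Hypothetical sample: the phrase from the page as ISO-8859-9 bytes
# (0xf0 is "g with breve", 0xfc is "u with diaeresis" in that encoding).
raw = b'E\xf0itim Teknolojileri Genel M\xfcd\xfcrl\xfc\xf0\xfc'

# Step 1: bytes in the source encoding -> unicode text
text = raw.decode('iso-8859-9')

# Step 2: unicode text -> bytes in the target encoding
utf8_bytes = text.encode('utf-8')

# repr() is used so the output does not depend on the terminal encoding
print(repr(text))
print(repr(utf8_bytes))
```

Decoding first and encoding second keeps the two steps explicit, so Python never has to guess an encoding on your behalf.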

2 Comments

With print contentofpage.decode("iso-8859-9") I now get: UnicodeEncodeError: 'ascii' codec can't encode character u'\xee' in position 458: ordinal not in range(128)
Make sure you are decoding directly after getting the content. contentofpage = getcontent(url), then print contentofpage.decode('iso-8859-9').
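
A possible explanation for the UnicodeEncodeError in the first comment, assuming stdout is ASCII-only (for example when the locale is unset or the output is piped): printing a unicode object in Python 2 encodes it implicitly with the stdout encoding. The sketch below reproduces that step by hand; `sample`, `ascii_failed` and `utf8_out` are hypothetical names, not part of the original script.

```python
from __future__ import print_function

# Hypothetical decoded text containing a non-ASCII character (U+011F)
sample = u'E\u011fitim'

# Roughly what print does on an ASCII-only stdout: encode with 'ascii'
ascii_failed = False
try:
    sample.encode('ascii')
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly avoids the implicit ASCII step
utf8_out = sample.encode('utf-8')

print(ascii_failed)
print(repr(utf8_out))
```

So in the Python 2 script, print contentofpage.decode('iso-8859-9').encode('utf-8') should avoid the error, provided the terminal displays UTF-8.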
