Python URL encoding with umlauts error

Question

I am reading the content of a web page and checking for a word that contains an umlaut. The word is present in the page content, but Python's find('ü') does not find it.

import urllib2
opener = urllib2.build_opener()
page_content = opener.open(url).read() 
page_content.find('ü')

I have tried to convert the search string with u'ü'. Then the error is

'SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xfc in position 0'

I have used # -*- coding: utf-8 -*- in my .py file.

I have printed page_content. There the umlaut ü shows up as 'Ã¼'. If I search with page_content.find('Ã¼'), it works fine. Please let me know if there is a better solution for this.

I would greatly appreciate any suggestions.

What editor are you using? When you save the file, make sure you save it in UTF-8 encoding (almost all editors have this option). The coding: utf-8 line at the beginning of the file tells the interpreter that the file is UTF-8, but it doesn't make the file UTF-8 encoded unless you save it that way yourself.
– Ioan Alexandru Cucu, Commented Jul 26, 2012 at 11:33
Check the position of the coding line: it must be the first or second line of the file.
– Maria Zverina, Commented Jul 26, 2012 at 11:37
@MariaZverina That won't work. Even though he will no longer get the error, page_content.find('ü') will always return -1, even though the page does contain ü. As said above, he must save the file in UTF-8 for things to work. The coding declaration by itself isn't sufficient.
– Ioan Alexandru Cucu, Commented Jul 26, 2012 at 11:46
@IoanAlexandruCucu page_content.find(u'ü') should work though ...
– Maria Zverina, Commented Jul 26, 2012 at 12:01

Fred Foo · Accepted Answer · 2012-07-26 11:49:39Z

2

Your Python tries to parse the source file (or console input) as UTF-8, but it's actually encoded in Latin-1. You could try to put a

# coding: iso-8859-1

comment at the top of the source file, or better, use an editor/terminal emulator that supports UTF-8 and save your scripts in that encoding.
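To see why the interpreter complains about byte 0xfc, it can help to compare how ü is stored in each encoding. The following is a minimal sketch rather than part of the original answer. The variable names are illustrative, and the character is written as the escape \u00fc so the snippet does not depend on how its own file is saved.

```python
# Illustration only: one character, two different byte forms.
umlaut = u'\u00fc'  # the letter ü, written as an escape

latin1_bytes = umlaut.encode('iso-8859-1')  # a single byte, 0xfc
utf8_bytes = umlaut.encode('utf-8')         # two bytes, 0xc3 0xbc

# A lone 0xfc byte is not valid UTF-8. This is the situation the
# SyntaxError describes: a Latin-1 file being parsed as UTF-8.
try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The reverse mistake: UTF-8 bytes read as Latin-1 become two characters.
mojibake = utf8_bytes.decode('iso-8859-1')
```

Here decoded_ok ends up False, and mojibake holds the two characters U+00C3 and U+00BC, which is the usual sign of UTF-8 text being displayed as Latin-1.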

edited Jul 26, 2012 at 11:49

answered Jul 26, 2012 at 11:29

Fred Foo


1 Comment

Ioan Alexandru Cucu Over a year ago

Or even better, you could keep the coding: utf-8 and actually save the file in UTF-8 rather than Latin-1

Maria Zverina · Accepted Answer · 2012-07-26 11:49:13Z

0

If you declare the UTF-8 encoding at the top of the file as follows, things should work. Please note that the coding line must be either the first line, or the second line after the hashbang.

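#!/usr/bin/python
# coding: utf-8

import urllib2

url = 'http://en.wikipedia.org/wiki/Germanic_umlaut'
opener = urllib2.build_opener()
page_content = opener.open(url).read()  # a byte string in Python 2
# Decode before searching (assuming the page is served as UTF-8).
# Mixing the raw bytes with u'ü' triggers an implicit ASCII decode,
# which raises UnicodeDecodeError on any non-ASCII byte in the page.
page_content.decode('utf-8').find(u'ü')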
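The same idea can be tried without a network connection. This is a rough sketch under stated assumptions: page_bytes is a made-up sample standing in for the result of read(), find_text is a hypothetical helper, and the page is assumed to be UTF-8 (a real page may declare a different charset in its headers).

```python
# Stand-in for opener.open(url).read(): raw UTF-8 bytes (made-up sample).
page_bytes = u'<p>Gr\u00fc\u00dfe aus M\u00fcnchen</p>'.encode('utf-8')


def find_text(raw_bytes, needle, encoding='utf-8'):
    """Decode raw bytes to text, then search for a unicode needle.

    Returns the character index of the first match, or -1 if absent.
    """
    text = raw_bytes.decode(encoding)
    return text.find(needle)


position = find_text(page_bytes, u'\u00fc')  # search for ü
missing = find_text(page_bytes, u'\u00e4')   # search for ä, not in the sample

# Alternative: stay at the byte level and encode the needle instead.
byte_offset = page_bytes.find(u'\u00fc'.encode('utf-8'))
```

Decoding once and working with text throughout is usually the simpler choice. The byte-level search also works, but it returns a byte offset, which can differ from the character index once multi-byte characters come before the match.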

edited Jul 26, 2012 at 11:49

answered Jul 26, 2012 at 11:34

Maria Zverina
