I am reading the content of a web page and checking for a word that contains an umlaut. The word is present in the page content, but Python's find('ü') does not find it.

import urllib2
opener = urllib2.build_opener()
page_content = opener.open(url).read() 
page_content.find('ü')

I have tried writing the search string as u'ü' instead. That gives this error:

'SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xfc in position 0'

I have used # -*- coding: utf-8 -*- in my .py file.

I have printed page_content. There the umlaut ü shows up as 'Ã¼'. If I search with page_content.find('Ã¼'), it works fine. Please let me know if there is a better solution for this.

I would greatly appreciate any suggestions.

  • What editor are you using? When you save the file, make sure you save it in UTF-8 encoding (almost all editors have this option). Putting coding: utf-8 at the beginning of the file tells the interpreter that the file is UTF-8, but it doesn't make the file UTF-8 encoded unless you actually save it that way. Commented Jul 26, 2012 at 11:33
  • Check the position of the coding line: it must be the first or second line of the file. Commented Jul 26, 2012 at 11:37
  • @MariaZverina That won't work... Even though he will no longer get the error, page_content.find('ü') will always return -1, even though the page does contain ü. As said above, he must save the file in UTF-8 for things to work. The coding declaration by itself isn't sufficient. Commented Jul 26, 2012 at 11:46
  • @IoanAlexandruCucu page_content.find(u'ü') should work though ... Commented Jul 26, 2012 at 12:01
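
The point made in these comments, that the coding declaration and the bytes actually stored in the file are two separate things, can be checked directly. Here is a minimal sketch; file_is_utf8 is a hypothetical helper name, and the two temporary files merely stand in for a script saved by an editor in one encoding or the other. It should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def file_is_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with io.open(path, 'rb') as handle:
        data = handle.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# The same line of source text, saved once per encoding.
# The escape \u00fc is the character u-umlaut.
source_line = u"needle = u'\u00fc'\n"

results = {}
for encoding in ('utf-8', 'latin-1'):
    fd, path = tempfile.mkstemp(suffix='.py')
    os.close(fd)
    try:
        with io.open(path, 'w', encoding=encoding) as handle:
            handle.write(source_line)
        results[encoding] = file_is_utf8(path)
    finally:
        os.remove(path)
```

If the check returns False for your own script, the file was most likely saved in a legacy encoding such as Latin-1, and a coding: utf-8 line on top of it will produce exactly the kind of decode error shown in the question.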

2 Answers

Your Python tries to parse the source file (or console input) as UTF-8, but it's actually encoded in Latin-1. You could try to put a

# coding: iso-8859-1

comment at the top of the source file, or better, use an editor/terminal emulator that supports UTF-8 and save your scripts in that encoding.
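
To see why byte 0xfc is rejected, it helps to compare how the same character is stored in the two encodings. This is only an illustrative sketch: the variable names are made up for the example, and escapes are used instead of a literal ü so the snippet behaves the same however it is saved.

```python
# The character u-umlaut, written as an escape so the source stays pure ASCII.
text = u'\u00fc'

latin1_bytes = text.encode('latin-1')  # one byte: 0xFC
utf8_bytes = text.encode('utf-8')      # two bytes: 0xC3 0xBC


def decodes_as_utf8(data):
    """Return True if the byte string is valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xFC byte is not a valid UTF-8 sequence, which is what the
# interpreter complains about when a Latin-1 file is declared as UTF-8.
latin1_ok = decodes_as_utf8(latin1_bytes)
utf8_ok = decodes_as_utf8(utf8_bytes)

# Going the other way, UTF-8 bytes read as Latin-1 turn into two
# characters (U+00C3 followed by U+00BC), the usual mojibake pattern.
mojibake = utf8_bytes.decode('latin-1')
```

Here latin1_ok ends up False and utf8_ok ends up True, which matches the error message: the parser expected UTF-8 but met the single Latin-1 byte for ü.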

1 Comment

Or even better, you could keep the coding: utf-8 and actually save the file in UTF-8 rather than Latin-1

If you declare the UTF-8 encoding at the top of the file as follows, and the file is actually saved as UTF-8, things should work. Please note that the coding line must be either the first line, or the second line after the hashbang.

#!/usr/bin/python
# coding: utf-8

import urllib2

url = 'http://en.wikipedia.org/wiki/Germanic_umlaut'
opener = urllib2.build_opener()
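# Decode the raw bytes first: in Python 2, searching a byte string for u'ü'
# triggers an implicit ASCII decode, which fails on non-ASCII pages.
page_content = opener.open(url).read().decode('utf-8')  # assumes the page is UTF-8
page_content.find(u'ü')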
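
For an offline illustration of the search itself, here is a small sketch that should run on both Python 2.7 and Python 3. The sample bytes are an assumption that stands in for what opener.open(url).read() returns from a UTF-8 page, and the variable names are made up for the example. The key point is that the page and the search string must be the same kind of object: either both text, or both bytes in the same encoding.

```python
# Stand-in for the downloaded page: UTF-8 encoded bytes (assumed encoding).
page_bytes = u'<p>Der M\u00fcller sagt Gr\u00fc\u00dfe</p>'.encode('utf-8')

needle = u'\u00fc'  # the character u-umlaut

# Option 1: decode the page to text, then search with a text needle.
page_text = page_bytes.decode('utf-8')
char_index = page_text.find(needle)

# Option 2: keep the page as bytes and encode the needle the same way.
byte_index = page_bytes.find(needle.encode('utf-8'))

# A needle encoded as Latin-1 (the single byte 0xFC) never matches
# a UTF-8 page, so find() reports -1.
wrong_index = page_bytes.find(needle.encode('latin-1'))
```

Option 1 is usually the more convenient choice, since every later string operation then works on characters rather than raw bytes. In real code the encoding should come from the response's Content-Type header or the page's meta tag instead of being hard-coded.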
