I'm trying to scrape a website that contains Unicode characters. I put -*- coding: utf-8 -*- at the very beginning of the file and I also used the re.UNICODE flag:

pattern = re.compile('(?:{}|{})'.format(regex, regex1), re.UNICODE)

However, when I print the output I still get weird characters like �.

How do I fix that? Thanks!

  • You might get the � glyph simply because your font doesn't support the respective Unicode character. Commented Mar 25, 2013 at 23:16
  • It does a hundred percent. Commented Mar 26, 2013 at 7:44
  • You have to decode the UTF-8 text from the website first. See this question, for example. Commented Mar 27, 2013 at 0:48

2 Answers

Just because a page has non-Latin characters doesn't mean it's encoded in a Unicode encoding (and if it is, which one? UTF-8? UTF-16?).

Additionally, re.UNICODE probably doesn't do what you think it does. From the docs:

Make `\w, \W, \b, \B, \d, \D, \s` and `\S` dependent on the Unicode character properties database.

All this means is that these specific character classes are more broadly defined; the flag has no effect on how the source text is decoded.
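To make the effect of the flag concrete, here is a minimal sketch. It assumes Python 3, where str patterns are already Unicode-aware and re.ASCII switches that off, so the contrast runs the other way round from Python 2 (where re.UNICODE had to be requested). The sample text is made up for illustration.

```python
import re

# Made-up sample text: "naïve café", written with escapes so the
# example does not depend on the encoding of this file.
text = "na\u00efve caf\u00e9"

# Python 3 str patterns: \w follows the Unicode character database.
unicode_words = re.findall(r"\w+", text)

# re.ASCII narrows \w to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```

In both cases the input string is identical; only what \w is allowed to match changes.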

Moreover, the coding declaration, -*- coding: utf-8 -*-, only specifies the encoding of your source file, not the encoding of any data your program reads at run time.

Finally, as noted in one of the comments, the � can be the result of using a character which is not supported by the current typeface. This, in turn, can be the result of assuming a certain encoding while the text is encoded in a different encoding.
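A short sketch of how an encoding mismatch produces this kind of output, assuming Python 3. The byte strings here are made up to stand in for a downloaded page; with a real scraper you would decode the response bytes using whatever encoding the server or the page itself declares.

```python
# Hypothetical page content: "café" as a server might send it in UTF-8.
original = "caf\u00e9"
utf8_bytes = original.encode("utf-8")

# Decoding with the right codec recovers the text.
correct = utf8_bytes.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 succeeds but yields mojibake ("cafÃ©").
mojibake = utf8_bytes.decode("latin-1")

# The reverse mistake: Latin-1 bytes decoded as UTF-8. The lone 0xE9 byte
# is invalid UTF-8, so errors="replace" substitutes U+FFFD, the � glyph.
latin1_bytes = original.encode("latin-1")
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(correct == original)  # True
print(repr(mojibake))
print(repr(replaced))
```

If the output contains � or sequences like Ã©, the usual cause is that the bytes were decoded with a different encoding than the one they were written in.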

1 Comment

Thanks, but I'm still struggling. I'm certain it's UTF-8.

This may not be an "answer" per se, but you could try using http://www.debuggex.com to debug your regexp a bit.

2 Comments

You should leave this as a comment instead of an answer then.
Not sure why (probably because my Stack Overflow reputation isn't high enough?), but I don't seem to have the option to leave comments on anything except my own answer. That doesn't seem right, though.
