Just because a page has non-Latin characters doesn't mean it's encoded in Unicode (and if it is, which Unicode encoding? UTF-8? UTF-16?).
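To illustrate why "which encoding?" matters, here is a small sketch, assuming Python 3. The sample character and variable names are only for illustration. The same character becomes different bytes depending on the encoding chosen:

```python
# One character, U+00E9 (e with acute accent), written as an escape so
# the example does not depend on this file's own encoding.
text = "\u00e9"

utf8_bytes = text.encode("utf-8")        # two bytes
utf16_bytes = text.encode("utf-16-le")   # two different bytes (little-endian, no BOM)
latin1_bytes = text.encode("latin-1")    # one byte, and not a Unicode encoding at all

print(utf8_bytes, utf16_bytes, latin1_bytes)
```

Nothing in the bytes themselves says which of these was used, so the encoding has to come from somewhere else (an HTTP header, a meta tag, or an educated guess).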
Additionally, re.UNICODE probably doesn't do what you think it does. From the docs:
Make `\w, \W, \b, \B, \d, \D, \s` and `\S` dependent on the Unicode character properties database.
All this means is that these specific character classes are defined more broadly; the flag has no effect on the source text or on how it is decoded.
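A rough sketch of what the flag changes, assuming Python 3, where `str` patterns are already Unicode-aware by default and `re.ASCII` switches that off (in Python 2 you had to pass `re.UNICODE` explicitly to get the broader behaviour). The sample text is made up:

```python
import re

# Already-decoded text; accented letters written as escapes.
text = "na\u00efve caf\u00e9"

# Unicode-aware \w: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

# ASCII-only \w: accented letters split or truncate the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

In both cases the input string is identical; only the definition of `\w` differs. The flag never decodes or re-encodes anything.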
Moreover, the coding declaration, `-*- coding: utf-8 -*-`, only specifies the encoding of your source file; it says nothing about the encoding of the page you are reading.
Finally, as noted in one of the comments, the � can be the result of using a character that is not supported by the current typeface. This, in turn, can be the result of assuming one encoding while the text is actually in another.
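The encoding mismatch can be reproduced with a short sketch, assuming Python 3. The sample bytes stand in for a hypothetical page served as Latin-1:

```python
# Bytes as they might arrive from a page that is really Latin-1.
raw = "caf\u00e9".encode("latin-1")

# Wrongly assuming UTF-8: the invalid byte is replaced with U+FFFD,
# which is the character that displays as the replacement glyph.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in.
right = raw.decode("latin-1")

print(repr(wrong), repr(right))
```

Without `errors="replace"`, the first decode would raise `UnicodeDecodeError` instead, which is often the more useful signal that the assumed encoding is wrong.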