Finding regex, unicode patterns

Question

I'm trying to scrape a website that contains Unicode characters. I put -*- coding: utf-8 -*- at the very beginning of my file, and I also pass the re.UNICODE flag:

pattern = re.compile('(?:{}|{})'.format(regex, regex1), re.UNICODE)

However, when I print the output I still get weird characters like �

How do I fix that? Thanks!

You might get the � glyph simply because your font doesn't support the respective Unicode character.
– nwellnhof, Commented Mar 25, 2013 at 23:16
You have to decode the UTF-8 text from the website first. See this question, for example.
– nwellnhof, Commented Mar 27, 2013 at 0:48

beerbajay · Accepted Answer · 2013-03-25 23:25:41Z

4

Just because a page has non-Latin characters doesn't mean it's encoded in a Unicode encoding (and if it is, which one? UTF-8? UTF-16?).

Additionally, re.UNICODE probably doesn't do what you think it does. From the docs:

Make `\w, \W, \b, \B, \d, \D, \s` and `\S` dependent on the Unicode character properties database.

All this means is that these specific character classes are more broadly defined; the flag does not decode or otherwise change the text you are matching against.
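As a rough illustration of what the flag changes, the sketch below runs the same \w+ pattern with and without Unicode-aware character classes. It assumes Python 3, where str patterns are already Unicode-aware by default, so re.ASCII is used to show the narrower behaviour. The sample string is made up for illustration.

```python
import re

# Made-up sample text; the escapes are precomposed e-acute and i-diaeresis.
text = "caf\u00e9 na\u00efve 123"

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# With re.UNICODE, \w follows the Unicode character database, so the accented
# letters count as word characters.
unicode_words = re.findall(r"\w+", text, re.UNICODE)

print(ascii_words)
print(unicode_words)
```

In both cases the input string is exactly the same; only the definition of \w differs.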

Moreover, the coding declaration, -*- coding: utf-8 -*-, only specifies the encoding of your source file. It says nothing about the encoding of the data your script downloads.

Finally, as noted in one of the comments, the � can appear when a character is not supported by the current typeface. It can also appear when the text is decoded with one encoding while it was actually encoded in a different one.
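To make the encoding mismatch concrete, here is a minimal sketch, assuming Python 3 and assuming the page body arrives as UTF-8 bytes. The variable raw stands in for a hypothetical HTTP response body; it is not taken from the question. Decoding those bytes with the wrong codec produces U+FFFD (�), while decoding them as UTF-8 first gives text that the regex can match normally.

```python
import re

# Hypothetical response body: UTF-8 bytes, as a scraper might receive them.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Wrong codec: every byte that is not valid ASCII becomes U+FFFD.
wrong = raw.decode("ascii", errors="replace")

# Right codec: the original characters are recovered.
right = raw.decode("utf-8")

# Match against the decoded text, not against the raw bytes.
pattern = re.compile(r"\w+", re.UNICODE)
words = pattern.findall(right)

print(repr(wrong))
print(words)
```

If the real page uses a different encoding, the codec name passed to decode has to match it. The Content-Type header or a meta tag in the page is usually where that is declared.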


1 Comment

nutship Over a year ago

Thanks, but I'm still struggling. I'm certain it's UTF-8.

relic · Accepted Answer · 2013-03-25 22:19:30Z

1

This may not be an "answer" per se, but you could try using http://www.debuggex.com to debug your regexp a bit.


2 Comments

beerbajay Over a year ago

You should leave this as a comment instead of an answer then.

relic Over a year ago

Not sure why (probably because my Stack Overflow reputation isn't high enough?), but I don't seem to have the option to leave comments on anything except my own answer. That doesn't seem right, though.
