I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "🙂 [°] \n € dsf $ ¬ 1 Ä 2 t3¥4Ú";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output: ____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
expected: _ [_] _ _ dsf _ _ 1 _ 2 t3_4_
But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.
Maybe an additional problem:
In the actual problem the strings is read from a unicode file by another python module and I do not know if it handles the unicodeness correctly. So may be the string variable is marked as ascii but contains unicode sequences.
/[\u007F-\uFFFF]/, works fine in javascript..[$@~]` the whole thing can be replaced with[^\x20-\x7e]but this also will match control char's as well.