I have a Chinese document that contains many error strings left over from a decoding error. They all look like fffd, ff10, and so on.
Now I need to remove every occurrence of those error strings, so I need to know their pattern, but I can't find any useful information. All I seem to know is that they consist of 4 characters and start with 'ff', but the last two characters are uncertain.
For example, the error string may look like: 300dfffd or afffdnormalff0cword.
What I want from the two strings above is: 300d and anormalword.
I cannot simply delete every four-character sequence that starts with ff, since there are normal words that start with ff too.
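To make the problem concrete, here is a minimal sketch of the naive approach I want to avoid. The pattern (ff followed by any two hex digits), the name strip_naive, and the word "coffee" are assumptions made up for illustration; they are not taken from my real data.

```python
import re

# Naive pattern (an assumption for illustration): "ff" followed by any two hex digits.
NAIVE = re.compile(r"ff[0-9a-f]{2}")

def strip_naive(text):
    """Remove every 'ffXX' run, with no attempt to spare real words."""
    return NAIVE.sub("", text)

print(strip_naive("300dfffd"))             # 300d
print(strip_naive("afffdnormalff0cword"))  # anormalword
print(strip_naive("coffee"))               # co  (a real word gets damaged)
```

It gives the result I want on the two examples, but it also eats part of an ordinary word, which is why I need something more selective.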
Is there a single re pattern that can represent them? Or is there another recommended way to do this? Thanks.
BTW, I'm doing this in Python, so any Pythonic way is highly appreciated!
UPDATE:
I ended up using the pattern ff(fd|\d\w|\w\d), which removed almost all of the errors.
Some errors, such as ff07 and ff50, were not removed, which is strange since the pattern should have matched them, but that small number of leftover errors is within my tolerance.
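For reference, this is roughly how the pattern from the update can be applied with re.sub. It is only a sketch: the name strip_errors and the test strings are hypothetical, and I switched to a non-capturing group (?:...), which does not change what is matched.

```python
import re

# The pattern from the update: "ff" followed by "fd",
# or by a digit then a word character, or by a word character then a digit.
ERROR_RUN = re.compile(r"ff(?:fd|\d\w|\w\d)")

def strip_errors(text):
    """Delete every substring that matches the error pattern."""
    return ERROR_RUN.sub("", text)

print(strip_errors("300dfffd"))             # 300d
print(strip_errors("afffdnormalff0cword"))  # anormalword
print(strip_errors("coffee"))               # coffee (left alone)
print(repr(strip_errors("ff07")))           # ''
print(repr(strip_errors("ff50")))           # ''
```

In isolation, ff07 and ff50 are both matched and removed, so the leftovers in my document presumably come from something in the actual data that this sketch does not reproduce.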