I have a Chinese document that contains many error strings left over from a decoding error. They all look like fffd, ff10, or something similar.

Now I need to remove every occurrence of those error strings, so I need to know their pattern, but I can't find any useful information. All I seem to know is that they consist of 4 characters and start with 'ff'; the last two characters vary.

For example, text containing the error strings may look like: 300dfffd or afffdnormalff0cword.

What I want from the two examples above is: 300d and anormalword.

I cannot delete every four-letter pattern that starts with ff, since there are normal words that start with those letters.

Is there a single re pattern that can represent them? Or is there another recommended way? Thanks.

BTW, I'm doing this in Python, so any Pythonic way is highly appreciated!

Thanks.

UPDATE:

I ended up using the pattern ff(fd|\d\w|\w\d), which removed almost all of the errors.

Some errors, such as ff07 and ff50, were not removed, which is strange since the pattern should have matched them, but that small number of remaining errors is within my tolerance.
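
For reference, here is a minimal sketch of how the pattern from the update could be applied with Python's re module. It assumes the error codes appear as literal four-character text, as in the examples in the question; the function name strip_error_codes is only illustrative.

```python
import re

# Pattern quoted in the update: "ff" followed by "fd", or by a digit plus a
# word character, or by a word character plus a digit.
ERROR_CODE = re.compile(r"ff(fd|\d\w|\w\d)")

def strip_error_codes(text):
    """Remove every literal four-character code matched by ERROR_CODE."""
    return ERROR_CODE.sub("", text)

print(strip_error_codes("300dfffd"))
print(strip_error_codes("afffdnormalff0cword"))
```

On literal text this pattern also matches ff07 and ff50, so whatever survived was presumably not in exactly that form. Note that the pattern cannot tell an error code from an ordinary word that happens to contain the same letters, so some false positives are possible.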

  • Can you show specific examples of the input you started with and how you are processing it? Commented Jun 27, 2012 at 10:43
  • They are most likely Unicode. Throwing them out is unlikely to be a good solution. Commented Jun 27, 2012 at 10:45
  • @gnibbler You're right, they're Unicode. But for my purpose, I just need to remove them. Commented Jun 27, 2012 at 10:48

2 Answers

Not all of the characters you talk about are errors. U+FFFD is the replacement character, which means that some decoding step couldn't find a character to use. U+FF0C is a full-width comma and U+FF10 is a full-width zero. Both are perfectly valid characters, and likely ones you want to keep.
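
One way to check what a given code point actually is, sketched here with the standard unicodedata module, is to print its official Unicode name. The helper name describe is an assumption made for illustration.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<no name>"))

# The three characters mentioned above.
for ch in u"\ufffd\uff0c\uff10":
    print(describe(ch))
```

Anything reported as REPLACEMENT CHARACTER marks a real decoding failure; the FULLWIDTH entries are ordinary punctuation and digits.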

You could remove them if you like, though note that this drops every character the target codec cannot represent, not just these:

doc = mydoc.encode('charmap', 'ignore')

If you have specific Unicode characters you don't like, then:

bad = set(u"\ufffd\uff10\uff0c") # etc
mydoc = u"".join(c for c in mydoc if c not in bad)
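
To decide what belongs in bad, one option is to tally the suspicious characters first and then remove only the ones you confirm as errors. The sketch below is a rough illustration: the range U+FF00 to U+FFFF, the sample text, and the function names are all assumptions, not something taken from the question.

```python
from collections import Counter

def count_suspects(text, low=0xFF00, high=0xFFFF):
    """Tally characters whose code points fall in [low, high] (assumed range)."""
    return Counter(ch for ch in text if low <= ord(ch) <= high)

def remove_chars(text, bad):
    """Drop every character that appears in the set `bad`."""
    return u"".join(ch for ch in text if ch not in bad)

sample = u"a\ufffdnormal\uff0cword\ufffd"  # made-up sample text

# List each suspicious character with how often it occurs.
for ch, n in count_suspects(sample).most_common():
    print("U+%04X x%d" % (ord(ch), n))

# Remove only the replacement character; the full-width comma is kept.
print(remove_chars(sample, {u"\ufffd"}))
```

Reviewing the tally before deleting anything makes it easier to keep valid full-width punctuation while still dropping the replacement characters.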

3 Comments

OK, I've updated the answer for removing arbitrary characters.
The problem now is I don't know all the patterns I need to remove. Once I know how to detect them, I know how to process them.
Sorry, but then you haven't told us how to tell which are errors and which are real. As you discover bad characters, add them to bad.

I ended up using the pattern ff(fd|\d\w|\w\d), which removed all but a few of the errors.

Some errors, such as ff07 and ff50, were not removed, which is strange since the pattern should have matched them, but that small number of remaining errors is within my tolerance.
