How to detect coding error strings?

Question

I have a Chinese document that contains a lot of error strings left over from decoding errors. They all look like fffd, ff10, or something similar.

Now I need to remove every occurrence of those error strings, so I need to know their pattern, but I can't find any useful information. All I know so far is that they consist of 4 characters and start with 'ff'; the last two characters vary.

For example, the error string may look like: 300dfffd or afffdnormalff0cword.

For the two strings above, the results I want are 300d and anormalword.

I cannot simply delete every four-character pattern that starts with ff, since some normal words also start with it.

Is there a single re pattern that can represent them? Or is there any other way recommended? Thanks.

BTW, I'm doing this in Python, so any Pythonic way is highly appreciated!

Thanks.

UPDATE:

I ended up using the pattern ff(fd|\d\w|\w\d), which removed almost all of the errors.

Some errors, such as ff07 and ff50, were not removed. That is strange, since the re pattern should have matched them, but the small number of remaining errors is within my tolerance.

Can you show specific examples of the input you started with and how you are processing it?
– Ned Batchelder, Commented Jun 27, 2012 at 10:43
They are most likely unicode. Throwing them out is unlikely to be a good solution.
– John La Rooy, Commented Jun 27, 2012 at 10:45
@gnibbler You're right, they're Unicode. But for my purpose, I just need to remove them.
– Derrick Zhang, Commented Jun 27, 2012 at 10:48

Ned Batchelder · Accepted Answer · 2012-06-27 11:02:01Z

2

Not all of the characters you talk about are errors. U+FFFD is the replacement character, which means that some decoding step couldn't find a character to use. U+FF0C is a full-width comma and U+FF10 is a full-width zero. Both are perfectly valid characters, and likely ones you want to keep.
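To check what a given code point actually is, the standard `unicodedata` module can look up its name. This is a minimal sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# The three characters mentioned in the question.
for code, name in describe(u"\ufffd\uff0c\uff10"):
    print(code, name)
# U+FFFD REPLACEMENT CHARACTER
# U+FF0C FULLWIDTH COMMA
# U+FF10 FULLWIDTH DIGIT ZERO
```

Only the first of these signals a decoding problem; the other two are ordinary full-width punctuation and digits.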

You could remove them if you like:

doc = mydoc.encode('charmap', 'ignore')

If you have specific Unicode characters you don't like, then:

bad = set(u"\ufffd\uff10\uff0c") # etc
mydoc = u"".join(c for c in mydoc if c not in bad)
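If you don't yet know which characters belong in `bad`, one option is to count what actually occurs in the suspicious range and decide from there. The sketch below assumes the document is already a decoded Unicode string; the helper names and the sample text are illustrative only.

```python
from collections import Counter

REPLACEMENT = u"\ufffd"

def count_ff_block(text):
    """Count characters in U+FF00..U+FFFF (full-width forms and specials)."""
    return Counter(ch for ch in text if u"\uff00" <= ch <= u"\uffff")

def strip_replacement(text):
    """Remove only U+FFFD, leaving valid full-width characters alone."""
    return text.replace(REPLACEMENT, u"")

sample = u"a\ufffdnormal\uff0cword\ufffd"  # made-up sample text

# Inspect the candidates before deciding what goes into `bad`.
for ch, n in count_ff_block(sample).most_common():
    print("U+%04X" % ord(ch), n)

cleaned = strip_replacement(sample)
```

Here `cleaned` still contains the full-width comma; only the two replacement characters are gone.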

edited Jun 27, 2012 at 11:02

answered Jun 27, 2012 at 10:49

Ned Batchelder


3 Comments

Ned Batchelder Over a year ago

OK, I've updated the answer for removing arbitrary characters.

Derrick Zhang Over a year ago

The problem now is I don't know all the patterns I need to remove. Once I know how to detect them, I know how to process them.

Ned Batchelder Over a year ago

Sorry, then, but you haven't told us how to tell which are errors and which are real. As you discover bad characters, add them to bad.

Derrick Zhang · Accepted Answer · 2012-06-28 12:14:05Z

0

I ended up using the pattern ff(fd|\d\w|\w\d), which removed all but a few of the errors.

Some errors, such as ff07 and ff50, were not removed. That is strange, since the re pattern should have matched them, but the small number of remaining errors is within my tolerance.
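For reference, here is a small check of that pattern against the examples from the question, assuming the codes appear as literal ASCII text. The function name `strip_codes` and the last two inputs are made up for illustration.

```python
import re

# The pattern from this answer, applied to literal text such as "fffd".
PATTERN = re.compile(r"ff(fd|\d\w|\w\d)")

def strip_codes(text):
    """Delete every substring matched by the pattern."""
    return PATTERN.sub("", text)

print(strip_codes("300dfffd"))             # 300d
print(strip_codes("afffdnormalff0cword"))  # anormalword

# In isolation, ff07 and ff50 are matched too; only the space remains.
print(repr(strip_codes("ff07 ff50")))      # ' '

# The pattern also eats ordinary text that happens to fit it.
print(strip_codes("stuff2do"))             # stuo
```

So the pattern itself does match ff07 and ff50 when they stand alone. Why they survived in the real document can't be told without seeing the actual input.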

answered Jun 28, 2012 at 12:14

Derrick Zhang
