
Can this little routine be made faster? With the elifs, a comprehension gets out of hand, but maybe I haven't tried it the right way.

import unicodedata

def cleanup(s):
    strng = ''
    good = ['\t', '\r', '\n']
    for char in s:        
        if unicodedata.category(char)[0]!="C":
            strng += char
        elif char in good:
            strng += char
        elif char not in good:
            strng += ' '
    return strng
  • At least you could speed it up a lot by changing elif char not in good: to else:. If you want someone to maybe find a better way, then add an example string and the unicodedata.category details, and explain more about what you are doing. Commented May 8, 2018 at 14:06
  • In general, some_string += some_other_string in a loop will be slow. It has quadratic complexity (although the interpreter will try to optimize it). You should refactor it to use a list with .append, then ''.join at the end. Commented May 8, 2018 at 17:10
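
The two suggestions in the comments above (a plain else branch, and a list joined once at the end) could be combined roughly as follows. This is a minimal sketch that assumes Python 3; the names cleanup_join and KEEP are made up for illustration and do not appear in the question.

```python
import unicodedata

# Control characters that should be kept as they are (hypothetical name).
KEEP = {'\t', '\r', '\n'}

def cleanup_join(s):
    """Replace every character in a "C*" category with a space, except those in KEEP."""
    parts = []
    for char in s:
        if unicodedata.category(char)[0] != 'C' or char in KEEP:
            parts.append(char)   # ordinary character or an allowed control character
        else:
            parts.append(' ')    # any other control/format/unassigned character
    # Join once at the end instead of growing a string inside the loop.
    return ''.join(parts)

print(repr(cleanup_join('a\x00b\tc\n')))
```

The behavior is meant to match the original routine; only the way the result is accumulated changes.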

2 Answers

If I understand your task correctly, you want to replace all Unicode control characters with spaces, except \t, \n and \r.

Here's how to do this more efficiently with regular expressions instead of loops.

import re

# make a string of all unicode control characters 
# EXCEPT \t - chr(9), \n - chr(10) and \r - chr(13)
control_chars = ''.join(map(unichr, range(0,9) + \
                            range(11,13) + \
                            range(14,32) + \
                            range(127,160)))

# build your regular expression
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup(s):
    # substitute all control characters in the regex 
    # with spaces and return the new string
    return cc_regex.sub(' ', s)

You can control which characters to include or exclude by manipulating the ranges that make up the control_chars variable. Refer to the List of Unicode characters.
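
The code above is written for Python 2 (unichr, and range results concatenated with +). A possible Python 3 port might look like the sketch below. It assumes the same character ranges as the answer; the names _CONTROL_CODEPOINTS and cleanup_regex are hypothetical.

```python
import re

# Code points of the C0/C1 control characters,
# EXCEPT \t - chr(9), \n - chr(10) and \r - chr(13).
# In Python 3, range objects cannot be added, so they are converted to lists first.
_CONTROL_CODEPOINTS = (
    list(range(0, 9))
    + list(range(11, 13))
    + list(range(14, 32))
    + list(range(127, 160))
)
control_chars = ''.join(map(chr, _CONTROL_CODEPOINTS))

# Compile once at import time so each call only performs the substitution.
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup_regex(s):
    # Substitute every matched control character with a space.
    return cc_regex.sub(' ', s)

print(repr(cleanup_regex('a\x0bb\tc\x7f')))
```

As in the original, only category Cc characters are covered; other "C*" categories pass through unchanged.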

EDIT: Timing results.

Just out of curiosity, I ran some timing tests to see which of the three current methods is fastest.

I made three functions: cleanup_op(s), a copy of the OP's code; cleanup_loop(s), which is Cristian Ciupitu's answer; and cleanup_regex(s), which is my code.

Here's what I ran:

from timeit import default_timer as timer

sample = u"this is a string with some characters and \n new lines and \t tabs and \v and other stuff"*1000

start = timer();cleanup_op(sample);end = timer();print end-start
start = timer();cleanup_loop(sample);end = timer();print end-start
start = timer();cleanup_regex(sample);end = timer();print end-start

The results:

cleanup_op finished in about 1.1 seconds

cleanup_loop finished in about 0.02 seconds

cleanup_regex finished in about 0.004 seconds

So either answer is a significant improvement over the original code. I think @CristianCiupitu gives a more elegant and Pythonic answer, while the regex is still faster.
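
A single start/stop measurement like the one above can be noisy. One way to repeat the comparison is with the timeit module, as in this sketch. It assumes Python 3, and the two functions here are simplified stand-ins written for this example, not the exact code from either answer; best_time is a hypothetical helper. Timings depend on the machine and interpreter, so no numbers are shown.

```python
import re
import timeit
import unicodedata

# Stand-in regex version: C0/C1 controls except \t, \n and \r.
_cc_regex = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')

def cleanup_regex(s):
    return _cc_regex.sub(' ', s)

# Stand-in loop version: category check per character, joined once.
def cleanup_loop(s):
    keep = '\t\r\n'
    return ''.join(
        c if unicodedata.category(c)[0] != 'C' or c in keep else ' '
        for c in s
    )

sample = "this is a string with some characters and \n new lines and \t tabs and \v and other stuff" * 1000

def best_time(func, repeat=3, number=5):
    """Return the best observed time per call, in seconds."""
    runs = timeit.repeat(lambda: func(sample), repeat=repeat, number=number)
    return min(runs) / number

if __name__ == '__main__':
    for func in (cleanup_loop, cleanup_regex):
        print('%s: %.6f s per call' % (func.__name__, best_time(func)))
```

Taking the minimum over several repeats is the usual way to reduce the influence of other activity on the machine.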

6 Comments

Even if re.compile has a small cache of the compiled patterns, it's probably best to move the compilation of the regular expression outside the cleanup function, so that this step is not done at every call.
Also if you want to do the same thing as OP, cc_regex.sub('', s) should be replaced with cc_regex.sub(' ', s) (those special characters are converted to spaces, not removed).
@CristianCiupitu Sure, sounds good. I fixed the answer to reflect your suggestions.
Your code deals only with the C0 and C1 control characters (category Cc). You're missing Cf (format control character), Cs (surrogate code point), Co (private-use character) and Cn (reserved unassigned code point or a noncharacter).
I ran some %timeit benchmarks too, on an Intel i7-3770 CPU, for a 3800-character string (200 characters had to be replaced). I changed my code to use the same limited control character set as yours. On python2-2.7.14-10.fc27.x86_64 the regex code takes 69.4 µs and the translation code 414 µs. On python3-3.6.5-1.fc27.x86_64 the results were 56 µs ± 186 ns and 4.32 µs ± 4.62 ns.
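
To illustrate the earlier comment about the Cf, Cs, Co and Cn categories, the short sketch below prints the category of one sample character from each. It assumes Python 3; the sample code points were picked for this example and are not taken from the discussion.

```python
import unicodedata

# One example character per "C*" category (chosen for illustration).
examples = {
    'Cc': '\x07',    # BEL, a C0 control character
    'Cf': '\u200b',  # ZERO WIDTH SPACE, a format character
    'Cs': '\ud800',  # a lone surrogate code point
    'Co': '\ue000',  # first private-use code point
    'Cn': '\uffff',  # a noncharacter
}

for expected, char in examples.items():
    # Print the code point rather than the character itself,
    # since most of these are not printable.
    print('U+%04X -> %s' % (ord(char), unicodedata.category(char)))
```

Only the first of these falls inside the ranges used by the regex answer, which is why the two approaches can give different results on such input.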

If I understand correctly, you want to convert all the Unicode control characters to spaces, except the tab, carriage return and newline. You can use str.translate for this:

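import sys
import unicodedata

# code points that must be left untouched
good = map(ord, '\t\r\n')
# map every other code point in a "C*" category to a space;
# the + 1 makes the range include sys.maxunicode itself
TBL_CONTROL_TO_SPACE = {
    i: u' '
    for i in xrange(sys.maxunicode + 1)
    if unicodedata.category(unichr(i))[0] == "C" and i not in good
}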

def cleanup(s):
    return s.translate(TBL_CONTROL_TO_SPACE)
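
The snippet above targets Python 2 (xrange, unichr). Under Python 3 the same idea might be written as in the sketch below; the names GOOD and cleanup_translate are hypothetical. Building the table visits every code point once, so it is done a single time at import.

```python
import sys
import unicodedata

# Code points that must be left untouched.
GOOD = {ord(c) for c in '\t\r\n'}

# Translation table: every code point in a "C*" category, except GOOD, maps to a space.
# range(sys.maxunicode + 1) includes the last code point as well.
TBL_CONTROL_TO_SPACE = {
    i: ' '
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i))[0] == 'C' and i not in GOOD
}

def cleanup_translate(s):
    # str.translate looks each character up in the table and replaces it if present.
    return s.translate(TBL_CONTROL_TO_SPACE)

print(repr(cleanup_translate('a\x00b\tc\u200bd')))
```

Because most code points are unassigned or private-use, the table is large; restricting it to the ranges actually expected in the input would make it smaller, at the cost of no longer matching the original routine exactly.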
