Making a Python loop faster

Question

Can this little routine be made faster? With the elif's it makes a comprehension get out of hand, but maybe I haven't tried it the right way.

import unicodedata

def cleanup(s):
    strng = ''
    good = ['\t', '\r', '\n']
    for char in s:        
        if unicodedata.category(char)[0]!="C":
            strng += char
        elif char in good:
            strng += char
        elif char not in good:
            strng += ' '
    return strng

At least you could speed it up a lot by changing elif char not in good: to else:. If you want someone to find a better way, add an example string and explain unicodedata.category and what you are doing in more detail.
– gaback, Commented May 8, 2018 at 14:06
In general, some_string += some_other_string in a loop will be slow. It has quadratic complexity (although the interpreter will try to optimize it), so you should refactor it to use a list with .append and then ''.join at the end.
– juanpa.arrivillaga, Commented May 8, 2018 at 17:10
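The list-and-join refactoring suggested in the comment above could look roughly like this. It is a minimal sketch that assumes Python 3; the names KEEP and cleanup_join are made up for illustration and do not appear in the question.

```python
import unicodedata

# Control characters that should be left untouched (same set as the question's `good`).
KEEP = frozenset("\t\r\n")


def cleanup_join(s):
    """Replace every category-C character except tab, CR and LF with a space."""
    parts = []
    for char in s:
        if char in KEEP or unicodedata.category(char)[0] != "C":
            parts.append(char)   # ordinary character, or one of the allowed controls
        else:
            parts.append(" ")    # any other control/format/unassigned character
    return "".join(parts)        # a single join instead of repeated string +=
```

The two keep branches of the original are merged into one condition, and the final elif becomes a plain else, as the first comment recommends.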

keda · Accepted Answer · 2018-05-08 22:19:04Z

1

If I understand your task correctly, you want to replace all Unicode control characters with spaces, except \t, \n and \r.

Here's how to do this more efficiently with regular expressions instead of loops.

import re

# make a string of all unicode control characters 
# EXCEPT \t - chr(9), \n - chr(10) and \r - chr(13)
control_chars = ''.join(map(unichr, range(0,9) + \
                            range(11,13) + \
                            range(14,32) + \
                            range(127,160)))

# build your regular expression
cc_regex = re.compile('[%s]' % re.escape(control_chars))

def cleanup(s):
    # substitute all control characters in the regex 
    # with spaces and return the new string
    return cc_regex.sub(' ', s)

You can control which characters to include or exclude by manipulating the ranges that make up the control_chars variable. Refer to the List of Unicode characters.
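The code above targets Python 2 (unichr, and range objects joined with +). A possible Python 3 port is sketched below; the name cleanup_regex is borrowed from the timing section, and the use of itertools.chain is an assumption of this sketch, not part of the original answer.

```python
import re
from itertools import chain

# Code points of the C0/C1 control characters, skipping
# \t (9), \n (10) and \r (13), exactly as in the ranges above.
control_code_points = chain(range(0, 9), range(11, 13), range(14, 32), range(127, 160))
control_chars = "".join(map(chr, control_code_points))

# Compile once at import time so every call reuses the same pattern.
cc_regex = re.compile("[%s]" % re.escape(control_chars))


def cleanup_regex(s):
    """Replace the listed control characters with spaces."""
    return cc_regex.sub(" ", s)
```

In Python 3, range objects cannot be concatenated with +, which is why the ranges are chained instead.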

EDIT: Timing results.

Just out of curiosity, I ran some timing tests to see which of the three current methods is fastest.

I made three functions: cleanup_op(s), a copy of the OP's code; cleanup_loop(s), which is Cristian Ciupitu's answer; and cleanup_regex(s), which is my code.

Here's what I ran:

from timeit import default_timer as timer

sample = u"this is a string with some characters and \n new lines and \t tabs and \v and other stuff"*1000

start = timer();cleanup_op(sample);end = timer();print end-start
start = timer();cleanup_loop(sample);end = timer();print end-start
start = timer();cleanup_regex(sample);end = timer();print end-start

The results:

cleanup_op finished in about 1.1 seconds

cleanup_loop finished in about 0.02 seconds

cleanup_regex finished in about 0.004 seconds

So, either one of the answers is a significant improvement over the original code. I think @CristianCiupitu gives a more elegant and Pythonic answer, while the regex is still faster.
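Single-run timings like these can be noisy. A sketch of a more repeatable measurement with timeit.repeat, assuming Python 3, might look like the following; best_time is a hypothetical helper, and str.upper is only a stand-in for the three cleanup functions.

```python
import timeit


def best_time(func, arg, repeat=5, number=10):
    """Return the best per-call time in seconds over `repeat` rounds of `number` calls."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(timings) / number


sample = ("this is a string with some characters and \n new lines "
          "and \t tabs and \v and other stuff") * 1000

# str.upper is a placeholder; pass cleanup_op, cleanup_loop or cleanup_regex instead.
print("%.6f s per call" % best_time(str.upper, sample))
```

Taking the minimum over several rounds reduces the influence of other activity on the machine.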

edited May 8, 2018 at 22:19

answered May 8, 2018 at 16:56

keda


6 Comments

Cristian Ciupitu Over a year ago

Even if re.compile has a small cache of the compiled patterns, it's probably best to move the compilation of the regular expression outside the cleanup function, so that this step is not done at every call.

Cristian Ciupitu Over a year ago

Also if you want to do the same thing as OP, cc_regex.sub('', s) should be replaced with cc_regex.sub(' ', s) (those special characters are converted to spaces, not removed).

keda Over a year ago

@CristianCiupitu Sure, sounds good. I fixed the answer to reflect your suggestions.

Cristian Ciupitu Over a year ago

Your code deals only with control characters from the C0 and C1 sets (category Cc). You're missing Cf (format control character), Cs (surrogate code point), Co (private-use character) and Cn (reserved unassigned code point or a noncharacter).
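To illustrate the categories named in this comment, here is a small sketch (Python 3 assumed) that looks up one example code point per category. The sample code points are chosen for illustration only.

```python
import unicodedata

# One example code point per "C*" general category.
samples = {
    "\x07": "Cc",    # BEL, a C0 control character
    "\u200b": "Cf",  # ZERO WIDTH SPACE, a format character
    "\ud800": "Cs",  # a lone surrogate code point
    "\ue000": "Co",  # first private-use code point
    "\uffff": "Cn",  # a noncharacter
}

for char, expected in samples.items():
    # Print the code point number rather than the character itself,
    # since a lone surrogate cannot be encoded for output.
    print("U+%04X" % ord(char), unicodedata.category(char), expected)
```

Only the first of these falls inside the ranges used to build control_chars in the answer.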

Cristian Ciupitu Over a year ago

I ran some %timeit benchmarks too, on an Intel i7-3770 CPU, for a 3800-character string (200 characters had to be replaced). I changed my code to use the same limited set of control characters as yours. On python2-2.7.14-10.fc27.x86_64 the regex code takes 69.4 µs and the translation code 414 µs. On python3-3.6.5-1.fc27.x86_64 the results were 56 µs ± 186 ns and 4.32 µs ± 4.62 ns, respectively.


Cristian Ciupitu · Accepted Answer · 2018-05-08 21:13:06Z

0

If I understand correctly you want to convert all the Unicode control characters to space, except the tab, carriage return and new line. You can use str.translate for this:

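import sys
import unicodedata

# code points to leave untouched
good = map(ord, '\t\r\n')
# map every other code point in a "C*" category to a space
TBL_CONTROL_TO_SPACE = {
    i: u' '
    for i in xrange(sys.maxunicode + 1)
    if unicodedata.category(unichr(i))[0] == "C" and i not in good
}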

def cleanup(s):
    return s.translate(TBL_CONTROL_TO_SPACE)
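The answer's code is written for Python 2 (xrange, unichr). A possible Python 3 equivalent is sketched below; the names KEEP and cleanup_translate are invented for this sketch.

```python
import sys
import unicodedata

# Code points to leave untouched: tab, carriage return, line feed.
KEEP = frozenset(map(ord, "\t\r\n"))

# Map every code point in a "C*" category (except those in KEEP) to a space.
# Built once at import time; str.translate then does the work in C.
TBL_CONTROL_TO_SPACE = {
    i: " "
    for i in range(sys.maxunicode + 1)
    if i not in KEEP and unicodedata.category(chr(i))[0] == "C"
}


def cleanup_translate(s):
    """Replace control, format, surrogate, private-use and unassigned characters with spaces."""
    return s.translate(TBL_CONTROL_TO_SPACE)
```

Because most code points are unassigned (category Cn), the table is large and takes a noticeable moment to build, so it should be created once rather than on every call.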

answered May 8, 2018 at 21:13

Cristian Ciupitu
