0

While working on a script to correct formatting errors from documents produced by OCR, I ran into an issue where depending on which loop I run first, the program runs about 80% slower.

Here is a simplified version of the code. I have the following loops to check for uppercase errors (e.g. "posSible"):

def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        fixedLine = ''
        for word in line.split():
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' ' 
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line.split()[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' ' 
        fixedText += fixedLine  
    
    return fixedText

The following loop checks for and removes headings:

def headingsFix(doc):
    fixedText = ''
    count = 0
    stopWords = ['on', 'and', 'of', 'as', 'for']
    for line in doc.split('\n'):
        tokenLine = ''
        for word in line.split():
            if word not in stopWords:
                tokenLine += word + " "
        if tokenLine.istitle() and (
            not line.endswith('.') 
            and not line.endswith(',')
            and not line.endswith(')')
            and not line.endswith(';')
            and not line.endswith(':')
        ):

            count += 1
        else:
            fixedText += line
            
    return fixedText

It's the loop in the fixedUppercase function that massively slows down. If I run any other function or loop prior to that one or If I run that one first or remove it entirely, the program is quick. Same behavior if both loops are part of one function.

I thought maybe another function or loop was causing the error by expanding the length of the document, but a check with len() shows same doc size either way.

11
  • Could you be a bit more explicit about what you mean by "...massively slows down if I run any other function or loop prior to that one"? Could you perhaps give a specific example of running another function first that makes that one measurably slower? Commented Jun 13, 2021 at 0:26
  • You'll find some helpful guidelines for creating a minimal reproducible example here: stackoverflow.com/help/minimal-reproducible-example To give you some idea of the problems facing anyone trying to answer your question: 1. we don't know your input - does it have long lines? many lines? many words? 2. we don't know how to run your code to reproduce your problem - what other function should we run prior, and how should we do so, which inputs and outputs go where? Commented Jun 13, 2021 at 0:35
  • Hi, I did give an example: if I run headingsFix() before fixUppercase(), the latter takes about 80% longer to complete. P.S. I know I could accomplish the same with a regular expression, I'm just curious as to why the speed diff. Commented Jun 13, 2021 at 0:35
  • Hi Weeble, the input is string. You can try it with any string. Obv. the smaller the string the smaller the time and diff. With 300k words, it's several minutes. Just pass a string to one function then another, time it, do it again with order reversed. Commented Jun 13, 2021 at 0:40
  • without running it, this is probably slow because it does a lot of string building, which is very slow in Python because strings are immutable and so it must create a new one each time! you can use timeit to compare the speeds of functions in Python! Commented Jun 13, 2021 at 0:43

2 Answers 2

2

headingsFix strips out all the line endings, which you presumably did not intend. However, your question is about why changing the order of transformations results in slower execution, so I'll not discuss fixing that here.

fixUppercase is extremely inefficient at handling lines with many words. It repeatedly calls line.split() over and over again on the entire book-length string. That isn't terribly slow if each line has maybe a dozen words, but it gets extremely slow if you have one enormous line with tens of thousands of words. I found your program runs vastly faster with this change to only split each line once. (I note that I can't say whether your program is correct as it stands, just that this change should have the same behaviour while being a lot faster. I'm afraid I don't particularly understand why it's comparing each word to see if it's the same as the last word on the line.)

def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        line_words = line.split()   # Split the line once here.
        fixedLine = ''
        for word in line_words:
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line_words[-1]:   # No need to split here.
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line_words[-1]:   # No need to split here.
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line_words[-1]:   # No need to split here.
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine

    return fixedText

Here you can see my timings. I download 'Alice in Wonderland' from Project Gutenberg to use as test input.

annette@DISSONANCE:~/scratch$ wget 'https://www.gutenberg.org/files/11/11-0.txt' -O alice.txt
--2021-06-13 02:06:33--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174313 (170K) [text/plain]
Saving to: ‘alice.txt’

alice.txt                                               100%[============================================================================================================================>] 170.23K   175KB/s    in 1.0s

2021-06-13 02:06:35 (175 KB/s) - ‘alice.txt’ saved [174313/174313]

annette@DISSONANCE:~/scratch$ time python slow_ocr_cleanup.py --headings-last < alice.txt > alice1.txt

real    0m0.065s
user    0m0.047s
sys     0m0.016s
annette@DISSONANCE:~/scratch$ time python slow_ocr_cleanup.py --headings-first < alice.txt > alice2.txt
^CTraceback (most recent call last):
  File "slow_ocr_cleanup.py", line 117, in <module>
    main()
  File "slow_ocr_cleanup.py", line 106, in main
    doc = fixUppercase(doc)
  File "slow_ocr_cleanup.py", line 17, in fixUppercase
    if word == line.split()[-1]:
KeyboardInterrupt

real    0m16.856s
user    0m8.438s
sys     0m8.375s
annette@DISSONANCE:~/scratch!1$ time python slow_ocr_cleanup.py --fixed < alice.txt > alice3.txt

real    0m0.058s
user    0m0.047s
sys     0m0.000s

As you can see, running without the fix was taking a long time so I stopped it early.

Here's the full test program:

import sys


def fixUppercase(doc):
    fixedText = ''
    for line in doc.split('\n'):
        fixedLine = ''
        for word in line.split():
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line.split()[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line.split()[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine

    return fixedText


def fixUppercaseFast(doc):
    fixedText = ''
    for line in doc.split('\n'):
        line_words = line.split()
        fixedLine = ''
        for word in line_words:
            if (
                word.isalpha()
                and (
                     word.isupper()
                     or word.istitle()
                     or word.islower()
                )
            ):
                if word == line_words[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
            elif (
                word.isalpha()
            ):
                lower = word.lower()
                if word == line_words[-1]:
                    fixedLine += lower + '\n'
                else:
                    fixedLine += lower + ' '
            else:
                if word == line_words[-1]:
                    fixedLine += word + '\n'
                else:
                    fixedLine += word + ' '
        fixedText += fixedLine

    return fixedText


def headingsFix(doc):
    fixedText = ''
    count = 0
    stopWords = ['on', 'and', 'of', 'as', 'for']
    for line in doc.split('\n'):
        tokenLine = ''
        for word in line.split():
            if word not in stopWords:
                tokenLine += word + " "
        if tokenLine.istitle() and (
            not line.endswith('.')
            and not line.endswith(',')
            and not line.endswith(')')
            and not line.endswith(';')
            and not line.endswith(':')
        ):

            count += 1
        else:
            fixedText += line

    return fixedText


def main():
    doc = sys.stdin.read()
    if '--headings-last' in sys.argv[1:]:
        doc = fixUppercase(doc)
        doc = headingsFix(doc)
    elif '--headings-first' in sys.argv[1:]:
        doc = headingsFix(doc)
        doc = fixUppercase(doc)
    elif '--fixed' in sys.argv[1:]:
        doc = headingsFix(doc)
        doc = fixUppercaseFast(doc)
    else:
        print('Specify --headings-last, --headings-first or --fixed', file=sys.stderr)
        sys.exit(1)
    print(doc, end='')


if __name__ == '__main__':
    main()

You'll note that the string concatenation isn't the source of the problem, although it's still inadvisable. In some some versions of Python there's an optimisation that makes it fast, but in general you can't rely on it always to work. This question and answer explain the problem in more detail, but broadly speaking, repeatedly using + or += to build larger and larger strings in a loop is inefficient because every time the whole string needs to be copied, and it's getting longer and longer as the loop goes on. It's a notorious pitfall known as Schlemiel the Painter's Algorithm. Better alternatives are to use str.join or io.StringIO.

Sign up to request clarification or add additional context in comments.

9 Comments

Thanks for the suggestion. It never occurred to me that every iteration would call the method again if it was in the header. (I would give you an upvote, but my rep is too low.) Q: how does this relate to the walrus operator? It's not reinstantiating the var on each iteration?
@theotechne The walrus operator is a name for the assignment expression introduced in Python 3.8 using := and I'm afraid I don't know how it relates to your question. I am afraid I don't really understand what you mean here when you say "in the header" or "reinstantiating the var on each iteration"... are you asking about splitting the words in a line, or about string concatenation, or something else?
Oh, wait, you were referring to the 'line.split()'s in the conditionals (if/elif/else)? Sorry, I guess that's another reason to not write the code like that. I lost track of those and thought you were referring to assigning line.split() to a var in first block instead of the second block loop header. (Also thanks for showing me what a good reproducible example looks like - still learning.)
Ah, sorry I wasn't clear. Yes, the calls to line.split() inside the body of the loop are the ones that are being called over and over.
I upvoted your answer but: "In this case CPython has an optimisation it can apply when you're appending to a string" seems to be discouraged per https://stackoverflow.com/questions/35787022/cython-string-concatenation-is-super-slow-what-else-does-it-do-poorly/35791033#35791033
|
1

Your fixUppercase() function basically does this:

  • change all alphabetical words that are not all lowercase, a proper title or all uppercase to all lowercase

However, you assume a document would only contain \n and as whitespace, so tabs (for example) would break your code. You could instead break up the document in space metacharacters and strings of the rest using regular expressions.

Your main problem is caused by the inefficiency of fixedUpper, so a solution would be to fix that.

This would do the same, but more efficiently:

import re

example="""
This is an example.

It Has:\ta fEw examples of thIngs that should be FIXED and CHANGED!

Don't touch this: a123B or this_Is_finE

Did it woRk?
"""


def fixedUpper(doc):
    p = re.compile(r'\s|([^\s]+)')
    # go through all the matches and join them back together into a string when done
    return ''.join(
        # lowercase for any alphabetic substring that does not contain whitespace and isn't a title or all uppercase
        m.group(1).lower() if not (m.group(1) is None or m.group(1).istitle() or m.group(1).isupper()) and m.group(1).isalpha()
        # in all other cases, just leave the match untouched
        else m.group(0)
        for m in p.finditer(doc)
    )


print(repr(fixedUpper(example)))

Output (note how it preserved the whitespace):

"\nThis is an example.\n\nIt Has:\ta few examples of things that should be FIXED and CHANGED!\n\nDon't touch this: a123B or this_Is_finE\n\nDid it woRk?\n"

Also note that this still has the problem your code does as well: if there's interpunction at the end of a word, it's not fixed, like woRk?

This is better:

def fixedUpper(doc):
    p = re.compile(r'\s|((\w+)([^\w\s]*))')
    return ''.join(
        m.group(1).lower()
        if not (m.group(2) is None or m.group(2).istitle() or m.group(2).isupper()) and m.group(2).isalpha()
        else m.group(0)
        for m in p.finditer(doc)
    )

1 Comment

Thanks for this regex example. It's more robust than what I came up with. (Some of the loop logic issue you mention don't apply with my specific project, but important to keep in mind for broader applications, so I appreciate it.)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.