1

I have a lot of substitution patterns which I need for text cleaning. I load the data from a database and compile the regular expressions before for performance reasons. Unfortunately with my approach only the last assignment of the variable "text" seems to be valid, while the others appear to be overwritten:

# -*- coding: utf-8 -*-
import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the the name as the corresponding variable name, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_199',str(row[0]))
        print text

extract_from_db()

Does anyone know how to solve this in a working, elegant way? Or do I have to pound this through huge if/elif control structure?

4 Answers 4

7

You keep replacing the last result with a replacement on str(row[0]). Use text instead to accumulate substitutions:

text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_1.sub(r'REPLACE_1', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_199', text)

You'd be better of using an actual list instead:

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]

and use those in a loop:

text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)

or using functools.partial() to simplify the loop a bit further:

from functools import partial

REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]

and the loop:

text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)

or using the above list of patterns wrapped in partial() objects, and reduce():

text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0])
Sign up to request clarification or add additional context in comments.

3 Comments

Thank you mate, actually I was looking excactly for something what you posted here!
Hello Martijn, I tried the version with the first loop: For some reason, it replaces ~15 occurences, the next ~13 are not replaced, and then it starts the same way over and over. Very strange behaviour :/
My fault, I had the print statement in the wrong for-loop - now it works well!
1

Your approach is fine; however, on every line, you are applying the regex to the original string. You need to apply it to the result of the previous line, i.e.:

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the the name as the corresponding variable name, but as a string of course
        # This one stays the same - initialize from the row
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        # For these, route text back into it
        text = REPLACE_2.sub(r'REPLACE_2',text)
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',text)
        text = REPLACE_100.sub(r'REPLACE_100',text)
        print text

1 Comment

Hello, you need to indent your code in the source of your post.
1

It looks like what you need is:

    text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
    text = REPLACE_2.sub(r'REPLACE_1',text)
    # ..
    text = REPLACE_99.sub(r'REPLACE_99',text)
    text = REPLACE_100.sub(r'REPLACE_199',text)

Comments

1

Might I suggest building a list of patterns and their replacement values, then iterating across it? Then you don't have to modify the function every time you want to update the patterns:

import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
# ..
    (re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]

def extract_from_db():
    for row in cursor:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)

        print text

extract_from_db()

3 Comments

For some reason, it replaces ~15 occurences, the next ~13 are not replaced, and then it starts the same way over and over. Very strange behaviour :/
"print text" needs to be in the second, nested for-loop, then it works.
Thanks for pointing that out. I typed it in - I didn't actually run it. :-)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.