List of compiled regexes in Python

Question

I have a lot of substitution patterns that I need for text cleaning. I load the data from a database and compile the regular expressions up front for performance reasons. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the earlier ones appear to be overwritten:

# -*- coding: utf-8 -*-
import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2',str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_100',str(row[0]))
        print text

extract_from_db()

Does anyone know how to solve this in a working, elegant way? Or do I have to pound this through a huge if/elif control structure?

Martijn Pieters · Accepted Answer · 2014-03-11 16:36:03Z

7

You keep replacing the last result with a replacement on str(row[0]). Use text instead to accumulate substitutions:

text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2', text)
# ..
text = REPLACE_99.sub(r'REPLACE_99', text)
text = REPLACE_100.sub(r'REPLACE_100', text)
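
To see the difference concretely, here is a minimal sketch in Python 3 syntax. The two patterns (`foo` and `bar`) and the sample string are made-up stand-ins, since the real patterns are not shown in the question:

```python
import re

# Hypothetical stand-ins for the real patterns.
REPLACE_1 = re.compile(r'foo')
REPLACE_2 = re.compile(r'bar')

source = 'foo and bar'

# Buggy: every call starts again from the original string,
# so only the last substitution survives.
text = REPLACE_1.sub('REPLACE_1', source)
text = REPLACE_2.sub('REPLACE_2', source)
overwritten = text

# Fixed: feed the previous result into the next substitution.
text = REPLACE_1.sub('REPLACE_1', source)
text = REPLACE_2.sub('REPLACE_2', text)
accumulated = text

print(overwritten)   # foo and REPLACE_2
print(accumulated)   # REPLACE_1 and REPLACE_2
```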

You'd be better off using an actual list instead:

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), r'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), r'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), r'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), r'REPLACE_100'),
]

and use those in a loop:

text = str(row[0])
for pattern, replacement in REPLACEMENTS:
    text = pattern.sub(replacement, text)
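
As a runnable illustration of the list-plus-loop idea, the sketch below wraps the loop in a helper. It uses Python 3 syntax; the function name `clean` and the two example patterns are assumptions for demonstration only:

```python
import re

# Illustrative patterns only; the real ones would go here.
REPLACEMENTS = [
    (re.compile(r'\d+'), 'NUMBER'),  # replace runs of digits
    (re.compile(r'\s+'), ' '),       # collapse runs of whitespace
]

def clean(text):
    """Apply every (pattern, replacement) pair in list order."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text

print(clean('order   42  shipped'))  # order NUMBER shipped
```

Because each pattern sees the output of the previous one, the order of the list can matter when one replacement produces text that a later pattern matches.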

or using functools.partial() to simplify the loop a bit further:

from functools import partial

REPLACEMENTS = [
    partial(re.compile(r'(sample_pattern_1)').sub, r'REPLACE_1'),
    partial(re.compile(r'(sample_pattern_2)').sub, r'REPLACE_2'),
    # ..
    partial(re.compile(r'(sample_pattern_99)').sub, r'REPLACE_99'),
    partial(re.compile(r'(sample_pattern_100)').sub, r'REPLACE_100'),
]

and the loop:

text = str(row[0])
for replacement in REPLACEMENTS:
    text = replacement(text)
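
One side effect of the partial() variant is that the list holds plain callables, so any function that takes a string and returns a string can sit in the same list. A minimal sketch in Python 3 syntax, with illustrative patterns and a hypothetical helper named `clean`:

```python
import re
from functools import partial

# partial(pattern.sub, repl) pre-fills the first argument of
# pattern.sub(repl, string), leaving a callable that takes only the string.
REPLACEMENTS = [
    partial(re.compile(r'\d+').sub, 'NUMBER'),
    partial(re.compile(r'\s+').sub, ' '),
    str.strip,  # a non-regex step works too, since it is also str -> str
]

def clean(text):
    """Run the text through every callable in order."""
    for replacement in REPLACEMENTS:
        text = replacement(text)
    return text

print(clean('  order   42  '))  # order NUMBER
```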

or using the above list of patterns wrapped in partial() objects, and reduce():

text = reduce(lambda txt, repl: repl(txt), REPLACEMENTS, str(row[0]))
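
On Python 3, reduce() is no longer a builtin and has to be imported from functools (the import also works on Python 2.6 and later). A self-contained sketch, again with illustrative patterns and a hypothetical `clean` helper:

```python
import re
from functools import partial, reduce

REPLACEMENTS = [
    partial(re.compile(r'\d+').sub, 'NUMBER'),
    partial(re.compile(r'\s+').sub, ' '),
]

def clean(text):
    # Fold the callables over the text from left to right:
    # the result of one substitution is the input to the next.
    return reduce(lambda txt, repl: repl(txt), REPLACEMENTS, text)

print(clean('a  1'))  # a NUMBER
```

Whether this reads better than the explicit for loop is a matter of taste; both apply the substitutions in the same order.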

edited Mar 11, 2014 at 16:36

answered Mar 11, 2014 at 16:27

Martijn Pieters


3 Comments

royskatt Over a year ago

Thank you mate, this is exactly what I was looking for!

royskatt Over a year ago

Hello Martijn, I tried the version with the first loop: for some reason, it replaces ~15 occurrences, the next ~13 are not replaced, and then the same thing repeats over and over. Very strange behaviour :/

royskatt Over a year ago

My fault, I had the print statement in the wrong for-loop - now it works well!

Corley Brigman · Accepted Answer · 2014-03-11 16:21:57Z

1

Your approach is fine; however, on every line, you are applying the regex to the original string. You need to apply it to the result of the previous line, i.e.:

def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        # This one stays the same - initialize from the row
        text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
        # For these, route text back into it
        text = REPLACE_2.sub(r'REPLACE_2',text)
        # ..
        text = REPLACE_99.sub(r'REPLACE_99',text)
        text = REPLACE_100.sub(r'REPLACE_100',text)
        print text

answered Mar 11, 2014 at 16:21

Corley Brigman


1 Comment

user1054158 Over a year ago

Hello, you need to indent your code in the source of your post.

timrau · Accepted Answer · 2014-03-11 16:23:50Z

1

It looks like what you need is:

text = REPLACE_1.sub(r'REPLACE_1',str(row[0]))
text = REPLACE_2.sub(r'REPLACE_2',text)
# ..
text = REPLACE_99.sub(r'REPLACE_99',text)
text = REPLACE_100.sub(r'REPLACE_100',text)

answered Mar 11, 2014 at 16:23

timrau


Kirk Strauser · Accepted Answer · 2014-03-12 15:40:59Z

1

Might I suggest building a list of patterns and their replacement values, then iterating across it? Then you don't have to modify the function every time you want to update the patterns:

import cx_Oracle
import re

connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")

REPLACEMENTS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
    # ..
    (re.compile(r'(sample_pattern_99)'), 'REPLACE_99'),
    (re.compile(r'(sample_pattern_100)'), 'REPLACE_100'),
]

def extract_from_db():
    for row in cursor:
        text = str(row[0])
        for pattern, replacement in REPLACEMENTS:
            text = pattern.sub(replacement, text)

        print text

extract_from_db()
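
If the rules are kept as plain (pattern, replacement) string pairs, for example in a config file or a table, the list can be built with a comprehension so that each pattern is still compiled only once. The sketch below uses Python 3 syntax; `RAW_RULES`, `clean`, and `extract` are hypothetical names, and `extract` takes any iterable of rows so it can be tried without a database connection:

```python
import re

# Hypothetical raw rules; in practice these could come from anywhere.
RAW_RULES = [
    (r'(sample_pattern_1)', 'REPLACE_1'),
    (r'(sample_pattern_2)', 'REPLACE_2'),
]

# Compile once, up front.
REPLACEMENTS = [(re.compile(pattern), repl) for pattern, repl in RAW_RULES]

def clean(text, replacements=REPLACEMENTS):
    """Apply each compiled pattern to the result of the previous one."""
    for pattern, replacement in replacements:
        text = pattern.sub(replacement, text)
    return text

def extract(rows):
    """Yield the cleaned first column of each row."""
    for row in rows:
        yield clean(str(row[0]))

# A list of tuples stands in for the cursor here.
sample_rows = [('x sample_pattern_1 y', 1), ('sample_pattern_2', 2)]
for line in extract(sample_rows):
    print(line)
```

A cx_Oracle cursor is iterable row by row, so passing it as `rows` should behave like the loop in the answer above.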

edited Mar 12, 2014 at 15:40

answered Mar 11, 2014 at 16:34

Kirk Strauser


3 Comments

royskatt Over a year ago

For some reason, it replaces ~15 occurrences, the next ~13 are not replaced, and then the same thing repeats over and over. Very strange behaviour :/

royskatt Over a year ago

"print text" needs to be in the second, nested for-loop, then it works.

Kirk Strauser Over a year ago

Thanks for pointing that out. I typed it in - I didn't actually run it. :-)
