I have a lot of substitution patterns that I need for text cleaning. I load the data from a database and compile the regular expressions up front for performance reasons. Unfortunately, with my approach only the last assignment to the variable "text" seems to take effect, while the earlier ones appear to be overwritten:
# -*- coding: utf-8 -*-
import cx_Oracle
import re
connection = cx_Oracle.connect("SCHEMA", "passWORD", "TNS")
cursor = connection.cursor()
cursor.execute("""select column_1, column_2
from table""")
# Variables for matching
REPLACE_1 = re.compile(r'(sample_pattern_1)')
REPLACE_2 = re.compile(r'(sample_pattern_2)')
# ..
REPLACE_99 = re.compile(r'(sample_pattern_99)')
REPLACE_100 = re.compile(r'(sample_pattern_100)')
def extract_from_db():
    text = ''
    for row in cursor:
        # sidenote: each substitution text has the same name as the corresponding variable, but as a string of course
        text = REPLACE_1.sub(r'REPLACE_1', str(row[0]))
        text = REPLACE_2.sub(r'REPLACE_2', str(row[0]))
        # ..
        text = REPLACE_99.sub(r'REPLACE_99', str(row[0]))
        text = REPLACE_100.sub(r'REPLACE_100', str(row[0]))
        print text
extract_from_db()
Does anyone know how to solve this in a working, elegant way? Or do I have to pound this through a huge if/elif control structure?
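
For reference, every assignment in the loop starts again from `str(row[0])`, so each result discards the one before it and only the last substitution survives. One possible way around that, as a minimal sketch rather than a definitive fix, is to keep the compiled patterns in a list and feed the output of each substitution into the next. The names `SUBSTITUTIONS` and `clean` are hypothetical, and the two pairs below only stand in for the full set of 100:

```python
import re

# Hypothetical (compiled pattern, replacement) pairs standing in for
# REPLACE_1 .. REPLACE_100; the patterns are compiled once, at import time.
SUBSTITUTIONS = [
    (re.compile(r'(sample_pattern_1)'), 'REPLACE_1'),
    (re.compile(r'(sample_pattern_2)'), 'REPLACE_2'),
]


def clean(value):
    """Apply every substitution in order, feeding each result into the next."""
    text = str(value)
    for pattern, replacement in SUBSTITUTIONS:
        # Substitute on the running text, not on the original value.
        text = pattern.sub(replacement, text)
    return text
```

Inside the cursor loop, each row could then be handled with a single call such as `clean(row[0])`. Because the substitutions run in list order, a pattern that matches the output of an earlier replacement would be applied to it as well, so the ordering of the pairs may matter.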