Making Python RegEx use variables for string expressions

Question

I have a .csv file with the regular expression patterns that I want to match as well as the replacement patterns that I want. Some are extremely simple, such as "." -> "" or "," -> "".

When I run the following code, however, it doesn't seem to recognize the variables and the pattern is never matched.

    f = open('normalize_patterns.csv', 'rU')
    c = csv.DictReader(f)
    for row in c:
        v = re.sub(row['Pattern'], row['Replacement'], v)

Afterwards, v is never changed and I can't seem to find out why. When I run the simple case of

    v = re.sub("\.", "", v)
    v = re.sub(",", "", v)

however, all the periods and commas are removed. Any help on the issue would be amazing. Thank you in advance! (I am pretty sure that the .csv file is formatted correctly. I've run it with just the "." and "" case and it still does not work for some reason.)

Edit: Here are the outputs of printing row. (Thanks David!)

{'Pattern': "r'(?i)&'", 'ID': '1', 'Replacement': "'and'"}
{'Pattern': "r'(?i)\\bAssoc\\b\\.?'", 'ID': '2', 'Replacement': "'Association'"}
{'Pattern': "r'(?i)\\bInc\\b\\.?'", 'ID': '3', 'Replacement': "'Inc.'"}
{'Pattern': "r'(?i)\\b(L\\.?){2}P\\.?'", 'ID': '4', 'Replacement': "''"}
{'Pattern': "r'(?i)\\bUniv\\b\\.?'", 'ID': '5', 'Replacement': "'University'"}
{'Pattern': "r'(?i)\\bCorp\\b\\.?'", 'ID': '6', 'Replacement': "'Corporation'"}
{'Pattern': "r'(?i)\\bAssn\\b\\.?'", 'ID': '7', 'Replacement': "'Association'"}
{'Pattern': "r'(?i)\\bUnivesity\\b'", 'ID': '8', 'Replacement': "'University'"}
{'Pattern': "r'(?i)\\bIntl\\b\\.?'", 'ID': '9', 'Replacement': "'International'"}
{'Pattern': "r'(?i)\\bInst\\b\\.?'", 'ID': '10', 'Replacement': "'Institute'"}
{'Pattern': "r'(?i)L\\.L\\.C\\.'", 'ID': '11', 'Replacement': "'LLC'"} 
{'Pattern': "r'(?i)Chtd'", 'ID': '12', 'Replacement': "'Chartered'"}
{'Pattern': "r'(?i)Mfg\\b\\.?'", 'ID': '13', 'Replacement': "'Manufacturing'"}
{'Pattern': 'r"Nat\'l"', 'ID': '14', 'Replacement': "'National'"}
{'Pattern': "r'(?i)Flordia'", 'ID': '15', 'Replacement': "'Florida'"}
{'Pattern': "r'(?i)\\bLtd\\b\\.?'", 'ID': '16', 'Replacement': "'Ltd.'"}
{'Pattern': "r'(?i)\\bCo\\b\\.?'", 'ID': '17', 'Replacement': "'Company'"}
{'Pattern': "r'(?i)\\bDept\\b\\.?i\\'", 'ID': '18', 'Replacement': "'Department'"}
{'Pattern': "r'(?i)Califronia'", 'ID': '19', 'Replacement': "'California'"}
{'Pattern': "r'(?i)\\bJohn\\bHopkins\\b'", 'ID': '20', 'Replacement': "'Johns Hopkins'"}
{'Pattern': "r'(?i)\\bOrg\\b\\.?'", 'ID': '21', 'Replacement': "'Organization'"}
{'Pattern': "r'(?i)^[T]he\\s'", 'ID': '22', 'Replacement': "''"}
{'Pattern': "r'(?i)\\bAuth\\b\\.?'", 'ID': '23', 'Replacement': "'Authority'"}
{'Pattern': "r'.'", 'ID': '24', 'Replacement': "''"}
{'Pattern': "r','", 'ID': '25', 'Replacement': "''"}
{'Pattern': "r'(?i)\\s+'", 'ID': '0', 'Replacement': "''"}

And here are a few lines of the csv file (Opened in TextMate)

0,r'(?i)\s+',''
1,r'(?i)&','and'
2,r'(?i)\bAssoc\b\.?','Association'
3,r'(?i)\bInc\b\.?','Inc.'

There are two issues with your code. First there is an extra }, second I think you mean row["Replacement"] instead of row[replacement].
– Howard, Commented Jul 7, 2011 at 17:17
Seconding, CPP, to help you we really need to know what the values of row['Pattern'], row['Replacement'], and v are.
– senderle, Commented Jul 7, 2011 at 17:29
Sorry! Yeah I meant row["Replacement"], typo (I had a string formatted "%(replacement)s" % {"replacement": row["Replacement"]} before but I just took it out for readability). (Fixed Now)
– Kevin Shin, Commented Jul 7, 2011 at 17:36
@Kevin: Don't post the input in a comment, add it to the original post by editing.
– Jim Garrison, Commented Jul 7, 2011 at 17:48

Andrew Clark · Accepted Answer · 2011-07-07 18:58:47Z

2

Your issue is that your pattern values are not actually the regex patterns you want: each regex pattern is wrapped in an additional string.

For example, in your dictionary you have the value "r'.'", which you are using as a pattern. Your code will run re.sub("r'.'", "", v), which probably isn't what you want:

>>> re.sub("r'.'", "", "This . won't match")
"This . won't match"
>>> re.sub("r'.'", "", "This r'x' will match")
'This  will match'

To fix this you should go back to where you are adding the regex to the dictionary and stop doing whatever is causing the string wrapping. It might be something like row['Pattern'] = repr(regex).

If you need to keep the dictionary the same for some reason, be very careful with eval: if the strings come from an untrusted source, eval is a big security risk. Use ast.literal_eval instead.
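As a rough illustration of the ast.literal_eval route, here is a minimal sketch. The rows list is hypothetical and only copies two entries from the output printed in the question; the helper names unwrap and normalize are made up for this example. It also unwraps the Replacement values, since the printed rows show them quoted in the same way.

```python
import ast
import re

# Hypothetical rows, shaped like the output printed in the question.
rows = [
    {"ID": "1", "Pattern": "r'(?i)&'", "Replacement": "'and'"},
    {"ID": "2", "Pattern": "r'(?i)\\bAssoc\\b\\.?'", "Replacement": "'Association'"},
]


def unwrap(literal_text):
    """Turn the text of a Python string literal into the string it denotes."""
    # literal_eval only accepts literals, so it cannot run arbitrary code.
    return ast.literal_eval(literal_text)


def normalize(value, rows):
    """Apply every pattern/replacement pair to value, in row order."""
    for row in rows:
        value = re.sub(unwrap(row["Pattern"]), unwrap(row["Replacement"]), value)
    return value


result = normalize("Smith & Jones Assoc.", rows)
print(result)
```

ast.literal_eval raises ValueError or SyntaxError when the text is not a valid literal, so malformed rows show up immediately instead of silently failing to match. Row 18 in the printed output looks like a candidate, since its pattern ends in \'.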

edited Jul 7, 2011 at 18:58

answered Jul 7, 2011 at 18:52

Andrew Clark


1 Comment

Kevin Shin Over a year ago

Definitely makes sense. eval made some errors regarding '\.', since the backslash is read as a literal and is translated to '\\.'. Thanks Andrew!

Jingshao Chen · Accepted Answer · 2011-07-07 18:41:19Z

2

If you remove the r'' around the pattern, it will work.

So the pattern that matches . should be as simple as '\.' instead of "r'\.'"

The problem is that the r in your pattern is taken as a literal r instead of its raw-string meaning.

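So you can also try: v = re.sub(eval(row['Pattern']), eval(row['Replacement']), v). The Replacement values in your output are wrapped in quotes in the same way, so they need the same treatment. Only do this if you trust the contents of the file.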
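For the first suggestion (dropping the r'' wrapper in the file itself), a minimal sketch might look like the following. It assumes the CSV has a header row with the ID, Pattern and Replacement columns seen in the question, and that the values are stored bare, with no r prefix and no surrounding quotes. CSV_TEXT, load_rules and normalize are hypothetical names used only for illustration.

```python
import csv
import io
import re

# Hypothetical CSV content: same columns as in the question, but with bare
# patterns and replacements (no r'...' wrapper, no quotes around the values).
CSV_TEXT = (
    "ID,Pattern,Replacement\n"
    "1,(?i)&,and\n"
    "2,(?i)\\bAssoc\\b\\.?,Association\n"
    "24,\\.,\n"
)


def load_rules(fileobj):
    """Compile each Pattern once and pair it with its Replacement."""
    return [
        (re.compile(row["Pattern"]), row["Replacement"])
        for row in csv.DictReader(fileobj)
    ]


def normalize(value, rules):
    """Apply every compiled rule to value, in file order."""
    for pattern, replacement in rules:
        value = pattern.sub(replacement, value)
    return value


rules = load_rules(io.StringIO(CSV_TEXT))
result = normalize("Smith & Jones Assoc. Inc.", rules)
print(result)
```

Text read from a file is never subject to Python's string-literal escaping, so the backslashes arrive in the regex engine unchanged and no r prefix is needed. A pattern that itself contains a comma would have to be wrapped in double quotes in the CSV, which is the csv module's default quote character.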

edited Jul 7, 2011 at 18:41

answered Jul 7, 2011 at 18:28

Jingshao Chen


4 Comments

Kevin Shin Over a year ago

Sorry, doesn't seem to have done anything: I think the r is optional with python regex.

Jingshao Chen Over a year ago

r means raw string. If you have r in the value of the pattern, it will be a literal r instead of telling Python that what follows is a raw string.

Kevin Shin Over a year ago

So for the inputs outputs I've changed to: 0,'(?i)\s+','' 1,'(?i)&','and' 2,'(?i)\bAssoc\b\.?','Association' Was this what you meant by remove the 'r' character?

Jingshao Chen Over a year ago

See my edit. You can put row['Pattern'] into an eval to make use of the r prefix
