
I have a .csv file with the regular expression patterns that I want to match as well as the replacement patterns that I want. Some are extremely simple, such as "." -> "" or "," -> "".

When I run the following code, however, it doesn't seem to recognize the variables and the pattern is never matched.

    f = open('normalize_patterns.csv', 'rU')
    c = csv.DictReader(f)
    for row in c:
        v = re.sub(row['Pattern'], row['Replacement'], v)

Afterwards, v is never changed and I can't seem to find out why. When I run the simple case of

    v = re.sub("\.", "", v)
    v = re.sub(",", "", v)

however, all the periods and commas are removed. Any help on the issue would be amazing. Thank you in advance! (I am pretty sure that the .csv file is formatted correctly; I've run it with just the "." and "" case and it still does not work for some reason.)

Edit: Here are the outputs of printing row. (Thanks David!)

{'Pattern': "r'(?i)&'", 'ID': '1', 'Replacement': "'and'"}
{'Pattern': "r'(?i)\\bAssoc\\b\\.?'", 'ID': '2', 'Replacement': "'Association'"}
{'Pattern': "r'(?i)\\bInc\\b\\.?'", 'ID': '3', 'Replacement': "'Inc.'"}
{'Pattern': "r'(?i)\\b(L\\.?){2}P\\.?'", 'ID': '4', 'Replacement': "''"}
{'Pattern': "r'(?i)\\bUniv\\b\\.?'", 'ID': '5', 'Replacement': "'University'"}
{'Pattern': "r'(?i)\\bCorp\\b\\.?'", 'ID': '6', 'Replacement': "'Corporation'"}
{'Pattern': "r'(?i)\\bAssn\\b\\.?'", 'ID': '7', 'Replacement': "'Association'"}
{'Pattern': "r'(?i)\\bUnivesity\\b'", 'ID': '8', 'Replacement': "'University'"}
{'Pattern': "r'(?i)\\bIntl\\b\\.?'", 'ID': '9', 'Replacement': "'International'"}
{'Pattern': "r'(?i)\\bInst\\b\\.?'", 'ID': '10', 'Replacement': "'Institute'"}
{'Pattern': "r'(?i)L\\.L\\.C\\.'", 'ID': '11', 'Replacement': "'LLC'"} 
{'Pattern': "r'(?i)Chtd'", 'ID': '12', 'Replacement': "'Chartered'"}
{'Pattern': "r'(?i)Mfg\\b\\.?'", 'ID': '13', 'Replacement': "'Manufacturing'"}
{'Pattern': 'r"Nat\'l"', 'ID': '14', 'Replacement': "'National'"}
{'Pattern': "r'(?i)Flordia'", 'ID': '15', 'Replacement': "'Florida'"}
{'Pattern': "r'(?i)\\bLtd\\b\\.?'", 'ID': '16', 'Replacement': "'Ltd.'"}
{'Pattern': "r'(?i)\\bCo\\b\\.?'", 'ID': '17', 'Replacement': "'Company'"}
{'Pattern': "r'(?i)\\bDept\\b\\.?i\\'", 'ID': '18', 'Replacement': "'Department'"}
{'Pattern': "r'(?i)Califronia'", 'ID': '19', 'Replacement': "'California'"}
{'Pattern': "r'(?i)\\bJohn\\bHopkins\\b'", 'ID': '20', 'Replacement': "'Johns Hopkins'"}
{'Pattern': "r'(?i)\\bOrg\\b\\.?'", 'ID': '21', 'Replacement': "'Organization'"}
{'Pattern': "r'(?i)^[T]he\\s'", 'ID': '22', 'Replacement': "''"}
{'Pattern': "r'(?i)\\bAuth\\b\\.?'", 'ID': '23', 'Replacement': "'Authority'"}
{'Pattern': "r'.'", 'ID': '24', 'Replacement': "''"}
{'Pattern': "r','", 'ID': '25', 'Replacement': "''"}
{'Pattern': "r'(?i)\\s+'", 'ID': '0', 'Replacement': "''"}

And here are a few lines of the csv file (Opened in TextMate)

0,r'(?i)\s+',''
1,r'(?i)&','and'
2,r'(?i)\bAssoc\b\.?','Association'
3,r'(?i)\bInc\b\.?','Inc.'
  • Could you post an exact test-case input? Commented Jul 7, 2011 at 17:15
  • There are two issues with your code. First, there is an extra }; second, I think you mean row["Replacement"] instead of row[replacement]. Commented Jul 7, 2011 at 17:17
  • Seconding CPP: to help you, we really need to know what the values of row['Pattern'], row['Replacement'], and v are. Commented Jul 7, 2011 at 17:29
  • Sorry! Yeah, I meant row["Replacement"]; that was a typo (I had a string formatted "%(replacement)s" % {"replacement": row["Replacement"]} before, but I took it out for readability). Fixed now. Commented Jul 7, 2011 at 17:36
  • @Kevin: Don't post the input in a comment; add it to the original post by editing. Commented Jul 7, 2011 at 17:48

2 Answers


Your issue is that your pattern values are not actually the regex patterns you want: each regex is wrapped in an additional layer of string quoting, so the r'...' syntax is part of the pattern text itself.

For example, in your dictionary you have the value "r'.'", which you are using as a pattern. Your code will run re.sub("r'.'", "", v), which probably isn't what you want:

>>> re.sub("r'.'", "", "This . won't match")
"This . won't match"
>>> re.sub("r'.'", "", "This r'x' will match")
'This  will match'

To fix this you should go back to where you are adding the regex to the dictionary and stop doing whatever is causing the string wrapping. It might be something like row['Pattern'] = repr(regex).

If you need to keep the dictionary the same for some reason, be very careful with eval: if the strings come from an untrusted source, eval is a big security risk. Use ast.literal_eval instead.
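As a rough illustration of the ast.literal_eval route, here is a minimal sketch. It assumes the CSV has a header row of ID,Pattern,Replacement (inferred from the keys in the printed rows) and uses an in-memory stand-in for normalize_patterns.csv; the helper names load_rules and normalize are hypothetical, not part of the original code.

```python
import ast
import csv
import io
import re

# Hypothetical stand-in for normalize_patterns.csv. The header row is an
# assumption based on the keys in the printed rows; the data rows are taken
# from the question.
SAMPLE_CSV = r"""ID,Pattern,Replacement
1,r'(?i)&','and'
2,r'(?i)\bAssoc\b\.?','Association'
5,r'(?i)\bUniv\b\.?','University'
"""

def load_rules(csv_file):
    """Unwrap each quoted literal and return (compiled regex, replacement) pairs."""
    rules = []
    for row in csv.DictReader(csv_file):
        # literal_eval only parses Python literals, so "r'...'" becomes the
        # plain pattern string without executing arbitrary code.
        pattern = ast.literal_eval(row['Pattern'])
        replacement = ast.literal_eval(row['Replacement'])
        rules.append((re.compile(pattern), replacement))
    return rules

def normalize(text, rules):
    """Apply every rule to the text, in file order."""
    for regex, replacement in rules:
        text = regex.sub(replacement, text)
    return text

rules = load_rules(io.StringIO(SAMPLE_CSV))
print(normalize("Smith & Jones Assoc.", rules))  # Smith and Jones Association
```

A value that is not a valid Python literal makes ast.literal_eval raise ValueError or SyntaxError, so malformed rows in the file surface as errors instead of silently never matching.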


1 Comment

Definitely makes sense. eval produced some errors with '\.', since the backslash is read as a literal and translated to '\\.'. Thanks, Andrew!

If you remove the r'' around the pattern, it will work.

So the pattern that matches . should be as simple as '\.' instead of "r'\.'"

The problem is that the r in your pattern is taken as a literal r character instead of as the raw-string prefix.

So you can also try: v=re.sub(eval(row['Pattern']), row['Replacement'], v)
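To illustrate the first suggestion (storing bare patterns in the file), here is a minimal sketch. The CSV contents below are hypothetical: the header row is assumed, and the rows are rewritten from the question with both the r prefix and the surrounding quotes removed, since the csv module treats single quotes as ordinary characters by default.

```python
import csv
import io
import re

# Hypothetical CSV contents with bare patterns: no r prefix and no quotes.
# The backslashes reach re.sub unchanged because csv does not interpret them.
BARE_CSV = r"""ID,Pattern,Replacement
1,(?i)&,and
2,(?i)\bAssoc\b\.?,Association
24,\.,
"""

v = "Smith & Jones Assoc., Inc."
for row in csv.DictReader(io.StringIO(BARE_CSV)):
    # Each field is already the regex text itself, so no eval is needed.
    v = re.sub(row['Pattern'], row['Replacement'], v)

print(v)  # Smith and Jones Association, Inc
```

With this layout the original loop from the question works as written, and nothing read from the file is ever evaluated as Python code.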

4 Comments

Sorry, doesn't seem to have done anything: I think the r is optional with python regex.
r means raw string. If you have r in the value of the pattern, it will be a literal r instead of telling Python that what follows is a raw string.
So for the inputs outputs I've changed to: 0,'(?i)\s+','' 1,'(?i)&','and' 2,'(?i)\bAssoc\b\.?','Association' Was this what you meant by remove the 'r' character?
See my edit. You can put row['Pattern'] into an eval to make use of the r prefix
