I'm experimenting with group backreferences in Python's re module to try to understand them, and I'm not having much luck.

import re

leftQuotes = re.compile("((\"|\“)([\w|\d]))")
rightQuotes = re.compile("(([\w|\d])(\"|\”))")

s = "This is “problematic”"

s = re.sub(leftQuotes, r'‘\3', s)
s = re.sub(rightQuotes, r'’\3', s)

print(s)

Output:

This is ‘problemati’”

In the first re.sub(), I successfully replaced the left double quotation mark with a single left quotation mark while keeping the matched character (in this case, a "p"). But the right side doesn't behave the same way, regardless of which group backreference I use (\1, \2, or \3).

Results of backreferences:

\1: ‘problemati’c” 
\2: ‘problemati’c 
\3: ‘problemati’”
  • You overcaptured it. Use s = re.sub(rightQuotes, r'\2’', s), or better, remove the unnecessary groups so that only the one you need remains, then use the Group 1 backreference. Commented Sep 29, 2017 at 19:59
  • @WiktorStribiżew That gives me This is ‘problemati’c Commented Sep 29, 2017 at 20:00
  • @WiktorStribiżew As I'd said originally in the post, all backreferences resulted in undesirable output. \1 gets This is ‘problemati’c” Commented Sep 29, 2017 at 20:03
  • As I said, you overcapture it, see ideone.com/rjLs4N Commented Sep 29, 2017 at 20:06

1 Answer

To fix your code, replace the second sub with:

s = re.sub(rightQuotes, r'\2’', s)

This works because, in the second pattern, the word character is the second capture group, and it has to come before the single quote in the replacement string.
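To see why \2 is the right reference, it can help to print what each group captures. This is only a small sketch: the names right_quotes, m and fixed are mine, and the pattern is the one from the question rewritten as a raw string. Groups are numbered by the position of their opening parenthesis.

```python
import re

# Pattern from the question, written as a raw string.
right_quotes = re.compile(r'(([\w|\d])("|”))')

m = right_quotes.search("This is “problematic”")

# Group 1 is the outer group, group 2 the character, group 3 the quote.
print(repr(m.group(1)))
print(repr(m.group(2)))
print(repr(m.group(3)))

# Keep the character (group 2) and put the new quote after it.
fixed = right_quotes.sub(r'\2’', "This is ‘problematic”")
print(fixed)
```

Since group 3 holds only the quote, r'’\3' throws away the "c" and writes the old quote back, which matches the output shown in the question.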


Besides, you don't really need capture groups here; lookarounds would be cleaner. (It's not critical, but as @CasimiretHippolyte's comment points out, quoting the pattern with single quotes saves you from escaping the double quote.)

import re

# Lookarounds check for the adjacent word character without consuming it
leftQuotes = re.compile(r'(?:"|“)(?=\w)')
rightQuotes = re.compile(r'(?<=\w)(?:"|”)')

s = "This is “problematic”"

s = re.sub(leftQuotes, '‘', s)
s = re.sub(rightQuotes, '’', s)

print(s)
# This is ‘problematic’

Also, since \w already includes \d, [\w|\d] can be replaced by \w.
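A quick check of that simplification, as a sketch with variable names of my own choosing. One side effect worth knowing: inside a character class the | is a literal pipe, not alternation, so [\w|\d] also matches "|" itself, which is probably not intended.

```python
import re

with_pipe = re.compile(r'[\w|\d]')
word_only = re.compile(r'\w')

# '|' is an ordinary character inside [...], so the original class accepts it.
pipe_in_class = bool(with_pipe.fullmatch('|'))
pipe_in_word = bool(word_only.fullmatch('|'))

# \w on its own already covers digits.
digit_in_word = bool(word_only.fullmatch('7'))

print(pipe_in_class, pipe_in_word, digit_in_word)
```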


3 Comments

Instead of a lookahead or a lookbehind, it's simpler to use a word boundary.
@CasimiretHippolyte Good call.
I meant something like: (?:"|“)\b and \b(?:"|”)
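For reference, the word-boundary idea from these comments might look like the sketch below. It assumes the second pattern is meant to match the right-hand quote (”), and the variable names are mine rather than the commenter's.

```python
import re

# \b is zero-width, so only the quote itself is matched and replaced.
left_quotes = re.compile(r'(?:"|“)\b')
right_quotes = re.compile(r'\b(?:"|”)')

s = "This is “problematic”"
s = left_quotes.sub('‘', s)
s = right_quotes.sub('’', s)

print(s)
```

Like the lookaround version, this only replaces quotes that sit directly next to a word character.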
