Summary: in this tutorial, you’ll learn about Python regex backreferences and how to apply them effectively.
Introduction to the Python regex backreferences #
Backreferences are like variables in Python. A backreference lets you refer to the text matched by an earlier capturing group within the same regular expression.
The following shows the syntax of a backreference:
\N
In the replacement string of sub(), you can alternatively use the following syntax:
\g<N>
In both forms, N can be 1, 2, 3, etc., and represents the corresponding capturing group.
Note that \g<0> refers to the entire match, which has the same value as match.group(0). Python accepts the \g<N> form only in replacement strings, not inside the pattern itself.
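As a minimal sketch of the two replacement forms, the snippet below swaps two words with re.sub(), once with \N and once with \g<N>, and then uses \g<0> to wrap the whole match. The sample string and variable names are illustrative assumptions, not part of the tutorial's running example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
s = 'hello world'
pattern = r'(\w+) (\w+)'

# \2 and \1 in the replacement refer to the second and first groups.
swapped_short = re.sub(pattern, r'\2 \1', s)

# \g<2> and \g<1> are the equivalent explicit forms.
swapped_long = re.sub(pattern, r'\g<2> \g<1>', s)

# \g<0> stands for the entire match, so this wraps it in brackets.
wrapped = re.sub(pattern, r'[\g<0>]', s)

print(swapped_short)  # world hello
print(swapped_long)   # world hello
print(wrapped)        # [hello world]
```

The \g<N> form is useful when a digit follows the reference: \g<1>0 means group 1 followed by a literal 0, whereas \10 would be read as a reference to group 10.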
Suppose you have a string with the duplicate word Python like this:
s = 'Python Python is awesome'
And you want to remove the duplicate word (Python) so that the resulting string will be:
Python is awesome
To do that, you can use a regular expression with a backreference.
First, match a word with one or more characters followed by one or more whitespace characters:
'\w+\s+'
Second, create a capturing group that contains only the word characters:
'(\w+)\s+'
Third, create a backreference that references the first capturing group:
'(\w+)\s+\1'
In this pattern, \1 is a backreference that references the (\w+) capturing group.
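A quick way to check what this pattern matches is to run it with re.search() and inspect the groups. This is only a small sketch, and the variable names whole and first are illustrative.

```python
import re

s = 'Python Python is awesome'

# (\w+) captures a word, \s+ skips whitespace, and \1 must repeat
# exactly the text captured by group 1.
match = re.search(r'(\w+)\s+\1', s)

whole = match.group(0)  # the entire match, both copies of the word
first = match.group(1)  # only the captured word

print(repr(whole))  # 'Python Python'
print(repr(first))  # 'Python'
```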
Finally, replace the entire match with the first capturing group using the sub() function from the re module:
import re
s = 'Python Python is awesome'
new_s = re.sub(r'(\w+)\s+\1', r'\1', s)
print(new_s)
Output:
Python is awesome
More Python regex backreference examples #
Let’s take some more examples of using backreferences.
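As a first variation, the duplicate-word pattern from the introduction can be made stricter. Without word boundaries, (\w+)\s+\1 also matches when the end of one word happens to repeat at the start of the next, as in 'This is'. The sketch below adds \b anchors on both sides. The names loose and strict and the first sample string are assumptions chosen to show the difference.

```python
import re

loose = r'(\w+)\s+\1'
strict = r'\b(\w+)\s+\1\b'

# No word is duplicated here, but the tail of "This" followed by "is"
# looks like a repeat to the loose pattern.
s1 = 'This is a test'
loose_result = re.sub(loose, r'\1', s1)
strict_result = re.sub(strict, r'\1', s1)
print(loose_result)   # This a test
print(strict_result)  # This is a test

# A genuine duplicate is still collapsed by the stricter pattern.
s2 = 'Python Python is awesome'
dedup_result = re.sub(strict, r'\1', s2)
print(dedup_result)   # Python is awesome
```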
1) Using Python regex backreferences to get text inside quotes #
Suppose you want to get the text within double quotes:
"This is regex backreference example"
Or single quotes:
'This is regex backreference example'
But not a mix of single and double quotes. The following will not match:
'not match"
To do this, you may use the following pattern:
'[\'"](.*?)[\'"]'
However, this pattern will also match text that starts with a single quote (') and ends with a double quote (") or vice versa. For example:
import re
s = '"Python\'s awesome". She said'
pattern = '[\'"].*?[\'"]'
match = re.search(pattern, s)
print(match.group(0))
It returns "Python' rather than "Python's awesome":
"Python'
To fix it, you can use a backreference:
r'([\'"]).*?\1'
The backreference \1 refers to the first capturing group. So if the first group captures a single quote, \1 will match only a single quote. And if it captures a double quote, \1 will match only a double quote.
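To see that behaviour in isolation, the sketch below runs the pattern against three short strings: one in double quotes, one in single quotes, and one with mismatched quotes. The sample strings and the helper name quoted() are hypothetical, introduced only for this illustration.

```python
import re

pattern = r'([\'"]).*?\1'

def quoted(text):
    """Return the first quoted span in text, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(quoted('say "hi" now'))   # "hi"
print(quoted("say 'hi' now"))   # 'hi'
print(quoted('say \'hi" now'))  # None
```

The third call returns None because neither the opening single quote nor the double quote has a matching partner of the same kind later in the string.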
For example:
import re
s = '"Python\'s awesome". She said'
pattern = r'([\'"])(.*?)\1'
match = re.search(pattern, s)
print(match.group())
Output:
"Python's awesome"
2) Using Python regex backreferences to find words that have at least one consecutive repeated character #
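The following example uses a backreference to find words that have at least one consecutive repeated character. In the pattern, (\w) captures a single character and \1 requires the same character to follow immediately: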
import re
words = ['apple', 'orange', 'strawberry']
pattern = r'\b\w*(\w)\1\w*\b'
results = [w for w in words if re.search(pattern, w)]
print(results)
Output:
['apple', 'strawberry']
Summary #
- Use a backreference \N to reference the capturing group N in a regular expression.