3

I have a question about regex backreferences.

I tried the regex (\w)\1{1,} to capture characters that repeat in a string, but it only matches consecutive repeats. I'm stuck on how to improve it so that it captures every repeated character, wherever it appears in the string. Some examples:

import re

str = 'capitals'

re.search(r'(\w)\1{1,}', str)

Output: None

import re

str = 'butterfly'

re.search(r'(\w)\1{1,}', str)

Output: <_sre.SRE_Match object; span=(2, 4), match='tt'>

5
  • What are you trying to match in the first example? Commented Dec 8, 2017 at 17:45
  • You can use .* before the backreference to allow anything in between the matches. Commented Dec 8, 2017 at 17:47
  • @Barmar I'm trying to match the repeated occurrences of letter a Commented Dec 8, 2017 at 18:20
  • Use r'(\w)\w*\1' Commented Dec 8, 2017 at 18:22
  • @user3722709 You still haven't said what you expect the output to be. aa or apita? Commented Dec 8, 2017 at 20:10

2 Answers

6

I would use r'(\w).*\1' so that a repeated character is matched even when other characters, including spaces or special characters, come between the two occurrences.

However, this won't work when repeated characters fall inside an earlier match, as in the string abcdabcd: the regex only reports the first repeated character (a) and misses the others (b, c, d), because they are consumed by that first match.

Check the demo: https://regex101.com/r/m5UfAe/1
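
To make the limitation concrete, here is a minimal sketch (the variable names are illustrative, not from the original answer) that shows how much of abcdabcd the first match consumes:

```python
import re

pattern = r'(\w).*\1'
text = 'abcdabcd'

# The greedy .* runs as far as the last 'a', so one match swallows 'abcda'.
first = re.search(pattern, text)
print(first.group(), first.span())   # abcda (0, 5)

# findall resumes after that match, where 'bcd' has no repeats left.
print(re.findall(pattern, text))     # ['a']
```

Because regex matches never overlap, the b, c and d inside the first match are never tested as starting points.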

So an alternative (depending on your needs) is to sort the string before matching:

import re
text = 'abcdabcde'
re.findall(r'(\w).*\1', ''.join(sorted(text)))

This returns the list of repeated characters: ['a', 'b', 'c', 'd']
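
If sorting is not desirable, one possible variation (not part of the original answer) is to put the backreference inside a lookahead, so that the later copy is checked without being consumed. The helper name repeated_chars below is hypothetical, and the sketch assumes single-line input, since . does not match a newline by default:

```python
import re

def repeated_chars(text):
    """Return each word character that occurs more than once, in first-seen order."""
    # (?=.*\1) checks for a later copy without consuming it,
    # so the scan can still examine the characters in between.
    hits = re.findall(r'(\w)(?=.*\1)', text)
    # A character seen three times is reported twice; dict.fromkeys dedupes in order.
    return list(dict.fromkeys(hits))

print(repeated_chars('abcdabcde'))     # ['a', 'b', 'c', 'd']
print(repeated_chars('capitals'))      # ['a']
print(repeated_chars('testing this'))  # ['t', 's', 'i']
```

Unlike the sorted version, the result follows the order in which characters first appear rather than alphabetical order.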

2 Comments

It worked here! But can you explain why the output is wrong when I remove the sorted built-in function? Output with sorted: re.findall(regex_pattern, ''.join(sorted("testing this".lower()))) gives ['i', 's', 't']. Output without sorted: re.findall(regex_pattern, ''.join("testing this".lower())) gives ['t'].
@user3722709 If you don't sort it, you're just returning the same string.
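
The difference raised in the comment above can be inspected with a short sketch that prints the full text of each match instead of only the captured group. It assumes regex_pattern is the answer's r'(\w).*\1'; the variable names are illustrative:

```python
import re

pattern = r'(\w).*\1'
text = 'testing this'

# Unsorted: one greedy match runs from the first 't' to the last 't'.
unsorted_matches = [m.group() for m in re.finditer(pattern, text)]
print(unsorted_matches)   # ['testing t']

# Sorted: equal characters sit next to each other, so matches cannot overlap.
sorted_text = ''.join(sorted(text))
print(repr(sorted_text))  # ' eghiinssttt'
sorted_matches = [m.group() for m in re.finditer(pattern, sorted_text)]
print(sorted_matches)     # ['ii', 'ss', 'ttt']
```

In the unsorted string the single match covers the repeated i and s as well, which is why only t is reported.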
2

I hope the code below helps you understand backreferences in Python regular expressions.

There are two kinds of records in the input string below:

  1. Employee basic info:

    • starts with @employeename and ends with employeename
    • e.g.: @daniel dxc chennai 45000 male daniel
  2. Employee designation:

    • starts with %employeename, followed by the designation, and ends with employeename%
    • e.g.: %daniel python developer daniel%

import re

# Sample input. Some basic-info records wrap onto the next line.
text = """
@daniel dxc chennai 45000 male daniel @henry infosys bengaluru 29000 male hobby- 
swimming henry
@raja zoho chennai 37000 male raja @ramu infosys bengaluru 99000 male hobby-badminton 
ramu
%daniel python developer daniel% %henry database admin henry%
%raja Testing lead raja% %ramu Manager ramu%
"""

# \1 refers back to the employee name captured by (\w+).
# \b keeps the name from matching only part of a word.
# re.DOTALL lets .*? cross the line breaks inside wrapped records.
basic_info = re.findall(r'@(\w+)\b(.*?)\b\1\b', text, re.DOTALL)
print(basic_info)

# \1 refers back to (%) and \2 to the employee name (\w+).
designation = re.findall(r'(%)(\w+)\b(.*?)\b\2\1', text, re.DOTALL)
print(designation)

# Keep only the name and the designation text from each match.
designation = [(name, role) for _, name, role in designation]
print(designation)
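
As a variation on the numbered backreferences above, Python's re module also supports named groups, where (?P<name>...) defines a group and (?P=name) refers back to it. The sketch below is only an illustration on a shortened sample; the group names name and role are chosen here and do not come from the original answer:

```python
import re

sample = '%daniel python developer daniel% %henry database admin henry%'

# (?P<name>...) names the group; (?P=name) is a backreference to it.
pattern = re.compile(r'%(?P<name>\w+)\b(?P<role>.*?)\b(?P=name)%')

designations = [
    (m.group('name'), m.group('role').strip())
    for m in pattern.finditer(sample)
]
print(designations)
# [('daniel', 'python developer'), ('henry', 'database admin')]
```

Named groups can make a longer pattern easier to read, since the backreference says which capture it repeats instead of relying on its position.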
