18

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple:

import re
re.sub("a*", "a", "aaaa") # gives 'a'

What if I want to replace every run of duplicate chars (i.e. any of a-z) with a single occurrence of that respective char? How do I do this?

import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'

NOTE: I know that removing duplicates can be better tackled with a hash table or some O(n^2) algorithm, but I want to explore this using regexes.

3 Answers

55
>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'

The () around the [a-z] specify a capture group, and then the \1 (a backreference) in both the pattern and the replacement refer to the contents of the first capture group.

Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter", and then the entire matched portion is replaced with a single occurrence of that letter.
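As a minimal sketch, the same pattern can be wrapped in a small helper and run against the two inputs from the question. The name `squeeze_repeats` is made up for illustration and is not part of the `re` module.

```python
import re

# Hypothetical helper name, not part of the standard library.
def squeeze_repeats(text):
    """Collapse each run of a repeated lowercase letter to one letter."""
    # ([a-z]) captures one letter; \1+ requires one or more repeats of it.
    return re.sub(r'([a-z])\1+', r'\1', text)

print(squeeze_repeats('aabb'))           # ab
print(squeeze_repeats('abbccddeeffgg'))  # abcdefg
```

Letters that occur only once are never matched, because `\1+` needs at least one repeat, so they pass through unchanged.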

On a side note...

Your example code for just a is actually buggy:

>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'

You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
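A quick way to see the difference is to list what each pattern actually matches. This is only a sketch using `re.findall` and `re.sub`; the output of the `a*` substitution itself is left out on purpose, because the handling of empty matches next to a non-empty match changed in Python 3.7, so the exact string depends on the interpreter version.

```python
import re

# 'a*' can match the empty string, so it "matches" at every position,
# even in text that contains no 'a' at all.
print(re.findall('a*', 'bc'))  # ['', '', '']

# 'a+' needs at least one 'a', so text without 'a' yields no matches.
print(re.findall('a+', 'bc'))  # []

# With 'a+', only the real run of a's is collapsed.
fixed = re.sub('a+', 'a', 'aaabbbccc')
print(fixed)  # abbbccc
```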


1 Comment

For the buggy a* situation discussed in "On side note", I have run the example with a result of 'aabababacacaca' (one more leading a than yours, Python 3.10.8 [MSC v.1933 64 bit (AMD64)]). I think the explanation should be that "the * operator will match empty strings in between two non-a characters", and also between one a and one non-a character.
3

If you are also interested in removing non-contiguous duplicates, you have to wrap the substitution in a loop, for example:

import re

s = "ababacbdefefbcdefde"

# Each pass removes a later duplicate of a letter; repeat until none remain.
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)

print(s)  # prints 'abcdef'
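The loop above runs the pattern twice per pass (once in `re.search`, once in `re.sub`). One possible variation, sketched here with a hypothetical helper name, uses `re.subn`, which also returns how many substitutions were made, so the loop can stop when that count reaches zero.

```python
import re

# Same pattern as above, compiled once.
DUPLICATE = re.compile(r'([a-z])(.*)\1')

# Hypothetical helper name, for illustration only.
def drop_later_duplicates(text):
    """Keep the first occurrence of each lowercase letter."""
    while True:
        # subn returns (new_string, number_of_substitutions_made).
        text, count = DUPLICATE.subn(r'\1\2', text)
        if count == 0:
            return text

print(drop_later_duplicates('ababacbdefefbcdefde'))  # abcdef
```

Each substitution drops the trailing copy of a repeated letter, so the first occurrence always survives and the original order is kept.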

4 Comments

Or: s = ''.join(set(s)) ;) (Ok, not a regexp)
Does this work? An example: s = 'good people understand'; while re.search(r'([a-z])(.*)\1', s): s = re.sub(r'([a-z])(.*)\1', r'\1\2', s); print(s) # prints "god pel unrsta"
@OlegMelnikov It does reduce each character to a single occurrence, so that looks good to me. It does not reduce the two spaces, so you still get two in the output string. But space is not included in the regex, so I think that's ok too. If this bothers you, you have to tweak the character class in the regex.
Hi Thomas. You are correct. My bad. In fact I see you bolded "non-contiguous" :) thanks for clarifying. I'll leave my example for others for clarification.
0

A solution that covers all character categories, not just letters:

re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')

gives:

'ab['
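One detail worth keeping in mind: by default `.` does not match a newline, so runs of blank lines are left alone. A minimal sketch, assuming you want those collapsed as well, passes `flags=re.DOTALL`; the sample text is made up for illustration.

```python
import re

text = 'aa\n\n\nbb'

# Default: '.' skips '\n', so the three newlines are kept.
default_result = re.sub(r'(.)\1+', r'\1', text)
print(repr(default_result))  # 'a\n\n\nb'

# With re.DOTALL, '.' also matches '\n', so the newlines collapse too.
dotall_result = re.sub(r'(.)\1+', r'\1', text, flags=re.DOTALL)
print(repr(dotall_result))  # 'a\nb'
```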

2 Comments

It works, interesting. But what about correctly spelled words that contain a double char, like: tell, smell, dwell, mall?
You add filters? Don't expect regex to know English.
