Remove duplicate chars using regex?

Question

Let's say I want to remove all duplicate chars (of a particular char) in a string using regular expressions. This is simple -

import re
re.sub("a*", "a", "aaaa") # gives 'a'

What if I want to replace all duplicate chars (e.g. any of a-z) with a single instance of that respective char? How do I do this?

import re
re.sub('[a-z]*', <what_to_put_here>, 'aabb') # should give 'ab'
re.sub('[a-z]*', <what_to_put_here>, 'abbccddeeffgg') # should give 'abcdefg'

NOTE: I know removing duplicates can be better tackled with a hash table or some O(n^2) algorithm, but I want to explore this using regexes.

Amber · Accepted Answer · 2011-01-01 21:32:40Z

55

>>> import re
>>> re.sub(r'([a-z])\1+', r'\1', 'ffffffbbbbbbbqqq')
'fbq'

The () around the [a-z] specify a capture group, and the \1 (a backreference) in both the pattern and the replacement refers to the contents of the first capture group.

Thus, the regex reads "find a letter, followed by one or more occurrences of that same letter", and then the entire matched portion is replaced with a single occurrence of that letter.
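
As a minimal sketch, here is the same pattern applied to the two inputs from the question. The helper name `collapse_runs` is made up for illustration and is not part of the `re` module.

```python
import re

def collapse_runs(text):
    """Collapse each run of a repeated lowercase letter to one letter."""
    # ([a-z]) captures a single letter; \1+ then requires one or more
    # further copies of that same captured letter.
    return re.sub(r'([a-z])\1+', r'\1', text)

print(collapse_runs('aabb'))           # ab
print(collapse_runs('abbccddeeffgg'))  # abcdefg
```

Letters that are not repeated are left alone, because `\1+` needs at least one extra copy before the pattern matches.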

On a side note...

Your example code for just a is actually buggy:

>>> re.sub('a*', 'a', 'aaabbbccc')
'abababacacaca'

You really would want to use 'a+' for your regex instead of 'a*', since the * operator matches "0 or more" occurrences, and thus will match empty strings in between two non-a characters, whereas the + operator matches "1 or more".
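
A short sketch of the difference, assuming nothing beyond the standard `re` module. The variable names are only for illustration.

```python
import re

# 'a*' can match the empty string, so it matches at every position of a
# string that contains no 'a' at all.
empty_matches = re.findall('a*', 'bb')
print(empty_matches)  # ['', '', '']

# 'a+' needs at least one 'a', so only the real run of a's is replaced.
fixed = re.sub('a+', 'a', 'aaabbbccc')
print(fixed)  # abbbccc
```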

edited Jan 1, 2011 at 21:32

answered Jan 1, 2011 at 15:28

Amber



1 Comment

rustyhu Over a year ago

For the buggy a* situation discussed in "On side note", I have run the example with a result of 'aabababacacaca' (one more leading a than yours, Python 3.10.8 [MSC v.1933 64 bit (AMD64)]). I think the explanation should be that "the * operator will match empty strings in between two non-a characters", and also between one a and one non-a character.

ThomasH · Accepted Answer · 2011-01-01 20:25:51Z

3

In case you are also interested in removing duplicates of non-contiguous occurrences, you have to wrap things in a loop, e.g. like this:

 import re

 s = "ababacbdefefbcdefde"

 # Each pass drops a later copy of a repeated letter; loop until none remain.
 while re.search(r'([a-z])(.*)\1', s):
     s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)

 print(s)  # prints 'abcdef'
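
One possible variation on this loop, offered only as a sketch: `re.subn` returns the number of substitutions it made, so the separate `re.search` call can be dropped. The function name `dedupe_letters` is hypothetical.

```python
import re

def dedupe_letters(s):
    """Keep only the first occurrence of each lowercase letter."""
    pattern = re.compile(r'([a-z])(.*)\1')
    while True:
        # subn returns (new_string, number_of_substitutions_made).
        s, count = pattern.subn(r'\1\2', s)
        if count == 0:
            return s

print(dedupe_letters("ababacbdefefbcdefde"))  # abcdef
```

Characters outside `[a-z]`, such as spaces, are never removed, since the capture group only matches lowercase letters.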

answered Jan 1, 2011 at 20:25

ThomasH


4 Comments

Lennart Regebro Over a year ago

Or: s = ''.join(set(s)) ;) (Ok, not a regexp)

Oleg Melnikov Over a year ago

Does this work? An example:

s = 'good people understand'
while re.search(r'([a-z])(.*)\1', s):
    s = re.sub(r'([a-z])(.*)\1', r'\1\2', s)
print(s)  # prints "god pel unrsta"

ThomasH Over a year ago

@OlegMelnikov So, it does reduce each character to a single occurrence, so that looks good to me. It does not reduce the two spaces, so you still get two in the output string. But space is not included in the regex, so I think that's ok too. If this bothers you, you will have to tweak the character class in the regex.

Oleg Melnikov Over a year ago

Hi Thomas. You are correct. My bad. In fact I see you bolded "non-contiguous" :) thanks for clarifying. I'll leave my example for others for clarification.

Joshua Varghese · Accepted Answer · 2020-05-17 16:19:44Z

0

A solution that covers all character categories, not just letters:

re.sub(r'(.)\1+', r'\1', 'aaaaabbbbbb[[[[[')

gives:

'ab['
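
One caveat worth sketching: by default `.` does not match a newline, so runs of blank lines are left untouched unless the `re.DOTALL` flag is passed. The sample string and variable names below are only for illustration.

```python
import re

text = 'aa\n\n\nbb'

# Default: '.' skips newlines, so the run of '\n' is not collapsed.
default_result = re.sub(r'(.)\1+', r'\1', text)

# With re.DOTALL, '.' also matches '\n', so that run collapses too.
dotall_result = re.sub(r'(.)\1+', r'\1', text, flags=re.DOTALL)

print(repr(default_result))  # 'a\n\n\nb'
print(repr(dotall_result))   # 'a\nb'
```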

edited May 17, 2020 at 16:19

answered May 17, 2020 at 16:08

Joshua Varghese


2 Comments

frozenade Over a year ago

it works, interesting. but what about the correct phrase which has double char, like: tell, smell, dwell, mall.

Joshua Varghese Over a year ago

You add filters? Don't expect regex to know English.
