Removing duplicates in a string in Python

Question

What is an efficient algorithm to removing all duplicates in a string?

For example : aaaabbbccdbdbcd

Required result: abcd

Do you need to maintain / impose order on the result?

Rob Fonseca-Ensor
– Rob Fonseca-Ensor

2010-02-18 07:13:52 +00:00
Commented Feb 18, 2010 at 7:13 — Rob Fonseca-Ensor
– Rob Fonseca-Ensor, Commented Feb 18, 2010 at 7:13

cletus · Accepted Answer · 2010-02-18 07:12:14Z

19

You use a hashtable to store currently discovered keys (access O(1)) and then loop through the array. If a character is in the hashtable, discard it. If it isn't add it to the hashtable and a result string.

Overall: O(n) time (and space).

The naive solution is to search for the character is the result string as you process each one. That O(n²).

answered Feb 18, 2010 at 7:12

cletus

627k169 gold badges922 silver badges945 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

Adriaan Stander Over a year ago

+1, Or if they have accss to it HashSet msdn.microsoft.com/en-us/library/bb495294.aspx

Ritsaert Hornstra Over a year ago

If you have a large string compared the possible # vakues of the characters (eg like if it is ASCII), you might use a =n array of bools instead on a hashtable

Thomas Jung Over a year ago

The best case to retrieve a value from hashtable is O(1) and the worst case O(n). The overall worst case complexity for the algorithm is O(n^2).

jk. Over a year ago

that is irrelavent in this case as you by definition of the algo have either 0 or 1 item for each hash key

Thomas Jung Over a year ago

@jk The hashtable has always 0 or 1 entries for a key. The worst case O(n) is that all n values are in one bucket.

|

Community · Accepted Answer · 2017-05-23 12:01:02Z

5

This closely related to the question: Detecting repetition with infinite input.

The hashtable approach may not be optimal depending on your input. Hashtables have a certain amount of overhead (buckets, entry objects). It is huge overhead compared to the actual stored char. (If you target environment is Java it is even worse as the HashMap is of type Map<Character,?>.) The worse case runtime for a Hashtable access is O(n) due to collisions.

You need only 8kb too represent all 2-byte unicode characters in a plain BitSet. This may be optimized if your input character set is more restricted or by using a compressed BitSets (as long as you have a sparse BitSet). The runtime performance will be favorable for a BitSet it is O(1).

edited May 23, 2017 at 12:01

CommunityBot

11 silver badge

answered Feb 18, 2010 at 8:28

Thomas Jung

33.2k9 gold badges86 silver badges117 bronze badges

2 Comments

Matthieu M. Over a year ago

I am afraid to mention that you are mixing (somehow) concepts and implementations. I view the fact that you are using a BitSet to implement your own HashTable as a proof that the HashTable is a perfectly viable solution.

Thomas Jung Over a year ago

@Matthieu Using a Hashtable or a BitSet has certain trade-offs. The hashtable works best for small sets of characters. The BitSet works best when the number of characters is large or can be restricted to a known range. A BitSet is not a Hashtable. The Hashtable here is used as a Set as someone mentioned. The BitSet is used analogous. If you can replace one with the other does not mean that they are equally good solutions.

John La Rooy · Accepted Answer · 2010-02-25 00:16:03Z

5

In Python

>>> ''.join(set("aaaabbbccdbdbcd"))
'acbd'

If the order needs to be preserved

>>> q="aaaabbbccdbdbcd"                    # this one is not
>>> ''.join(sorted(set(q),key=q.index))    # so efficient
'abcd'

or

>>> S=set()
>>> res=""
>>> for c in "aaaabbbccdbdbcd":
...  if c not in S:
...   res+=c
...   S.add(c)
... 
>>> res
'abcd'

or

>>> S=set()
>>> L=[]
>>> for c in "aaaabbbccdbdbcd":
...  if c not in S:
...   L.append(c)
...   S.add(c)
... 
>>> ''.join(L)
'abcd'

In python3.1

>>> from collections import OrderedDict
>>> ''.join(list(OrderedDict((c,0) for c in "aaaabbbccdbdbcd").keys()))
'abcd'

edited Feb 25, 2010 at 0:16

answered Feb 18, 2010 at 7:51

John La Rooy

306k54 gold badges378 silver badges513 bronze badges

2 Comments

Carson Myers Over a year ago

I knew set would be awesome for this, but I'm new to python and was trying to figure out how to join them while you posted this... Now I know!

John La Rooy Over a year ago

@recursive, I added some order preserving options

SPWorley · Accepted Answer · 2010-02-18 07:13:15Z

2

Keep an array of 256 "seen" booleans, one for each possible character. Stream your string. If you haven't seen the character before, output it and set the "seen" flag for that character.

answered Feb 18, 2010 at 7:13

SPWorley

11.9k9 gold badges45 silver badges63 bronze badges

1 Comment

user184968 Over a year ago

It has not been told what coding is used, though

Stano · Accepted Answer · 2012-07-02 18:38:12Z

2

PHP algorythm - O(n):

function remove_duplicate_chars($str) {
    if (2 > $len = strlen($str)) {
        return $str;
    }
    $flags = array_fill(0,256,false);
    $flags[ord($str[0])]=true;
    $j = 1;
    for ($i=1; $i<$len; $i++) {
        $ord = ord($str[$i]);
        if (!$flags[$ord]) {
            $str[$j] = $str[$i];
            $j++;
            $flags[$ord] = true;
        }
    }
    if ($j<$i) { //if duplicates removed
        $str = substr($str,0,$j);
    }
    return $str;
}

echo remove_duplicate_chars('aaaabbbccdbdbcd'); // result: 'abcd'

edited Jul 2, 2012 at 18:38

answered Jul 2, 2012 at 18:13

Stano

8,9696 gold badges33 silver badges46 bronze badges

Comments

Kara · Accepted Answer · 2014-02-05 18:19:59Z

2

You Can Do this in O(n) only if you are using HashTable. Code is given below Please Note- It is assumed that number of possible characters in input string are 256

void removeDuplicates(char *str)
{
 int len = strlen(str); //Gets the length of the String
 int count[256] = {0};  //initializes all elements as zero
 int i;
     for(i=0;i<len;i++)
     {
        count[str[i]]++;  
        if(count[str[i]] == 1)
          printf("%c",str[i]);                  
     }     
}

edited Feb 5, 2014 at 18:19

Kara

6,23616 gold badges54 silver badges58 bronze badges

answered Feb 5, 2014 at 17:56

RahulKT

1714 bronze badges

Comments

TheMan · Accepted Answer · 2013-07-14 20:24:54Z

1

#include <iostream>
#include<string>
using namespace std;
#define MAX_SIZE 256

int main()
{
    bool arr[MAX_SIZE] = {false};

    string s;
    cin>>s;
    int k = 0;

    for(int i = 0; i < s.length(); i++)
    {
        while(arr[s[i]] == true && i < s.length())
        {
            i++;
        }
        if(i < s.length())
        {
            s[k]    = s[i];
            arr[s[k]] = true;
            k++;
        }
    }
    s.resize(k);

    cout << s<< endl; 

    return 0;
}

answered Jul 14, 2013 at 20:24

TheMan

7138 silver badges11 bronze badges

4 Comments

syntagma Over a year ago

What does the arr[s[i]] == true mean?

TheMan Over a year ago

Just having arr[s[i]] there would do too. Is that what you meant?

syntagma Over a year ago

arr[s[i]] is what I'm interested in - do I understand correctly that you use char (which is an int) to index the array?

TheMan Over a year ago

Yes, that's right and it should work. What's the issue with that?

Amgad Fahmi · Accepted Answer · 2010-02-18 07:45:26Z

0

  string newString = new string("aaaaabbbbccccdddddd".ToCharArray().Distinct().ToArray());

or

 char[] characters = "aaaabbbccddd".ToCharArray();
                string result = string.Empty ;
                foreach (char c in characters)
                {
                    if (result.IndexOf(c) < 0)
                        result += c.ToString();
                }

edited Feb 18, 2010 at 7:45

answered Feb 18, 2010 at 7:13

Amgad Fahmi

4,3493 gold badges22 silver badges18 bronze badges

3 Comments

Rob Fonseca-Ensor Over a year ago

O(n^2) isn't very efficient... (once the data set gets big enough). For small strings this is probably faster than a hashset based lookup though

Amgad Fahmi Over a year ago

i agree , what about the new one ?

cjk Over a year ago

String concatenation in a loop will be slower than the searching of the character within the string...

Jerry Coffin · Accepted Answer · 2010-02-18 08:48:51Z

0

In C++, you'd probably use an std::set:

std::string input("aaaabbbccddd");
std::set<char> unique_chars(input.begin(), input.end());

In theory you could use std::unordered_set instead of std::set, which should give O(N) expected overall complexity (though O(N²) worst case), where this one is O(N lg M) (where N=number of total characters, M=number of unique characters). Unless you have long strings with a lot of unique characters, this version will probably be faster though.

answered Feb 18, 2010 at 8:48

Jerry Coffin

494k83 gold badges656 silver badges1.2k bronze badges

Comments

fredoverflow · Accepted Answer · 2010-02-24 23:48:41Z

0

You can sort the string and then remove the duplicate characters.

#include <iostream>
#include <algorithm>
#include <string>

int main()
{
    std::string s = "aaaabbbccdbdbcd";

    std::sort(s.begin(), s.end());
    s.erase(std::unique(s.begin(), s.end()), s.end());

    std::cout << s << std::endl;
}

answered Feb 24, 2010 at 23:48

fredoverflow

265k101 gold badges403 silver badges679 bronze badges

Comments

jbrennan · Accepted Answer · 2010-02-25 00:09:38Z

0

This sounds like a perfect use for automata.

answered Feb 25, 2010 at 0:09

jbrennan

12k14 gold badges77 silver badges116 bronze badges

Comments

JoeG · Accepted Answer · 2010-02-25 00:26:37Z

0

C++ - O(n) time, O(1) space, and the output is sorted.

std::string characters = "aaaabbbccddd";
std::vector<bool> seen(std::numeric_limits<char>::max()-std::numeric_limits<char>::min());

for(std::string::iterator it = characters.begin(), endIt = characters.end(); it != endIt; ++it) {
  seen[(*it)-std::numeric_limits<char>::min()] = true;
}

characters = "";
for(char ch = std::numeric_limits<char>::min(); ch != std::numeric_limits<char>::max(); ++ch) {
  if( seen[ch-std::numeric_limits<char>::min()] ) {
    characters += ch;
  }
}

answered Feb 25, 2010 at 0:26

JoeG

13.2k1 gold badge40 silver badges64 bronze badges

2 Comments

letsc Over a year ago

O(1) space??? I see a vector of bools.. Isnt is same as an array of bools of 256 characters??

JoeG Over a year ago

@smartmuki: It's O(1) space because the size of the vector<bool> does not vary according to the size of the input - it's 256 bools no matter what the input is.

infinitloop · Accepted Answer · 2011-05-04 17:33:09Z

0

in C this is how i did it: O(n) in time since we only have one for loop.

void remDup(char *str)
{
    int flags[256] = { 0 };

    for(int i=0; i<(int)strlen(str); i++) {
        if( flags[str[i]] == 0 )
            printf("%c", str[i]);

        flags[str[i]] = 1;
    }
}

answered May 4, 2011 at 17:33

infinitloop

3,0118 gold badges40 silver badges57 bronze badges

Comments

ACC · Accepted Answer · 2012-11-07 07:31:51Z

0

import java.util.HashSet;

public class RemoveDup {

    public static String Duplicate()
    {
        HashSet h = new HashSet();
        String value = new String("aaaabbbccdbdbcd");
        String finalString = new String();
        int stringLength = value.length();
        for (int i=0;i<=stringLength-1;i++)
        {
            if(h.add(value.charAt(i)))
            {
                finalString = finalString + (value.charAt(i));
            }


        }
        return finalString;

    }
public static void main(String[] args) {


        System.out.println(Duplicate());
    }
}

edited Nov 7, 2012 at 7:31

ACC

2,5906 gold badges37 silver badges64 bronze badges

answered Sep 30, 2012 at 3:41

SuperMan

3,59412 gold badges47 silver badges49 bronze badges

Comments

Anwit · Accepted Answer · 2013-03-29 19:06:49Z

get a list of first 26 prime numbers.. Now you can map each character (a,b,c,d etc) to each prime number.. (alphabetically say a=2, b=3, c=5 etc.. or depending upon relative abundance of the characters like most frequently used letter with lower prime say e=2, r=3, a=5 etc)...store that mapping in an integer array int prime[26]..

iterate through all the characters of the string

i=0;
int product = 1;
while(char[i] != null){
   if(product % prime[i] == 0)
      the character is already present delete it
   else
      product = product*prime[i];
}

this algorithm will work in O(n) time.. with O(1) space requirement It will work well when number of distinct character are less in the string... other wise product will exceed "int" range and we have to handle that case properly

user2409054 · Accepted Answer · 2014-11-28 16:11:13Z

0

int main()    
{    
    std::string s = "aaacabbbccdbdbcd";

    std::set<char> set1;
    set1.insert(s.begin(), s.end());

    for(set<char>::iterator it = set1.begin(); it!= set1.end(); ++it)
    std::cout << *it;

    return 0;
}

std::set takes O(log n) to insert

answered Nov 28, 2014 at 16:11

user2409054

1311 silver badge7 bronze badges

Comments

Mourya Gudladona · Accepted Answer · 2016-09-11 02:31:55Z

O(n) solution:

#include<stdio.h>
#include<string.h>
#include<stdlib.h>

void removeDuplicates(char *);

void removeDuplicates(char *inp)
{
        int i=0, j=0, FLAG=0, repeat=0;

     while(inp[i]!='\0')
    {
        if(FLAG==1)
        {
                inp[i-repeat]=inp[i];
        }
        if(j==(j | 1<<(inp[i]-'\0')))
        {
                repeat++;
                FLAG=1;
        }
                j= j | 1<<(inp[i]-'\0');
                i++;
    }

     inp[i-repeat]='\0';
}

int main()
{
     char inp[100] = "aaAABCCDdefgccc";
    //char inp[100] = "ccccc";
    //char inp[100] = "\0";
    //char *inp = (char *)malloc(sizeof(char)*100);

    printf (" INPUT STRING : %s\n", inp);

     removeDuplicates(inp);

    printf (" OUTPUT STRING : %s:\n", inp);
    return 1;
}

framontb · Accepted Answer · 2019-05-22 09:17:23Z

0

Perhaps the use of built in Python functions are more efficient that those "self made". Like this:

=====================

NOTE: maintain order

CODE

string = "aaabbbccc"

product = reduce((lambda x,y: x if (y in x) else x+y), string)

print product

OUTPUT

abc

=========================

NOTE: order neglected

CODE

string = "aaabssabcdsdwa"

str_uniq = ''.join(set(string))

print str_uniq

OUTPUT

acbdsw

answered May 22, 2019 at 9:17

framontb

2,0771 gold badge17 silver badges34 bronze badges

Comments

Mousa Mohammed · Accepted Answer · 2020-05-03 09:44:52Z

-1

# by using python
def cleantext(word):
    if(len(word)==1):

        return word
    if word[0]==word[1]:

        return cleantext(word[1:])

return word[0]+ cleantext(word[1:])
print(cleantext(word))

answered May 3, 2020 at 9:44

Mousa Mohammed

11 bronze badge

Collectives™ on Stack Overflow

Removing duplicates in a string in Python

19 Answers 19

9 Comments

2 Comments

2 Comments

1 Comment

Comments

Comments

4 Comments

3 Comments

Comments

Comments

Comments

2 Comments

Comments

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

19 Answers 19

9 Comments

2 Comments

2 Comments

1 Comment

Comments

Comments

4 Comments

3 Comments

Comments

Comments

Comments

2 Comments

Comments

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related