5

Given a string, find the first non-repeating character in it and return its index. If it doesn't exist, return -1. You may assume the string contains only lowercase letters.

I'm going to define a hash that tracks which characters are known to repeat. Traverse the string from left to right: if the current character is already in the hash, continue; otherwise, in an inner loop, scan the rest of the string for the same character. If it isn't found there, return the current index; if it is found, record it in the hash and continue.

def firstUniqChar(s):
    track = {}  # characters already known to repeat
    for index, i in enumerate(s):
        if i in track:
            continue
        elif i in s[index+1:]:  # for the last character, i in '' is False
            track[i] = 1
            continue
        else:
            return index
    return -1

print(firstUniqChar('timecomplexity'))  # 4

What's the time complexity (average and worst) of my algorithm?

2 Comments

  • "find the first non-repeating character" - should this mean e.g. for abbacdec the algorithm would return a (it doesn't repeat) or d (it exists once only)? Commented Aug 23, 2016 at 4:18
  • Non-repeating means it exists once only. @miraculixx In your case, return 5. Commented Aug 23, 2016 at 4:27

3 Answers

8

Your algorithm has time complexity O(kn), where k is the number of unique characters in the string. If k is a constant, then it is O(n). Since the problem description bounds the alphabet ("assume lower-case (ASCII) letters"), k is constant and your algorithm runs in O(n) time on this problem. Even as n grows to infinity, you only make O(1) slices of the string, so your algorithm remains O(n). If you removed track, it would be O(n²); a sketch of that variant follows the timings below:

In [36]: s = 'abcdefghijklmnopqrstuvwxyz' * 10000

In [37]: %timeit firstUniqChar(s)
100 loops, best of 3: 18.2 ms per loop

In [38]: s = 'abcdefghijklmnopqrstuvwxyz' * 20000

In [39]: %timeit firstUniqChar(s)
10 loops, best of 3: 36.3 ms per loop

In [40]: s = 'timecomplexity' * 40000 + 'a'

In [41]: %timeit firstUniqChar(s)
10 loops, best of 3: 73.3 ms per loop

The timings bear out that T(n) is still O(n): the running time scales linearly with the number of characters in the string, even though the last case is essentially the worst-case scenario for your algorithm, with no character being unique until the final 'a'.
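For contrast, here is a minimal sketch of what "removing track" means (my illustration, not code from the question). Without the hash, every position has to rescan both the prefix and the rest of the string, so the total work becomes O(n²):

def firstUniqChar_no_track(s):
    # Hypothetical track-less variant: each position rescans the whole
    # string around it, giving O(n^2) total work.
    for index, ch in enumerate(s):
        if ch not in s[:index] and ch not in s[index+1:]:
            return index
    return -1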


I will present a not-that-efficient but simple and smart method here: count the character histogram first with collections.Counter, then iterate over the characters to find the first one whose count is 1.

from collections import Counter

def first_uniq_char_ultra_smart(s):
    counts = Counter(s)  # histogram of character counts, built in O(n)
    for i, c in enumerate(s):
        if counts[c] == 1:
            return i
    return -1

first_uniq_char_ultra_smart('timecomplexity')

This has time complexity O(n): Counter builds the histogram in O(n) time, and we enumerate the string again over O(n) characters. In practice I believe my algorithm has low constant factors, because Counter is backed by a standard dictionary (but see the timings below).
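As the comments below suggest, Counter's Python-level update machinery adds a constant-factor cost; a hypothetical plain-dict variant of the same idea (my sketch, not part of the original answer) sidesteps it:

def first_uniq_char_plain_dict(s):
    # Same histogram-then-scan idea, with a plain dict instead of Counter.
    counts = {}
    for c in s:
        counts[c] = counts.get(c, 0) + 1
    for i, c in enumerate(s):
        if counts[c] == 1:
            return i
    return -1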

And let's make a very stupid brute-force algorithm. Since you can assume that the string contains only lowercase letters, let's exploit that assumption:

import string
def first_uniq_char_very_stupid(s):
    indexes = []
    for c in string.ascii_lowercase:
        if s.count(c) == 1:
            indexes.append(s.find(c))

    # default=-1 is Python 3 only
    return min(indexes, default=-1)
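On Python 2, where min has no default argument, the last line could be replaced with this equivalent (my sketch):

    return min(indexes) if indexes else -1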

Let's test my algorithms and some algorithms found in the other answers on Python 3.5. I've chosen a case that is pathologically bad for my Counter-based algorithm:

In [30]: s = 'timecomplexity' * 10000 + 'a'

In [31]: %timeit first_uniq_char_ultra_smart(s)
10 loops, best of 3: 35 ms per loop

In [32]: %timeit karin(s)
100 loops, best of 3: 11.7 ms per loop

In [33]: %timeit john(s)
100 loops, best of 3: 9.92 ms per loop

In [34]: %timeit nicholas(s)
100 loops, best of 3: 10.4 ms per loop

In [35]: %timeit first_uniq_char_very_stupid(s)
1000 loops, best of 3: 1.55 ms per loop

So, my stupid algorithm is the fastest, because all of its heavy lifting (s.count and s.find) runs in C, with only 26 Python-level iterations. And my smart algorithm is the slowest. One more reason for its bad performance, besides this being its worst case, is that on Python 3.5 OrderedDict is written in C while Counter is implemented in Python.


Let's make a better test here:

In [60]: s = string.ascii_lowercase * 10000

In [61]: %timeit nicholas(s)
100 loops, best of 3: 18.3 ms per loop

In [62]: %timeit karin(s)
100 loops, best of 3: 19.6 ms per loop

In [63]: %timeit john(s)
100 loops, best of 3: 18.2 ms per loop

In [64]: %timeit first_uniq_char_very_stupid(s)
100 loops, best of 3: 2.89 ms per loop

So it appears that my "stupid" algorithm isn't all that stupid after all: it exploits the speed of C while minimizing the number of Python-level iterations, and it wins clearly on this problem.
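The same brute-force idea can be folded into a single expression; this is my sketch (equivalent logic, Python 3 only because of default=-1):

import string

def first_uniq_char_one_expression(s):
    # Identical logic to first_uniq_char_very_stupid, written as one min().
    return min((s.find(c) for c in string.ascii_lowercase if s.count(c) == 1),
               default=-1)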


6 Comments

Ooh, nice. I like this one best so far.
@JohnZwinck Actually this was slower than Karin's, funny, I thought OrderedDict would be slower.
Interesting, I am still surprised my algorithm is faster than yours in your case and for some other strings s I tried (given that Counter is written in Python, as you said). I suggest you also time my code in Jupyter. Actually, I was pretty sure that the worst case of my algorithm is O(n^2), but what I was more interested in is the average time complexity (it seems nobody has answered that explicitly), since I found the problem here is very input-sensitive, i.e., it seems to depend heavily on s.
Mine could really be faster by a factor of 3-4 if only Counter were not implemented the way it currently is.
@NicholasLiu edited my answer, your algorithm runs in O(n) time.
5

At first glance your algorithm looks O(n²) due to the nested linear search; however, as @Antti discovered, it is actually bounded by O(kn), where k is the number of possible lowercase letters, so with k constant it is linear.

My proposition for an O(n) solution:

from collections import OrderedDict

def first_unique_char(string):
    duplicated = OrderedDict()  # ordered dict of char to boolean indicating duplicate existence
    for s in string:
        duplicated[s] = s in duplicated

    for char, is_duplicate in duplicated.items():
        if not is_duplicate:
            return string.find(char)
    return -1

print(first_unique_char('timecomplexity'))  # 4
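A hypothetical variant (my sketch, prompted by the comments below) records the first index directly, so the extra string.find pass is not needed:

from collections import OrderedDict

def first_unique_char_indexed(string):
    # Map each character to [first_index, is_duplicate], in first-seen order.
    seen = OrderedDict()
    for index, char in enumerate(string):
        if char in seen:
            seen[char][1] = True
        else:
            seen[char] = [index, False]
    for first_index, is_duplicate in seen.values():
        if not is_duplicate:
            return first_index
    return -1

print(first_unique_char_indexed('timecomplexity'))  # 4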

12 Comments

Yours returns the index, not the character though :)
Woah you're right - I read too far into the function name -_- Eww this will make it not as nice :(
Actually, your algorithm is so fast that you could just return string.find(char)
Haha thanks for the save! I just really wanted to keep the duplicated[s] = s in duplicated and was almost heartbroken!
Do you need string.find? Why not enumerate the values of duplicated and return the index of the first False value?
4

Your algorithm is O(n²), because you have a "hidden" iteration over a slice of s inside the loop over s.

A faster algorithm would be:

def first_unique_character(s):
    good = {} # char:idx
    bad = set() # char
    for index, ch in enumerate(s):
        if ch in bad:
            continue
        if ch in good: # new repeat
            bad.add(ch)
            del good[ch]
        else:
            good[ch] = index

    if not good:
        return -1

    return min(good.values())

This is O(n) because the in lookups use hash tables, and the number of distinct characters should be much less than len(s).
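A quick sanity check (my addition, not from the answer):

print(first_unique_character('timecomplexity'))  # 4, the first 'c'
print(first_unique_character('aabbcc'))          # -1, every character repeats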

11 Comments

better but not quite O(n) yet because sorted is O(n log n) as per Python TimeComplexity
Replace sorted(good.values())[0] with min(good.values()) (use good.viewvalues() on Python 2.x to avoid temporary lists) and it's back to O(n). Mind you, the sorted step was O(n log n) in terms of number of non-repeated values, which is presumably much smaller than the total string size (which is the n in the other big-Os), so it's probably not huge, but then, min does what you want directly with work that's guaranteed to be irrelevant to the big-O cost (since the main algorithm is O(n) on string length, and the min step is O(n) on a smaller n).
It might still end up being O(n^2) though because of the in bad and in good searches
@miraculixx: Python's set and dict are implemented as hash tables; yes, the worst case lookup cost for a hash table is O(n), but actual hash collisions are impossible for single characters in Python (the full Unicode space is <21 bits in size right now, and I believe Python's hash codes are at least 32 bits on all builds, and 64 bits on 64 bit builds), so the collisions would only be bucket collisions with mismatched hashes, and as the dict grew, rehashing would tend to make collisions at one size disappear at larger sizes. In practice, the lookups would be O(1).
So really, you're right that O(n^2) is technically the worst case scenario, but it would be hard as hell if not impossible to trigger it, and it's usually not worth worrying about that possibility in Python, particularly modern Python 3 (where strings above a certain length use a cryptographically secure hashing algorithm with a per-interpreter-launch key, which means keys that collide in one session won't collide in another session, and renders DoS via intentionally triggering hash collisions a likely non-issue). Just treat dict and set as "O(1), but with a higher constant factor".
