2

I'd like to count how many times a big list contains elements in specific order. So for example if i have elements [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5] and i'd like to know how many times [1,2,3] are next to each other (answer is 4 in this case).

I was thinking on checking the indexes of number '3' (so currently it'd return [2,7,12,17]. Then i would iterate over that list, take elements in positions described in the list and check two positions in front of it. If they match '1' and '2' then add 1 to counter and keep looking. I believe this solution isn't really efficient and does not look nice, would there be a better solution?

2
  • Do you need to count overlapping matches as well? For example, how many times is [1, 1] contained in [1, 1, 1]? Commented Feb 25, 2017 at 14:29
  • No, in my case there are no overlapping matches. Commented Feb 27, 2017 at 7:17

6 Answers 6

5

Here's a generalized solution that works for subsequences of any size and for elements of any type. It's also very space-efficient, as it only operates on iterators.

from itertools import islice

def count(lst, seq):
    it = zip(*(islice(lst, i, None) for i in range(len(seq))))
    seq = tuple(seq)
    return sum(x == seq for x in it)
In [4]: count(l, (1, 2, 3))
Out[4]: 4

The idea is to create a sliding window iterator of width len(seq) over lst, and count the number of tuples equal to tuple(seq). This means that count also counts overlapping matches:

In [5]: count('aaa', 'aa')
Out[5]: 2
Sign up to request clarification or add additional context in comments.

2 Comments

Since OP's list seems to have ~6 million elements, it'd probably be a good idea to use islice.
@Rawing: yes, agreed.
2

For lists of ints, you could convert to strings and then use the count method:

>>> x = [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]
>>> y = [1,2,3]
>>> s = ',' + ','.join(str(i) for i in x) + ','
>>> t = ',' + ','.join(str(i) for i in y) + ','
>>> s.count(t)
4

If the items in the list contained strings which contain commas, this method could fail (as @schwobaseggl points out in the comments). You would need to pick a delimiter known not to occur in any of the strings, or adopt an entirely different approach which doesn't reduce to the string count method.

On Edit: I added a fix suggested by @Rawing to address a bug pointed out by @tobias_k . This turns out to be a more subtle problem than it first seems.

10 Comments

That turns [12, 3] and [1, 23] into the same entity.
Would it run in a normal time if my list has ~6 million elements?
@schwobaseggl I never claimed that my solution was a general purpose method for counting arbitrary sublists. It is a perfectly adequate quick-and-dirty approach which will cover most cases, including the example from the question. Still, you raise a good point. Care would have to be taken with the choice of delimiter in the case of lists of strings.
Even for int lists, this could fail if the search pattern is 1,2,3 and the list contains e.g. 11,2,33.
@JohnColeman It's not really a fatal objection. You just need to put commas at the start and the end of both strings.
|
1
x = [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]
y = [1,2,3]

count = 0
for i in range(len(x)-len(y)):
    if x[i:i+len(y)] == y:
        count += 1
print(count)

Comments

0

You could iterate the list and compare sublists:

In [1]: lst = [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]
In [2]: sub = [1,2,3]
In [3]: [i for i, _ in enumerate(lst) if lst[i:i+len(sub)] == sub]
Out[3]: [0, 5, 10, 15]

Note, however, that on a very large list and sublist, this is pretty wasteful, as it creates very many slices of the original list to compare against the sublist. In a slightly longer version, you could use all to compare each of the relevant positions of the list with those of the sublist:

In [5]: [i for i, _ in enumerate(lst) if all(lst[i+k] == e for k, e in enumerate(sub))]
Out[5]: [0, 5, 10, 15]

Comments

0

This strikes me as the longest common subsequence problem repeated every time until the sequence returned is an empty list.

I think that the best that you can do in this case for an efficient algorithm is O(n*m) where n is the number of elements in your big list and m is the number of elements in your small list. You of course would have to have an extra step of removing the small sequence from the big sequence and repeating the process.

Here's the algorithm:

  1. Find lcs(bigList, smallList)
  2. Remove the first occurrence of the smallList from the bigList
  3. Repeat until lcs is an empty list
  4. Return the number of iterations

Here is an implementation of lcs that I wrote in python:

def lcs(first, second):
    results = dict()
    return lcs_mem(first, second, results)

def lcs_mem(first, second, results):
    key = ""
    if first > second:
        key = first + "," + second
    else:
        key = second + "," + first

    if len(first) == 0 or len(second) == 0:
        return ''
    elif key in results:
        return results[key]
    elif first[-1] == second[-1]:
        result = lcs(first[:-1], second[:-1]) + first[-1]
        results[key] = result
        return result
    else:
        lcsLeft = lcs(first[:-1], second)
        lcsRight = lcs(first, second[:-1])

        if len(lcsLeft) > len(lcsRight):
            return lcsLeft
        else:
            return lcsRight

def main():
    pass


if __name__ == '__main__':
    main()

Feel free to modify it to the above algorithm.

Comments

0

One can define the efficiency of solution from the complexity. Search more about complexity of algorithm in google.

And in your case, complexity is 2n where n is number of elements.

Here is the solution with the complexity n, cause it traverses the list only once, i.e. n number of times.

def IsSameError(x,y):
    if (len(x) != len(y)):
        return False

    i = 0
    while (i < len(y)):
        if(x[i] != y[i]):
            return False
        i += 1

    return True

x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [1, 2, 3]

xLength = len(x)
yLength = len(y)
cnt = 0
answer = []
while (cnt+3 < xLength):
    if(IsSameError([x[cnt], x[cnt+1], x[cnt+2]], y)):
        answer.append(x[cnt])
        answer.append(x[cnt+1])
        answer.append(x[cnt + 2])
        cnt = cnt + 3
    else:
        cnt = cnt + 1


print answer

2 Comments

It might have a lower complexity but it is hard to say that it is better without bench marking it. My reasoning being in Python for loops run at c level speeds while While loops do not. Which means looping with a for loop in Python is faster. Likewise you are creating memory allocation which with your appends, which is also a heavy process. As such the trade in complexity may not be worth it in this scenario.
I don't know the difference between for loop and while loop. Can you please give the reason why it is slower. And if you are comparing my answer with other answer then I can skip the append part and we can also replace while loop with for loop. But yes great knowledge.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.