Most computationally efficient way to count consecutive repeating values

Question

Say I have a boolean array:

import numpy as np
a2 = np.array([False, False, True, False, False, True, True, True, False, False])

I want an array containing the length of each run of consecutive True values.

Desired result:

np.array([1, 3])

Current solution:

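sums = []
current_sum = 0
prev = False
for boo in a2:
    if boo:
        current_sum += 1
        prev = True
    if prev and not boo:
        # A run of True values just ended: record its length.
        sums.append(current_sum)
        current_sum = 0
    if not boo:
        prev = False
if current_sum:
    # Record a run that extends to the end of the array.
    sums.append(current_sum)
np.array(sums)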

This may not be the most computationally efficient approach. It seems like np.cumsum could be used creatively here, but I can't think of a solution.

score 2 · Accepted Answer · 2022-02-03 06:58:10Z

2

Another way using np.where + np.diff to identify the split locations:

out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]

np.split is slow, so we can replace it with zip in a list comprehension and walk over an array of indices. Also, instead of sum, we could index the array and use len:

idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
out = [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

Output:

[1, 3]
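A fully vectorized variant is also possible. The following is a minimal sketch, not part of the answer above and not included in the benchmark below; the function name run_lengths is made up for illustration. It pads the array with a zero on each side so that every run has both a rising and a falling edge, then subtracts the start positions from the end positions. Because it never indexes idx[-1], an array with no True values simply yields an empty result.

```python
import numpy as np

def run_lengths(a):
    """Return the length of each run of consecutive True values in a 1-D boolean array."""
    # Pad with 0 on both sides so runs touching either end still have both edges.
    padded = np.concatenate(([0], np.asarray(a, dtype=np.int8), [0]))
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # index where each run begins
    ends = np.flatnonzero(edges == -1)    # index one past where each run ends
    return ends - starts

a2 = np.array([False, False, True, False, False, True, True, True, False, False])
print(run_lengths(a2))  # [1 3]
```

The result is a NumPy array rather than a list, which matches the np.array([1, 3]) form requested in the question.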

Performance comparison:

import perfplot
import numpy as np
import itertools
import random

def diff_where_split_sum(a2):
    return [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]

def flatnonzero_split_if_sum(a2):
    return [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]

def groupby_if_sum(a2):
    return [sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key]

def diff_where_slice_index_len(a2):
    idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
    return [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

perfplot.show(
    setup=lambda n: np.array(random.choices([True, False], k=10) * n),
    kernels=[
        lambda arr: diff_where_split_sum(arr),
        lambda arr: flatnonzero_split_if_sum(arr),
        lambda arr: groupby_if_sum(arr),
        lambda arr: diff_where_slice_index_len(arr)
    ],
    labels=['diff_where_split_sum', 'flatnonzero_split_if_sum', 
            'groupby_if_sum', 'diff_where_slice_index_len'],
    n_range=[2 ** k for k in range(20)],
    equality_check=np.allclose,  
    xlabel='~len(arr)'
)
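Before trusting timings, it can help to confirm that the kernels agree on edge cases. The sketch below is an illustration only: the helper safe_call and the three test arrays are hypothetical, and only three of the four kernels are repeated to keep it short.

```python
import itertools
import numpy as np

def diff_where_split_sum(a2):
    idx = np.where(np.diff(a2.astype(int), prepend=0) == 1)[0]
    return [ar.sum() for ar in np.split(a2, idx)[1:]]

def groupby_if_sum(a2):
    return [sum(1 for _ in group) for key, group in itertools.groupby(a2) if key]

def diff_where_slice_index_len(a2):
    idx = np.where(np.diff(a2.astype(int), prepend=0) == 1)[0]
    return [len(a2[i:j][a2[i:j]]) for i, j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

def safe_call(fn, arr):
    """Run a kernel and return its result as plain ints, or the string 'IndexError'."""
    try:
        return [int(x) for x in fn(arr)]
    except IndexError:
        return "IndexError"

# Hypothetical edge cases, chosen for illustration.
cases = {
    "leading run": np.array([True, True, False, True]),
    "trailing run": np.array([False, True, True]),
    "no True values": np.array([False, False, False]),
}
kernels = [diff_where_split_sum, groupby_if_sum, diff_where_slice_index_len]

report = {
    name: {fn.__name__: safe_call(fn, arr) for fn in kernels}
    for name, arr in cases.items()
}
for name, row in report.items():
    print(name, row)
```

With these inputs the kernels agree on the first two cases, while diff_where_slice_index_len raises IndexError on the all-False array, because idx is empty there and idx[-1] fails.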

edited Feb 3, 2022 at 6:58

answered Feb 2, 2022 at 1:44

user7864386


5 Comments

user17242583 Over a year ago

Instead of len(ar[ar]) you can do ar.sum()

vaeVictis Over a year ago

I edited my answer with the code for testing performance. Your version is the fastest proposed so far.

user7864386 Over a year ago

@SantoshGupta7 please see the edit. There was a mistake in my answer yesterday that was corrected today. Also there's another method that seems to be much faster than the other options.

user7864386 Over a year ago

@vaeVictis I did a test and it turns out, your version is virtually the same as my version for large arrays and better for small arrays. However, it turns out there's another option that's much faster than all of them.

vaeVictis Over a year ago

@enke very nice test! I didn't know perfplot, awesome!

user17242583 · Accepted Answer · 2022-02-02 01:34:23Z

1

You could use list comprehension with np.split + np.flatnonzero:

l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]

Output:

>>> l
[1, 3]
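To see why the `if l.sum() > 0` filter is needed, it may help to look at the pieces np.split produces. This is an illustrative sketch using the array from the question; the variable names cut_points, pieces and piece_sums are made up here.

```python
import numpy as np

a2 = np.array([False, False, True, False, False, True, True, True, False, False])

# Split positions are the indices of the False entries.
cut_points = np.flatnonzero(~a2)
pieces = np.split(a2, cut_points)

# Every piece after the first starts with a False value, followed by any True run.
piece_sums = [int(p.sum()) for p in pieces]
print(cut_points.tolist())  # [0, 1, 3, 4, 8, 9]
print(piece_sums)           # [0, 0, 1, 0, 3, 0, 0]

# Dropping the zero sums leaves only the run lengths.
print([s for s in piece_sums if s > 0])  # [1, 3]
```

Computing each sum once, as in this sketch, also avoids calling sum() twice per piece.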

answered Feb 2, 2022 at 1:34

user17242583


vaeVictis · Accepted Answer · 2022-02-02 17:13:25Z

1

Another solution with itertools:

import numpy as np
import itertools

a2= np.array([False, False, True, False, False, True, True, True, False, False])

foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]

print(foo)

Output:

[1, 3]
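For readers unfamiliar with itertools.groupby, this small sketch shows the intermediate result: one (key, length) pair per run of equal consecutive values. The names runs and true_runs are chosen for illustration only.

```python
import itertools
import numpy as np

a2 = np.array([False, False, True, False, False, True, True, True, False, False])

# groupby yields one (key, group) pair per run of equal consecutive values.
runs = [(bool(key), sum(1 for _ in group)) for key, group in itertools.groupby(a2)]
print(runs)  # [(False, 2), (True, 1), (False, 2), (True, 3), (False, 2)]

# Keeping only the runs whose key is True gives the desired counts.
true_runs = [length for key, length in runs if key]
print(true_runs)  # [1, 3]
```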

Edit:

Since you asked about performance too, I wrote this test to compare the three solutions proposed so far:

#! /usr/bin/env python3
#-*- coding: utf-8 -*-

import itertools
import numpy as np
import timeit

a2 = np.random.choice([True, False], 1000000)


start_time = timeit.default_timer()
out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]
print(timeit.default_timer() - start_time)

print(out == foo == l)

The output consistently looks like this:

$ python3 test.py 
1.702160262999314
2.1189031369995064
4.760941952999929
True

So the fastest solution is the one proposed by enke, followed by mine, and then the one proposed by richardec.
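Single-shot timings like the ones above can be noisy. As a rough sketch, timeit.repeat can run each candidate several times and report the best run. The fixed seed, the smaller array size, and the function names below are assumptions made for illustration; they are not part of the original test, and no timings are claimed here.

```python
import itertools
import timeit
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, an assumption for repeatability
a2 = rng.random(20_000) < 0.5    # smaller than the test above, to keep the run short

def split_sum(a):
    idx = np.where(np.diff(a.astype(int), prepend=0) == 1)[0]
    return [int(ar.sum()) for ar in np.split(a, idx)[1:]]

def groupby_count(a):
    return [sum(1 for _ in g) for k, g in itertools.groupby(a) if k]

def flatnonzero_split(a):
    return [int(p.sum()) for p in np.split(a, np.flatnonzero(~a)) if p.sum() > 0]

candidates = {
    "split_sum": split_sum,
    "groupby_count": groupby_count,
    "flatnonzero_split": flatnonzero_split,
}

# Check that all candidates agree before comparing their speed.
results = {name: fn(a2) for name, fn in candidates.items()}
assert results["split_sum"] == results["groupby_count"] == results["flatnonzero_split"]

for name, fn in candidates.items():
    # Best of three runs; the minimum is less sensitive to background noise.
    best = min(timeit.repeat(lambda: fn(a2), number=1, repeat=3))
    print(f"{name}: {best:.4f} s")
```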

edited Feb 2, 2022 at 17:13

answered Feb 2, 2022 at 1:42

vaeVictis


4 Comments

user17242583 Over a year ago

Perhaps len(list(group)) is more readable than sum( 1 for _ in group )?

vaeVictis Over a year ago

@richardec perhaps... we will never know :)

vaeVictis Over a year ago

@richardec seriously speaking, I wrote a performance test (see the edit in my answer). It seems to me that len(list(group)) slows things down. Do you want to test it too?

user17242583 Over a year ago

actually, I intended to test it myself but I forgot about it. Nice test :) So, haha, mine's the slowest? :D
