
Say I have a boolean array

a2= np.array([False, False, True, False, False, True, True, True, False, False])

I want an array containing the length of each run of consecutive True values.

Desired result:

np.array([1, 3])

Current solution:

sums = []
current_sum = 0
prev = False
for boo in a2:
    if boo:
        current_sum+=1
        prev = True
    if prev and not boo:
        sums.append(current_sum)
        current_sum = 0
    if not boo:
        prev = False
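if current_sum:  # flush a run of True values that reaches the end of the array
    sums.append(current_sum)
np.array(sums)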

This may not be the most computationally efficient approach. It seems like np.cumsum could be used creatively here, but I can't think of a solution.

3 Answers


Another way using np.where + np.diff to identify the split locations:

out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]

np.split is slow, so we can replace it with zip in a list comprehension and walk over an array of indices. Also, instead of sum, we could index the array and use len:

idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
out = [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

Output:

[1, 3]
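If the list comprehension itself becomes the bottleneck, the same rising-edge idea can be taken one step further so that no Python-level loop remains. The following is only a sketch and is not one of the functions benchmarked in this answer: the name true_run_lengths and the padding step are assumptions of mine. It pads the array with False on both sides so that every run has both a start and an end, then subtracts the start positions from the end positions.

```python
import numpy as np

def true_run_lengths(mask):
    """Return the length of each run of consecutive True values (hypothetical helper)."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both ends so runs touching either edge still get a start and a stop.
    padded = np.concatenate(([False], mask, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # False -> True transitions
    stops = np.flatnonzero(edges == -1)   # True -> False transitions
    return stops - starts

a2 = np.array([False, False, True, False, False, True, True, True, False, False])
print(true_run_lengths(a2))  # [1 3]
```

Because of the padding, this sketch also returns an empty array for an all-False input instead of raising on idx[-1].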

Performance comparison:

import perfplot
import numpy as np
import itertools
import random

def diff_where_split_sum(a2):
    return [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]

def flatnonzero_split_if_sum(a2):
    return [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]

def groupby_if_sum(a2):
    return [sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key]

def diff_where_slice_index_len(a2):
    idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
    return [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

perfplot.show(
    setup=lambda n: np.array(random.choices([True, False], k=10) * n),
    kernels=[
        lambda arr: diff_where_split_sum(arr),
        lambda arr: flatnonzero_split_if_sum(arr),
        lambda arr: groupby_if_sum(arr),
        lambda arr: diff_where_slice_index_len(arr)
    ],
    labels=['diff_where_split_sum', 'flatnonzero_split_if_sum', 
            'groupby_if_sum', 'diff_where_slice_index_len'],
    n_range=[2 ** k for k in range(20)],
    equality_check=np.allclose,  
    xlabel='~len(arr)'
)


5 Comments

Instead of len(ar[ar]) you can do ar.sum()
I edited my answer with the code for testing the performance. Your version is the fastest proposed so far.
@SantoshGupta7 please see the edit. There was a mistake in my answer yesterday that was corrected today. Also there's another method that seems to be much faster than the other options.
@vaeVictis I did a test and it turns out, your version is virtually the same as my version for large arrays and better for small arrays. However, it turns out there's another option that's much faster than all of them.
@enke very nice test! I didn't know perfplot, awesome!

You could use list comprehension with np.split + np.flatnonzero:

l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]

Output:

>>> l
[1, 3]
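To see why the `if l.sum() > 0` filter is needed, it may help to look at the intermediate pieces. This is a small illustration on the array from the question; the variable names cut_points, pieces and piece_sums are mine and only serve the example.

```python
import numpy as np

a2 = np.array([False, False, True, False, False, True, True, True, False, False])

cut_points = np.flatnonzero(~a2)      # indices of the False entries
pieces = np.split(a2, cut_points)     # every piece after the first starts at a False
piece_sums = [int(p.sum()) for p in pieces]

print(cut_points)   # [0 1 3 4 8 9]
print(piece_sums)   # [0, 0, 1, 0, 3, 0, 0]
```

Every False value starts its own piece, so most pieces contain no True at all and are discarded by the filter. On arrays with many False values this creates a lot of throwaway pieces, which may be part of the cost of this approach.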


Another solution with itertools:

import numpy as np
import itertools

a2= np.array([False, False, True, False, False, True, True, True, False, False])

foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]

print(foo)

Output:

[1, 3]
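For readers unfamiliar with itertools.groupby, here is a small sketch of what it yields on this array before the `if key` filter is applied. The names runs and true_runs are only for illustration.

```python
import itertools
import numpy as np

a2 = np.array([False, False, True, False, False, True, True, True, False, False])

# groupby yields one (key, group) pair per run of equal consecutive values.
runs = [(bool(key), len(list(group))) for key, group in itertools.groupby(a2)]
print(runs)  # [(False, 2), (True, 1), (False, 2), (True, 3), (False, 2)]

# Keeping only the runs whose key is True gives the desired result.
true_runs = [length for key, length in runs if key]
print(true_runs)  # [1, 3]
```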

Edit:

Since you also asked about performance, I wrote this test script to compare the three solutions proposed so far:

#! /usr/bin/env python3
#-*- coding: utf-8 -*-

import itertools
import numpy as np
import timeit

a2 = np.random.choice([True, False], 1000000)


start_time = timeit.default_timer()
out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]
print(timeit.default_timer() - start_time)

print(out == foo == l)

The output always looks like this:

$ python3 test.py 
1.702160262999314
2.1189031369995064
4.760941952999929
True

So the fastest solution is the one proposed by enke, followed by mine, and then the one proposed by richardec.
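A single wall-clock measurement per solution can be noisy, so the ranking is easier to trust when each candidate is run several times. Below is a minimal sketch of that idea using timeit.repeat and taking the best of three runs. The function names, the fixed seed and the smaller array size are assumptions of mine, not part of the test above, and the timings it prints will differ from machine to machine.

```python
import itertools
import timeit
import numpy as np

rng = np.random.default_rng(0)      # fixed seed (an assumption) for repeatable input
a2 = rng.random(100_000) < 0.5      # smaller than the 1,000,000 elements used above

def diff_split(a):
    idx = np.where(np.diff(a.astype(int), prepend=0) == 1)[0]
    return [int(ar.sum()) for ar in np.split(a, idx)[1:]]

def groupby_count(a):
    return [sum(1 for _ in group) for key, group in itertools.groupby(a) if key]

def flatnonzero_split(a):
    return [int(l.sum()) for l in np.split(a, np.flatnonzero(~a)) if l.sum() > 0]

candidates = {
    "diff_split": diff_split,
    "groupby_count": groupby_count,
    "flatnonzero_split": flatnonzero_split,
}

reference = groupby_count(a2)
for name, func in candidates.items():
    # Check correctness first, then keep the fastest of three single runs.
    assert func(a2) == reference
    best = min(timeit.repeat(lambda: func(a2), number=1, repeat=3))
    print(f"{name}: {best:.4f} s")
```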

4 Comments

Perhaps len(list(group)) is more readable than sum( 1 for _ in group )?
@richardec perhaps... we will never know :)
@richardec seriously speaking, I wrote a performance test (see the edit in my answer). It seems to me that len(list(group)) slows things down. Do you want to test it too?
actually, I intended to test it myself but I forgot about it. Nice test :) So, haha, mine's the slowest? :D
