
I'm looking for ways to speed up (or replace) my algorithm for grouping data.

I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)

This sounds kind of awkward, so have an example:

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]

# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *

Note that the elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays also had the same values there (namely 10 and 21). The same goes for the elements at indices 3 and 5 (which both get 3).

The algorithm has to deal with an arbitrary number of (equally-sized) input arrays, and also return, for each resulting number, which values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)

Here is my current algorithm:

import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1 # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
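
For reference, running this on the test values above reproduces the expected outcome, and group_meanings records, for each group ID, which value each input array had there:

group, group_meanings = groupify(values)
# group          -> array([0, 1, 2, 3, 0, 3, 4])
# group_meanings -> {0: {0: 10, 1: 21}, 1: {0: 11, 1: 21}, 2: {0: 10, 1: 22},
#                    3: {0: 11, 1: 22}, 4: {0: 10, 1: 23}}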

Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.

I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?

4 Answers


Cracked it finally with a vectorized solution! It was an interesting problem. The problem is that we have to form a pair from the corresponding array elements of the list, and then tag each such pair based on its uniqueness among the other pairs. So, we can use np.unique, abusing all its optional arguments, and finally do some additional work to keep the order for the final output. Here's the implementation, basically done in three stages -

# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)

# Do the heavy work with np.unique to give us :
#   1. Starting indices of unique elems, 
#   2. Array that has unique IDs for each element in idx, and 
#   3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                        return_inverse=True,return_counts=True)

# Best part happens here : Use mask to ignore the repeated elems and re-tag 
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
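
As a quick check on the sample values from the question, the intermediate linear indices and the final output come out as (shown here for illustration):

# idx -> array([261, 285, 262, 286, 261, 286, 263])
# out -> array([0, 1, 2, 3, 0, 3, 4])   # matches the expected outcome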

Runtime test

Let's compare the proposed vectorized approach against the original code. Since the proposed code gets us the group IDs only, for a fair benchmark let's trim off the parts of the original code that aren't needed for that. So, here are the function definitions -

def groupify(values):  # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):

        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]

        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values):  # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr,arr.max(1)+1)
    _,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
                                        return_inverse=True,return_counts=True)    
    mask = ~np.in1d(unqID,np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]

Runtime results on a list with large arrays -

In [345]: # Input list with random elements
     ...: values = [item for item in np.random.randint(10,40,(10,10000))]

In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True

In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop

In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop

7 Comments

Could you explain a bit more what actually happens here? Especially: what does np.ravel_multi_index do? (The docs aren't very clarifying to me.) Why do you call it on arr.max(1)+1?
@acdr Basically it considers each pair as an indexing tuple. So, the first pair from the sample, (10, 21), on a 2D grid of an appropriate shape corresponds to one linearly indexed number. Let's say we take a grid of shape (100, 100). Then the linear index would be 10*100 + 21 = 1021. We do this for all pairs in one go with ravel_multi_index. The shape of the 2D grid is taken as the maximum of the first pair elements + 1 by the maximum of the second pair elements + 1, which is what arr.max(1)+1 computes. Hope that made sense. The best I could suggest would be looking into linear indexing as also used in MATLAB.
That makes sense. In that case, that function does all I need. (I don't necessarily need the IDs to start at 0 and increment by 1, as long as they're unique.) I would guess then, though, that this solution wouldn't work for non-integer arrays? (E.g. suppose I have a string array, the values wouldn't map to a grid nicely at all.)
@acdr You can use _,values_elems = np.unique(input_string,return_inverse=True) to give ourselves numeric labels, which would correspond to the elements in values. You might have to use .ravel() and reshaping in between.
Thanks! vstacking the [np.unique(v,return_inverse=True)[1] for v in values] instead of values indeed solves the "other-data-type" problem. I accepted this answer (despite also liking Eelco's) because it's not a separate package, and the timeit results are appreciated.
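
A minimal sketch of that dtype workaround, using made-up string arrays (the names values_str and encoded are just for illustration); the encoded arrays can be fed straight into groupify_vectorized defined above:

import numpy as np

# Hypothetical non-integer input:
values_str = [
    np.array(['a', 'b', 'a', 'b', 'a', 'b', 'a']),
    np.array(['x', 'x', 'y', 'y', 'x', 'y', 'z']),
    ]

# Map each array to integer labels first, then group as usual
encoded = [np.unique(v, return_inverse=True)[1] for v in values_str]
print(groupify_vectorized(encoded))   # -> [0 1 2 3 0 3 4]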

This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:

import numpy as np

# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
    ]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])

# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]

# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)

# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False

# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)

# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]):  # this element has not been labelled yet
        labelled[k] = val      # so label it
        labelled[row] = val    # and label every element satisfying the test
        val += 1

print(labelled)
# outputs [ 0.  1.  2.  3.  0.  3.  4.]

It is about 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup will be better for more arrays.

3 Comments

It's a clever solution, and I like it, but this is blowing up for large arrays for me. (If each of my input arrays is of length N, then the val[:, np.newaxis] part creates an N-by-N array. If I have a million elements, and each boolean value uses up one byte (which I believe it does in numpy), then I'll end up needing a terabyte of RAM for this one array.) :(
Ah yes, you didn't mention anything about having a memory concern :) There is always a tradeoff between speed and memory efficiency. Your version is more memory conservative, mine should be faster (especially with big data). (Note BTW that it's not the val[:,np.newaxis] that results in the memory allocation (simply a broadcasting operation), but the subsequent == comparison which forces the broadcasted arrays to actually expand. Not to be pedantic, but perhaps this points to further possible optimizations.)
Actually if memory is a concern I suspect your current algorithm is close to optimal. Perhaps you could do something with np.unique? It supports an option return_inverse, might be useful.
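
For what it's worth, a minimal sketch of that np.unique suggestion (it is essentially the same encoding trick as the accepted answer; the labels come out numbered in sorted-value order rather than first-appearance order, which is fine as long as equal columns get equal labels):

import numpy as np

# Encode each column of values as one integer, then label with return_inverse
arr = np.vstack(values)
idx = np.ravel_multi_index(arr, arr.max(1) + 1)
_, labels = np.unique(idx, return_inverse=True)
# labels -> array([0, 3, 1, 4, 0, 4, 2]); same grouping as the expected
# outcome, just with a different numbering.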

The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array set operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner:

import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)

The above will work for arbitrary type combinations, but alternatively, the below will be even more efficient if values is a list of many arrays of the same dtype:

unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)

Comments


If I understand correctly, you are trying to hash values according to columns. It's better to convert the columns into arbitrary values by themselves, and then find the hashes from them.

So you actually want to hash on list(np.array(values).T).

This functionality is already built into Pandas. You don't need to write it. The only problem is that it takes a list of values without further lists within it. In this case, you can just convert each inner array to a string with map(str, list(np.array(values).T)) and factorize that!

>>> import pandas as pd
>>> pd.factorize(list(map(str, np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
 array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))

I have converted your list of arrays into an array, and then into a string ...

3 Comments

As far as I could tell, pandas.factorize only works on 1-d arrays. You converted each "row" into a string, but in the case of large arrays, that's going to be tremendously slow.
That is correct. However, I am unsure if there is any implementation in raw Python that will beat Pandas. Maybe we can use Cython to speed up raw Python code. It all comes down to how big the actual list of arrays is going to be ...
Depends on how many unique combinations there are compared to how big each array is. For very large arrays with few combinations, my solution will beat yours. (Just copy my input arrays a million times. You're going to have a million calls to str.)
