
I have a large number of binary encoded vectors like:

[0, 1, 0, 0, 1, 0]  # but each one has many more elements

and they are all stored in a 2D NumPy array like:

[
 [0, 1, 0, 0, 1, 0],
 [0, 0, 1, 0, 0, 1],
 [0, 1, 0, 0, 1, 0],
]

I want to get a frequency table of each label set. So, in this example, the frequency table will be:

[2,1] 

Because the 1st label set has two appearances and the 2nd label set just one.

In other words, I want to implement something like itemfreq from SciPy or histogram from NumPy, but counting whole rows (lists) rather than single elements.

Now I have the following code implemented:

import numpy as np

def get_label_set_freq_table(labels):
    uniques = np.empty_like(labels)
    freq_table = np.zeros(shape=labels.shape[0])
    equal = False

    for idx,row in enumerate(labels):
        for lbl_idx,label_set in enumerate(uniques):
            if np.array_equal(row,label_set):
                equal = True
                freq_table[lbl_idx] += 1
                break
        if not equal:
            uniques[idx] = row
            freq_table[idx] += 1
        equal = False

    return freq_table

where labels is the array of binary encoded vectors.

It works well, but it's extremely slow when the number of vectors is large (>58,000) and the number of elements in each one is also large (>8,000).

How can this be done in a more efficient way?

  • That doesn't look one-hot to me. Commented Jan 5, 2018 at 15:38
  • You are right, I'll edit the question to say "binary" vectors. Thanks. @Divakar also made the same observation. Commented Jan 5, 2018 at 15:58

1 Answer

I am assuming you meant an array with 1s and 0s only. For those, we can reduce each row to a scalar with binary scaling and then use np.unique -

In [52]: a
Out[52]: 
array([[0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]])

In [53]: s = 2**np.arange(a.shape[1])

In [54]: a1D = a.dot(s)

In [55]: _, start, count = np.unique(a1D, return_index=1, return_counts=1)

In [56]: a[start]
Out[56]: 
array([[0, 1, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 1]])

In [57]: count
Out[57]: array([2, 1])
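
One caveat with the binary scaling above: every row has to fit into a fixed-width integer, so with 64-bit integers the powers of two overflow once there are more than 63 columns, and the question mentions more than 8,000. A minimal sketch of one way around that, assuming the array holds only 0s and 1s: pack each row into bytes with np.packbits and count the byte strings. The function name label_set_counts is made up for this illustration, and the counts come back in order of first appearance rather than sorted.

```python
from collections import Counter

import numpy as np


def label_set_counts(labels):
    """Count identical rows of a 0/1 array, in order of first appearance."""
    # np.packbits needs uint8 (or bool) input; assumes values are only 0 or 1.
    labels = np.asarray(labels, dtype=np.uint8)
    # Each row becomes ceil(n_cols / 8) bytes, so any row width is supported.
    packed = np.packbits(labels, axis=1)
    # Counter keeps insertion order, so counts follow first appearance.
    counts = Counter(row.tobytes() for row in packed)
    return np.array(list(counts.values()))


a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
freq = label_set_counts(a)
print(freq)  # [2 1]
```

This still loops over the rows in Python, but each step only hashes a short byte string, so it avoids the row-by-row array comparisons of the original function.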

Here's a generalized one that works on the rows directly, using the axis parameter of np.unique -

In [33]: unq_rows, freq = np.unique(a, axis=0, return_counts=1)

In [34]: unq_rows
Out[34]: 
array([[0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0]])

In [35]: freq
Out[35]: array([1, 2])
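
Note that np.unique returns the unique rows in sorted order, so the counts come out as [1, 2] rather than the [2, 1] shown in the question. If the order of first appearance matters, a possible sketch of a drop-in replacement for the question's function is shown below. It assumes a NumPy version that supports the axis argument of np.unique, and reordering by the returned first-occurrence indices is my own addition, not part of the answer above.

```python
import numpy as np


def get_label_set_freq_table(labels):
    """Frequency of each distinct row, ordered by first appearance."""
    labels = np.asarray(labels)
    # first_idx[i] is the index of the first row equal to the i-th unique row.
    _, first_idx, counts = np.unique(
        labels, axis=0, return_index=True, return_counts=True
    )
    # Reorder the counts so they follow the order in which rows first appear.
    order = np.argsort(first_idx)
    return counts[order]


a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
print(get_label_set_freq_table(a))  # [2 1]
```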

1 Comment

I forgot about the axis parameter... Wow! Your solution is great! So efficient and so elegant! Thank you very much! Checked and works like a charm, accepted answer!
