2

I have 100 large arrays > 250,000 elements each. I want to find common values that are found in these arrays. I know that there are not going to be values that are found in all 100 arrays, but a small number values will be found in multiple arrays (I suspect 10-30%). I want to find which values are found with the highest frequency across these arrays. (Side point: arrays have no duplicates)

I know that I can loop through the arrays and eventually find them, but that will take a while. I also know about the np.intersect1d function, but I that only gives values that are found within all of the arrays, whereas I'm looking for values that are only going to be in around 20 of the 100 arrays.

My best bet is use the np.intersect1d function and loop through all possible combinations of the arrays, which would definitely take a while, but not as long as simply looping through all 250,000 x 100 values. Example:

array_1 = array([1.98,2.33,3.44,,...11.1)
array_2 = array([1.26,1.49,4.14,,...9.0)
array_2 = array([1.58,2.33,3.44,,...19.1)
array_3 = array([4.18,2.03,3.74,,...12.1)
.
.
. 
array_100= array([1.11,2.13,1.74,,...1.1)

No values in all 100, Is there a value that can be found in 30 different arrays?

2
  • Are all the arrays the same size? Can you have one large 250k x 100 array? Commented Oct 24, 2018 at 3:37
  • No they are not, usually range from 220,000-280,000 Commented Oct 24, 2018 at 3:39

1 Answer 1

1

You can either use np.unique with the return_counts keyword, or a vanilla Python Counter.

The first option works if you can concatenate your arrays into a single 250k x 100 monolith, or even string them out over after the other:

unq, counts = np.unique(monolith, return_counts=True)
ind = np.argsort(counts)[::-1]
unq = unq[ind]
counts = counts[ind]

This will leave you with an array containing all the unique values, and the frequency with which they occur.

If the arrays have to remain separate, use collections.Counter to accomplish the same task. In the following, I assume that you have a list containing your arrays. It would be very pointless to have a hundred individually named variables:

c = Counter() for arr in arrays: c.update(arr)

Now c.most_common will give you the most common elements and their counts.

Sign up to request clarification or add additional context in comments.

3 Comments

Great Idea! Thank you for your response, I'll give it a try!
@Danny. Updated
The proper way to thank is by selecting the answer by clicking on the check mark next to it.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.