
I have three separate one-dimensional NumPy arrays of equal length that I am using as the x, y and c parameter inputs to the matplotlib scatter function without a problem. Some of the plot coordinates contained in the x and y arrays are duplicated. Where coordinates are duplicated, I would like to plot the sum of all the related c parameter (data) values.

Is there a built-in matplotlib way of doing this? Alternatively, I think I need to remove all the duplicated coordinates from the x and y arrays, along with the associated values from the data array. But before doing this, those data values must be added to the data value of the one remaining coordinate pair.

A trivial example is shown below, where the duplicated coordinates have been removed and their data values added to the one remaining coordinate pair.

Before
x =    np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y =    np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])

After
x =    np.array([3, 7, 12, 3, 56, 4, 2, 65, 87, 12, 9, 7])
y =    np.array([7, 24, 87, 9, 65, 43, 54, 3, 8, 34, 23, 6])
data = np.array([6, 45, 4, 72, 7, 45, 78, 82, 12, 9, 32, 5])

I have found an algorithm on Stackoverflow that removes the duplicate coordinates from the x and y arrays in seconds using Python zip and a set. However, my attempt to extend this to the data array took an hour to execute and I don't have the experience to improve on this. The arrays are typically 600,000 elements long.
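
Roughly, the kind of extension I mean (an illustrative sketch, not my exact code): once the unique pairs are known, it scans the full arrays once per unique pair, which is what makes it so slow on 600,000 elements.

# the fast zip/set step that removes duplicate coordinates
unique_pairs = set(zip(x, y))

# a per-pair scan like this touches every element once per unique
# pair, so the total work grows quadratically with the array size
data_summed = [data[(x == ux) & (y == uy)].sum() for ux, uy in unique_pairs]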

2 Comments
  • To get the unique elements of a NumPy array, you can use np.unique(); I am not sure how fast this is compared to set(). For building the data array where you sum the values of the repeating elements, I can only think of using a loop to find the indices of the repeating pairs and summing the corresponding data values (a vectorized sketch of this idea follows these comments). Commented Oct 7, 2024 at 20:19
  • It would be helpful if you could add the code you tried, plus a link to the Stack Overflow thread you mentioned. Commented Oct 8, 2024 at 9:13
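
Following up on the np.unique() suggestion above, a fully vectorized sketch that both de-duplicates and sums, using return_inverse together with np.bincount (shown on the question's example arrays; note the pairs come back lexicographically sorted rather than in first-appearance order):

import numpy as np

x = np.array([3, 7, 12, 3, 56, 4, 2, 3, 65, 87, 12, 3, 9, 7, 87])
y = np.array([7, 24, 87, 9, 65, 43, 54, 9, 3, 8, 34, 9, 23, 6, 8])
data = np.array([6, 45, 4, 25, 7, 45, 78, 4, 82, 3, 9, 43, 32, 5, 9])

# treat each (x, y) pair as a row and find the unique rows;
# inverse[i] gives the row in unique_pairs that pair i maps to
unique_pairs, inverse = np.unique(np.column_stack((x, y)), axis=0, return_inverse=True)

# bincount with weights sums the data values sharing an inverse index
data_after = np.bincount(inverse.ravel(), weights=data)

x_after = unique_pairs[:, 0]
y_after = unique_pairs[:, 1]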

1 Answer


The following attempt is pretty fast even for much larger datasets than the one you are dealing with. I tested a size of 6,000,000 for x, y and data, and it still finished within about 10 s, not using a particularly powerful machine.

What is time-consuming, though, is printing the arrays once they reach a certain size.

import numpy as np

# generate some test data
x = np.random.randint(0, 100_000, 600_000)
y = np.random.randint(0, 100_000, 600_000)
data = np.random.randint(0, 10_000, 600_000)

# initialize the result dict;
# set(zip()) makes sure we are dealing only with unique x/y pairs
data_tmp = {key: 0 for key in set(zip(x, y))}

# accumulate the sum for each unique x/y pair
for key, val in zip(zip(x, y), data):
    data_tmp[key] += val

# translate the dict back into your cleaned-up arrays
x_after = np.array([a for a, _ in data_tmp.keys()])
y_after = np.array([b for _, b in data_tmp.keys()])
data_after = np.array(list(data_tmp.values()))
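
For completeness, the cleaned-up arrays can then be fed to scatter exactly as before; a minimal sketch:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# c now carries the summed value for each unique coordinate pair
ax.scatter(x_after, y_after, c=data_after)
plt.show()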

As a side note: checking the code against your example, I noticed the data in your original "After" array was wrong; the second 4 needed to be 82 (the example above reflects the corrected value).


1 Comment

The typo was an error on my part. I have tried the suggested code and can report that it is extremely fast: under a second for three 600,000-element arrays. I have also used it to filter and plot geographic information on a basemap, with the expected results.
