
I want to sort an array based on the values of two columns and a condition on the third. This is my array:

import numpy as np

my_array = np.array([[1., 2., 5.1],
                     [1., 1., 5.],
                     [2., 2., 2.],
                     [2., 1., 2.],
                     [2., 2., 5.],
                     [3., 2., 2.5],
                     [3., 1., 2.5],
                     [2., 1., 5.]])

I need to sort it by the first and second columns, subject to a condition on the third column. I tried this method:

my_sorted_array = my_array[np.lexsort((my_array[:, 1], my_array[:, 0]))]

but it does not take my third column into account. It gives me:

result = np.array([[1., 1., 5.],
                   [1., 2., 5.1],
                   [2., 1., 2.],
                   [2., 1., 5.],
                   [2., 2., 2.],
                   [2., 2., 5.],
                   [3., 1., 2.5],
                   [3., 2., 2.5]])

I want to have the following output:

my_sorted_array = np.array([[1., 1., 5.],
                            [1., 2., 5.1],
                            [2., 1., 5.],
                            [2., 2., 5.],
                            [2., 1., 2.],
                            [2., 2., 2.],
                            [3., 1., 2.5],
                            [3., 2., 2.5]])

I also tried to set coefficients to the third column using this method:

sort_func = my_array[:, 0] * c1 + my_array[:, 1] * c2 + my_array[:, 2] * c3  # c1, c2 and c3 are coefficients
sort_index = np.argsort(sort_func)

This method is also time-consuming because I have to tune the coefficients for each new data set. Is it possible to put an if-condition in the sorting? How can I rearrange the result into my_sorted_array? The data in the first two columns are always well-behaved and regular (they are the x and y of a regular grid).

To make it more visual, I uploaded a figure here. The figure shows the trend I want to use to sort my data.


  • Can you please explain what the exact sort order is? How would you use the data to compare two rows manually? Commented Nov 26, 2020 at 11:43
  • Dear @Mad Physicist, my data are coordinates (x, y and z). I want to sort them first by x and then by y. I have a regular grid (x, y) of z data. My problem is that some points with low z values also have low x values, so they end up among the points with high z values. I will upload a photo of my real data to show how I would like to sort them. Commented Nov 26, 2020 at 11:50
  • Explain the logic. Figure out how to express your criteria precisely instead of stabbing in the dark. You are clearly not just sorting by x and y. Commented Nov 26, 2020 at 17:08
  • Dear @Mad Physicist, the logic is to sort first by x, then by y, but with an exception when the z value changes a lot. If I can separate my data into an upper and a lower set, it may be easier to sort each set separately and then merge them. For example, in my figure, the points numbered 1, 2, 3 and 4 would be one set and 5, 6, 7 and 8 another. The main issue is that I am working with real natural data, and they are really chaotic. Commented Nov 26, 2020 at 17:13
  • Here is what I think you are doing: you are thresholding z, and sorting in order (z > thresh, x, y). The first item is a boolean mask, and you can use z.mean as the threshold for now. Is that close enough? Commented Nov 26, 2020 at 17:17

1 Answer


Based on our discussion, you want to sort first on some categorical function of z, then x, then y. So in general, something very similar to your original code, rearranged ever so slightly for clarity:

x, y, z = my_array.T
index = np.lexsort((y, x, f(z)))
my_sorted_array = my_array[index, :]
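
The key order is easy to get backwards: np.lexsort treats the last key in the tuple as the primary one, which is why f(z) goes last. A minimal sketch with made-up keys (the names primary and secondary are only for illustration):

```python
import numpy as np

# Toy keys, chosen only for illustration.
primary = np.array([1, 0, 1, 0])
secondary = np.array([3, 2, 1, 4])

# np.lexsort uses the LAST key in the tuple as the primary sort key,
# and earlier keys break ties.
order = np.lexsort((secondary, primary))

# Rows with primary == 0 come first, ordered by secondary within each group.
sorted_pairs = np.column_stack((primary[order], secondary[order]))
print(sorted_pairs)
```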

For the simplest case, you can make f(z) return a boolean mask. This would work perfectly for your toy example, since it splits the data into two categories. Keep in mind that points labeled False will come before those labeled True, and apply your threshold accordingly:

def f(z):
    return z < z.mean()
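
Putting the pieces together on the array from the question, as a quick check that the mean threshold reproduces the requested order. The name expected is introduced here only for the comparison:

```python
import numpy as np

my_array = np.array([[1., 2., 5.1],
                     [1., 1., 5.],
                     [2., 2., 2.],
                     [2., 1., 2.],
                     [2., 2., 5.],
                     [3., 2., 2.5],
                     [3., 1., 2.5],
                     [2., 1., 5.]])

def f(z):
    # True marks the low-z group; False sorts before True,
    # so the high-z rows come out first.
    return z < z.mean()

x, y, z = my_array.T
index = np.lexsort((y, x, f(z)))  # primary key: f(z), then x, then y
my_sorted_array = my_array[index, :]

# The output requested in the question.
expected = np.array([[1., 1., 5.],
                     [1., 2., 5.1],
                     [2., 1., 5.],
                     [2., 2., 5.],
                     [2., 1., 2.],
                     [2., 2., 2.],
                     [3., 1., 2.5],
                     [3., 2., 2.5]])
print(np.array_equal(my_sorted_array, expected))  # True
```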

But why stop at two categories? You can use np.digitize to split the data into an arbitrary number of labels:

breaks = [2.25, 4.9]

def f(z):
    return np.digitize(z, np.sort(breaks)[::-1])
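
To see which labels this produces, here is a small sketch applied to the z column of the toy array. Because the bins are passed in descending order, the highest z values get label 0 and therefore sort first:

```python
import numpy as np

breaks = [2.25, 4.9]
z = np.array([5.1, 5., 2., 2., 5., 2.5, 2.5, 5.])  # third column of the toy array

# With descending bins [4.9, 2.25]:
#   label 0: z >= 4.9
#   label 1: 2.25 <= z < 4.9
#   label 2: z < 2.25
labels = np.digitize(z, np.sort(breaks)[::-1])
print(labels)  # [0 0 2 2 0 1 1 0]
```

Note that with these particular breaks the rows with z = 2.5 sort ahead of the rows with z = 2, which differs from the two-category result, so pick the breaks to match the grouping you actually want.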

Even more specifically, if your data has a pair of approximately equal clumps, you can isolate the gunk in between with something like this:

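def f(z):
    # Split into an upper and a lower clump around the overall mean
    mask0 = z > z.mean()
    z1 = z[mask0]   # upper clump
    z2 = z[~mask0]  # lower clump
    # One break 3 standard deviations below the upper clump's mean,
    # one 3 standard deviations above the lower clump's mean
    breaks = [z1.mean() - 3 * z1.std(), z2.mean() + 3 * z2.std()]
    return np.digitize(z, breaks)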
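
As a rough check of what those breaks look like, here is a sketch on the toy z values. The helper name clump_breaks is hypothetical and only factors out the break computation so it can be inspected:

```python
import numpy as np

def clump_breaks(z):
    # Hypothetical helper: same break computation as above, returned for inspection.
    mask0 = z > z.mean()
    z1 = z[mask0]   # upper clump
    z2 = z[~mask0]  # lower clump
    return [z1.mean() - 3 * z1.std(), z2.mean() + 3 * z2.std()]

z = np.array([5.1, 5., 2., 2., 5., 2.5, 2.5, 5.])  # third column of the toy array
breaks = clump_breaks(z)
labels = np.digitize(z, breaks)

print(breaks)  # roughly [4.9, 3.0]
print(labels)  # [0 0 2 2 0 2 2 0]
```

Here the clumps are well separated, so the breaks are descending, the upper clump gets label 0, the lower clump gets label 2, and label 1 is reserved for anything in between (nothing, in this toy data). If the clumps are wide enough that the two breaks cross, the bins become ascending and np.digitize numbers the labels from the low end instead, so it is worth checking the order of the breaks on real data.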