How to sort a NumPy array of strings by the last column

Question

Is there a way to sort the rows of an array by the last element, in this case the cell IDs? The cell ID is built as follows: "CellID_NumberOfCell".

arr =np.array([['65.0','30.0','20.0','0.0','0_0'],
 ['2.0','29.0','24.0','0.0','1_0'],
 ['0.0','18.0','4.0','0.0','2_0'],
 ['16.0','9.0','0.0','9990.0','7_203'],
 ['16.0','9.0','0.0','9990.0','0_203'],
 ['20.0','23.0','31.0','9990.0','8_158'],
 ['65.0','30.0','20.0','0.0','0_10']])

So after sorting it should look like:

arr =np.array([['65.0','30.0','20.0','0.0','0_0'],
 ['65.0','30.0','20.0','0.0','0_10'],
 ['16.0','9.0','0.0','9990.0','0_203'],
 ['2.0','29.0','24.0','0.0','1_0'],
 ['0.0','18.0','4.0','0.0','2_0'],
 ['16.0','9.0','0.0','9990.0','7_203'],
 ['20.0','23.0','31.0','9990.0','8_158']])

EDIT:

Is it also possible to delete the numbers after the underscore after sorting, so that I just have the ID? Instead of 0_0, just 0.

EDIT2

After sorting by ID, the rows should also be sorted by time, so that, for example, all rows with ID 0 are ordered by time 0, 1, ..., 9999, etc.

Edit the question title to reflect its intention; something like "How to sort a NumPy array by the last element of each row?"
– Ébe Isaac, Commented Jun 7, 2017 at 11:07
@Varlor Use an input like arr[np.random.randint(0,arr.shape[0],(1000))] to test all the approaches. You can vary the 1000.
– Divakar, Commented Jun 7, 2017 at 11:39
@Divakar Hey, how does this function work? Does it analyze the structure of my input and generate 1000 random rows in the same shape?
– Varlor, Commented Jun 7, 2017 at 11:51
@Varlor Basically, it picks rows from arr in random order, with repeats, and gives us a (1000,5) shaped array.
– Divakar, Commented Jun 7, 2017 at 11:52
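
To make the suggestion in the comments concrete, here is a minimal sketch of building such a test input. The three-row `arr`, the fixed seed, and the names `n_rows`, `idx`, and `test_arr` are illustrative assumptions, not part of the original thread.

```python
import numpy as np

# Small stand-in for the array from the question (assumed data).
arr = np.array([['65.0', '30.0', '20.0', '0.0', '0_0'],
                ['2.0', '29.0', '24.0', '0.0', '1_0'],
                ['16.0', '9.0', '0.0', '9990.0', '7_203']])

np.random.seed(0)   # fixed seed only so the run is repeatable
n_rows = 1000       # vary this to change the size of the test input

# Draw n_rows row indices in [0, number of rows), with repeats allowed.
idx = np.random.randint(0, arr.shape[0], n_rows)

# Fancy indexing with those indices picks whole rows in random order.
test_arr = arr[idx]
print(test_arr.shape)  # (1000, 5)
```

Every row of `test_arr` is a copy of some row of `arr`, so any of the sorting approaches in the answers can be timed on it.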

P. Camilleri · Accepted Answer · 2017-06-07 11:27:27Z

5

np.argsort(arr[:, -1]) will give you the permutation so that elements of the last column of arr are ordered.

Then, arr[np.argsort(arr[:, -1])] reorders the rows of arr according to this permutation.
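
As a small illustration of the two steps above, the following sketch uses a made-up three-row array (the data and the names `order`, `sorted_arr`, and `last_col` are assumptions for the example). Because the values are strings, the comparison is character by character.

```python
import numpy as np

arr = np.array([['65.0', '30.0', '20.0', '0.0', '0_2'],
                ['2.0', '29.0', '24.0', '0.0', '1_0'],
                ['16.0', '9.0', '0.0', '9990.0', '0_10']])

# Permutation that would sort the last column.
order = np.argsort(arr[:, -1])

# Apply the permutation to whole rows.
sorted_arr = arr[order]

last_col = sorted_arr[:, -1].tolist()
print(last_col)  # ['0_10', '0_2', '1_0']
```

Note that '0_10' lands before '0_2' because '1' compares lower than '2' as a character.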

Beware that lexicographic order is used since your data consists of strings, so 0_10 comes before 0_2. If this is not what you want, you should split the last column into its two numeric parts, and I advise you to use a pandas.DataFrame:

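import pandas as pd
df = pd.DataFrame(arr)
# Split the last column on the first underscore into two new columns
df[['Cell', 'CellIndex']] = df[df.columns[-1]].str.split('_', n=1, expand=True)
df['Cell'] = df['Cell'].astype(int)
df['CellIndex'] = df['CellIndex'].astype(int)
# sort_values returns a new DataFrame, so keep the result
df = df.sort_values(['Cell', 'CellIndex'])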

pandas is really the way to go to manipulate this kind of data.
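
For readers who need a NumPy array back at the end, the following is a hedged sketch of the full round trip. The helper column names `cell_id` and `cell_index` are hypothetical, and the data is the array from the question; the code assumes a reasonably recent pandas where `to_numpy()` and `str.split(..., expand=True)` are available.

```python
import numpy as np
import pandas as pd

arr = np.array([['65.0', '30.0', '20.0', '0.0', '0_0'],
                ['2.0', '29.0', '24.0', '0.0', '1_0'],
                ['0.0', '18.0', '4.0', '0.0', '2_0'],
                ['16.0', '9.0', '0.0', '9990.0', '7_203'],
                ['16.0', '9.0', '0.0', '9990.0', '0_203'],
                ['20.0', '23.0', '31.0', '9990.0', '8_158'],
                ['65.0', '30.0', '20.0', '0.0', '0_10']])

df = pd.DataFrame(arr)

# Split 'ID_index' into two integer key columns (hypothetical names).
parts = df[4].str.split('_', n=1, expand=True).astype(int)
df['cell_id'] = parts[0]
df['cell_index'] = parts[1]

# Sort numerically by ID, then by index within each ID.
sorted_df = df.sort_values(['cell_id', 'cell_index'])

# Drop the helper columns and convert back to a NumPy array.
sorted_arr = sorted_df.drop(columns=['cell_id', 'cell_index']).to_numpy()
print(sorted_arr[:, -1].tolist())
```

Dropping the helper columns before converting keeps the result at the original five columns.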

edited Jun 7, 2017 at 11:27

answered Jun 7, 2017 at 11:02

P. Camilleri


4 Comments

P. Camilleri Over a year ago

Is arr really a numpy array? What is type(arr)? Try arr = np.array(arr)

Varlor Over a year ago

Yeah, that was my mistake. But the next problem is that the output is now a pandas DataFrame. Is it possible to cast it back to a numpy array? :)

Varlor Over a year ago

I added an edit. Is it also possible to do that after sorting?

P. Camilleri Over a year ago

@Varlor arr = np.array(df). Pandas relies heavily on numpy :)

Divakar · Accepted Answer · 2017-06-07 13:33:26Z

2

We need to split the last column by that underscore, lexsort it and then use those indices to sort the input array.

Thus, an implementation would be -

import numpy as np

def numpy_app(arr):
    # Split each string in the last column on '_'. partition returns three
    # columns per row: the text before '_', the '_' itself, and the text after.
    a = np.char.partition(arr[:,-1],'_')

    # Lexsort on the two numeric columns (0 and 2). lexsort treats the last
    # key as the primary one, so the columns are taken in flipped order (2, 0)
    # to give precedence to the number before the underscore.
    sidx = np.lexsort(a[:,2::-2].astype(int).T)

    # Index into the input array with the lex-sorted indices for the final output
    return arr[sidx]
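
To see what the partition and lexsort steps produce, here is a small sketch on three made-up IDs. It uses `np.char.partition`, the public alias for the same routine, and the names `ids`, `parts`, `keys`, and `order` are assumptions for illustration.

```python
import numpy as np

ids = np.array(['9_49', '9_5', '1_0'])

# One row per ID, three columns: text before '_', the '_' itself, text after.
parts = np.char.partition(ids, '_')
print(parts.shape)  # (3, 3)

# Take columns 2 and 0 (in that order), convert to int, and transpose so
# each key is a row. lexsort uses the LAST row as the primary key.
keys = parts[:, 2::-2].astype(int).T

order = np.lexsort(keys)
print(ids[order].tolist())  # ['1_0', '9_5', '9_49']
```

Converting to integers is what puts 9_5 ahead of 9_49; a plain string sort would place 9_49 first.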

Based on the edits in the question, it seems we want to cut out the string after the underscore. To do so, here's a modified version -

def numpy_cut_app(arr):
    a = np.char.partition(arr[:,-1],'_')
    sidx = np.lexsort(a[:,2::-2].astype(int).T)
    out = arr[sidx]

    # Replace the last column with the part before the underscore (the ID)
    out[:,-1] = a[sidx,0]
    return out

Based on more edits, it seems we want to include the fourth column into lex-sorting and neglect everything after the underscore in the last column. So, a further modified version would be -

def numpy_cut_col3_app(arr):
    a = np.char.partition(arr[:,-1],'_')

    # Lex-sort using the ID (part before the underscore) as the primary key
    # and col-3 of the input array as the secondary key. The ID is cast to
    # int so that, e.g., 10 sorts after 2 rather than before it.
    sidx = np.lexsort([arr[:,3].astype(float), a[:,0].astype(int)])
    out = arr[sidx]
    out[:,-1] = a[sidx,0]
    return out
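
The following self-contained sketch shows the ID-then-time ordering on a made-up input with a two-digit ID. The function name `sort_by_id_then_time`, the `time_col` parameter, and the data are assumptions for illustration; it also assumes the IDs should compare as integers rather than as strings.

```python
import numpy as np

def sort_by_id_then_time(arr, time_col=3):
    # Hypothetical helper: sort rows by integer ID, then by time.
    parts = np.char.partition(arr[:, -1], '_')
    cell_id = parts[:, 0].astype(int)          # numeric ID before '_'
    time = arr[:, time_col].astype(float)      # time column as numbers

    # lexsort uses the last key as the primary one: ID first, then time.
    order = np.lexsort([time, cell_id])

    out = arr[order]                 # fancy indexing returns a copy
    out[:, -1] = parts[order, 0]     # keep only the ID in the last column
    return out

arr = np.array([['65.0', '30.0', '20.0', '0.0', '10_1'],
                ['16.0', '9.0', '0.0', '9990.0', '2_7'],
                ['2.0', '29.0', '24.0', '0.0', '2_3'],
                ['0.0', '18.0', '4.0', '180.0', '10_0']])

result = sort_by_id_then_time(arr)
print(result[:, -1].tolist())  # ['2', '2', '10', '10']
print(result[:, 3].tolist())   # ['0.0', '9990.0', '0.0', '180.0']
```

Since `arr[order]` is a copy, the input array is left unchanged.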

Sample runs -

In [567]: arr
Out[567]: 
array([['65.0', '30.0', '20.0', '0.0', '9_49'],
       ['2.0', '29.0', '24.0', '0.0', '1_0'],
       ['0.0', '18.0', '4.0', '0.0', '2_0'],
       ['16.0', '9.0', '0.0', '9990.0', '7_203'],
       ['16.0', '9.0', '0.0', '9990.0', '9_5'],
       ['20.0', '23.0', '31.0', '9990.0', '8_158'],
       ['65.0', '30.0', '20.0', '0.0', '9_50']], 
      dtype='|S6')

In [568]: numpy_app(arr)
Out[568]: 
array([['2.0', '29.0', '24.0', '0.0', '1_0'],
       ['0.0', '18.0', '4.0', '0.0', '2_0'],
       ['16.0', '9.0', '0.0', '9990.0', '7_203'],
       ['20.0', '23.0', '31.0', '9990.0', '8_158'],
       ['16.0', '9.0', '0.0', '9990.0', '9_5'],
       ['65.0', '30.0', '20.0', '0.0', '9_49'],
       ['65.0', '30.0', '20.0', '0.0', '9_50']], 
      dtype='|S6')

In [569]: numpy_cut_app(arr)
Out[569]: 
array([['2.0', '29.0', '24.0', '0.0', '1'],
       ['0.0', '18.0', '4.0', '0.0', '2'],
       ['16.0', '9.0', '0.0', '9990.0', '7'],
       ['20.0', '23.0', '31.0', '9990.0', '8'],
       ['16.0', '9.0', '0.0', '9990.0', '9'],
       ['65.0', '30.0', '20.0', '0.0', '9'],
       ['65.0', '30.0', '20.0', '0.0', '9']], 
      dtype='|S6')

edited Jun 7, 2017 at 13:33

answered Jun 7, 2017 at 11:09

Divakar


11 Comments

Varlor Over a year ago

Nice! The problem now is something like this: ['10.0' '33.0' '14.0' '2505.0' '9_49'] ['1.0' '12.0' '15.0' '180.0' '9_5'] ['12.0' '3.0' '15.0' '2520.0' '9_50']. 5 is sorted between 49 and 50.

Varlor Over a year ago

I added an edit. Is it also possible to do that after sorting?

Divakar Over a year ago

@Varlor Updated. Fixed that 5, 49, 50 sorting issue.

Varlor Over a year ago

OK, thank you very much! Unfortunately I made a mistake in my question. It should be sorted by the ID as you did, but also by the time (column 3). So the output of your test should be: array([['2.0', '29.0', '24.0', '0.0', '1'], ['0.0', '18.0', '4.0', '0.0', '2'], ['16.0', '9.0', '0.0', '9990.0', '7'], ['20.0', '23.0', '31.0', '9990.0', '8'], ['65.0', '30.0', '20.0', '0.0', '9'], ['65.0', '30.0', '20.0', '0.0', '9'], ['16.0', '9.0', '0.0', '9990.0', '9']], dtype='|S6')

Divakar Over a year ago

@Varlor By third col, do you mean arr[:,3] or arr[:,2]?


Tbaki · Accepted Answer · 2017-06-07 12:40:43Z

2

You can do it easily with sorted and a lambda function, wrapping the result in np.array (as suggested by @Divakar) to get a numpy array back:

np.array(sorted(arr, key=lambda x: x[-1]))

output

[['65.0', '30.0', '20.0', '0.0', '0_0'],
['65.0', '30.0', '20.0', '0.0', '0_10'],
['16.0', '9.0', '0.0', '9990.0', '0_203'],
['2.0', '29.0', '24.0', '0.0', '1_0'],
['0.0', '18.0', '4.0', '0.0', '2_0'],
['16.0', '9.0', '0.0', '9990.0', '7_203'],
['20.0', '23.0', '31.0', '9990.0', '8_158']]
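
The key above compares the IDs as strings. If numeric ordering is wanted instead, one possible variation is a key function that splits the ID and converts both parts to integers. The name `cell_key` and the sample data are assumptions for this sketch.

```python
import numpy as np

def cell_key(row):
    # Hypothetical key: turn 'ID_index' into a tuple of two integers.
    cell_id, cell_index = row[-1].split('_')
    return int(cell_id), int(cell_index)

arr = np.array([['65.0', '30.0', '20.0', '0.0', '9_49'],
                ['16.0', '9.0', '0.0', '9990.0', '9_5'],
                ['2.0', '29.0', '24.0', '0.0', '1_0']])

# sorted() iterates over the rows; np.array() rebuilds a 2-D array.
sorted_arr = np.array(sorted(arr, key=cell_key))
print(sorted_arr[:, -1].tolist())  # ['1_0', '9_5', '9_49']
```

Tuples compare element by element, so rows are ordered by ID first and by the number after the underscore second.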

EDIT: to also drop the part after the underscore, you can use this. It is not pretty, but it does the job:

np.array([np.append(i[:-1], i[-1].split("_")[0]) for i in sorted(list(arr), key=lambda x: x[-1])])

output

array([['65.0', '30.0', '20.0', '0.0', '0'],
       ['65.0', '30.0', '20.0', '0.0', '0'],
       ['16.0', '9.0', '0.0', '9990.0', '0'],
       ['2.0', '29.0', '24.0', '0.0', '1'],
       ['0.0', '18.0', '4.0', '0.0', '2'],
       ['16.0', '9.0', '0.0', '9990.0', '7'],
       ['20.0', '23.0', '31.0', '9990.0', '8']], 
      dtype='<U6')

edited Jun 7, 2017 at 12:40

answered Jun 7, 2017 at 11:17

Tbaki


5 Comments

Varlor Over a year ago

If I use your approach I get something like this (arrays in an array): [array(['65.0', '30.0', '20.0', '0.0', '0_0'], dtype='|S6'), array(['65.0', '30.0', '20.0', '0.0', '0_10'], dtype='|S6'), array(['16.0', '9.0', '0.0', '9990.0', '0_203'], dtype='|S6'), array(['2.0', '29.0', '24.0', '0.0', '1_0'], dtype='|S6'), array(['0.0', '18.0', '4.0', '0.0', '2_0'], dtype='|S6'), array(['16.0', '9.0', '0.0', '9990.0', '7_203'], dtype='|S6'), array(['20.0', '23.0', '31.0', '9990.0', '8_158'], dtype='|S6')]

Tbaki Over a year ago

@Varlor Wasn't that the original format? If not, can you provide a way for people to reproduce your data? If I enter your line arr = ..., I get TypeError: list indices must be integers or slices, not str, so I assumed it was a nested list.

Divakar Over a year ago

@Varlor Use np.array() to get back an array output.

Varlor Over a year ago

I added an edit. Is it also possible to do that after sorting?

Tbaki Over a year ago

@Varlor Good. :) If everything is working, can you accept the answer that best fits your needs?
