
I have a data table which has string and integer columns, such as:

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

I need unique rows, so I used the numpy unique function:

summary, repeat = np.unique(test_data, return_counts=True, axis=0)

But afterwards my data types are changed. summary is:

array([['A', '1', '2', '3'],
       ['B', '4', '5', '6']], dtype='<U1')

All data types are now strings. How can I prevent this change? (Python 3.7, numpy 1.16.4)

  • You cannot store multiple different data types in an array. But since all of these can be chars, python will automatically assume char and convert them. If you stored them separately you would get an integer array out of the ints. Commented Sep 1, 2020 at 10:40
  • @jaSnom not exactly "char" but yes, pretty much Commented Sep 1, 2020 at 10:55
  • @juanpa.arrivillaga yes, you are correct. Working with both Java and Python at my job makes me mix them up sometimes. Commented Sep 1, 2020 at 10:57
  • @juanpa.arrivillaga Technically you can have a recarray or an object array with tuples. Commented Sep 1, 2020 at 12:09
  • @MadPhysicist sure, but technically those are still storing a homogeneous data type, the structured dtype, or object :) (see the sketch after these comments) Commented Sep 1, 2020 at 12:15
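
A minimal sketch of that last point, assuming the test_data from the question: an object array is still homogeneous (its dtype is object), but each element is only a reference to one of the original tuples, so the mixed Python types survive. Note that np.unique(..., axis=0) generally cannot consolidate object arrays, so this alone does not answer the question.

import numpy as np

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# A homogeneous container (dtype=object) holding references to the tuples
obj_arr = np.empty(len(test_data), dtype=object)
obj_arr[:] = test_data

print(obj_arr[0])           # ('A', 1, 2, 3)
print(type(obj_arr[0][1]))  # <class 'int'>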

3 Answers


If you have Python objects and you want to retain them as Python objects, use Python functions:

unique_rows = set(test_data)

Or better yet:

from collections import Counter

rows_and_counts = Counter(test_data)

These solutions do not copy the data: they retain references to the original tuples just as they are. The numpy solution copies the data multiple times: once when converting to numpy, at least once when sorting, and possibly more when converting back.

These solutions have O(N) algorithmic complexity because they both use a hash table. The numpy unique solution uses sorting, and is therefore of O(N log N) complexity.
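
For example, a minimal sketch (reusing the summary/repeat names from the question) of how a Counter yields both the unique rows and their counts:

from collections import Counter

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

counted = Counter(test_data)
# keys() and values() iterate in the same order, so these stay aligned
summary = list(counted)          # [('A', 1, 2, 3), ('B', 4, 5, 6)]
repeat = list(counted.values())  # [2, 1]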


2 Comments

Based on your answer, I want to get the lists of keys and values separately, like: counted = Counter(test_data); summary, repeat = list(counted.keys()), list(counted.values()). Is there any possibility the sequences of the two lists can differ?
@kurag No. The keys and values of a dictionary are always returned in the same order; it's part of the contract. Counter is a subclass of dict. If you are really worried, do summary = list(counted); repeat = [counted[k] for k in summary]

You could explicitly specify your dtype in the np.array call preceding np.unique:

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

test_data = np.array(test_data, dtype=[('letter', '<U1'),
                                       ('x', np.int32),
                                       ('y', np.int32),
                                       ('z', np.int32)])

summary, repeat = np.unique(test_data, return_counts=True, axis=0)

The summary then looks as follows:

array([('A', 1, 2, 3), ('B', 4, 5, 6)],
      dtype=[('letter', '<U1'), ('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
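
A short usage sketch, assuming the summary and repeat arrays above: each named field of the structured array keeps its own dtype, so the integers stay integers.

print(summary['letter'])  # ['A' 'B']  (dtype '<U1')
print(summary['x'])       # [1 4]      (dtype int32)
print(repeat)             # [2 1]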

5 Comments

This data comes from an SQL table, therefore I need to take the data types from the SQL table and add them to the data.
@kurag why are you using numpy here? In any case, I'm not sure I understand the problem with this answer; why does it matter that it comes from a SQL table?
@wprazuch The data is taken from SQL and some operations are made on it; at one stage I need the unique rows together with how many times each repeats. I can use your answer, but first I have to check the original data types. If I do that I can also use Aratz's answer with modifications, but your answer is shorter. Still, I am looking for an answer within numpy itself, if there is one. And as I said to Aratz, numpy should present the original structure after operations, not an edited version.
@kurag If that is the case, then nothing comes to my mind for now.
@kurag You do understand that np.unique is actually not that efficient? A Python set or Counter has better scaling performance.

I think this has to do with the fact that in a numpy array all items have to have the same type. What you could do instead is parse your result back when it comes out of numpy, e.g.:

result = []
for row in summary.tolist():
    new_row = []
    for value in row:
        try:
            # Strings that parse as integers go back to int
            new_row.append(int(value))
        except ValueError:
            # Everything else stays a string
            new_row.append(value)
    result.append(tuple(new_row))
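
A hypothetical usage sketch, assuming the repeat counts from the question's np.unique call are still in scope: the parsed rows line up with their counts because the loop preserves the order of summary.

for row, count in zip(result, repeat):
    print(row, count)
# ('A', 1, 2, 3) 2
# ('B', 4, 5, 6) 1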

1 Comment

Thank you for the answer, but I think this should be in numpy itself. The data table is big and the string column can be any column, so if there is no shorter way I will write a detailed parsing function.
