
I have a data table which has string and integer columns, such as:

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

I need unique rows, so I used the numpy unique function:

summary, repeat = np.unique(test_data, return_counts=True, axis=0)

But afterwards my data types are changed. summary is:

array([['A', '1', '2', '3'],
       ['B', '4', '5', '6']], dtype='<U1')

All data types are now strings. How can I prevent this change? (Python 3.7, numpy 1.16.4)

  • You cannot store multiple different data types in an array. But since all of these can be chars, python will automatically assume char and convert them. If you stored them separately you would get an integer array out of the ints. Commented Sep 1, 2020 at 10:40
  • @jaSnom not exactly "char" but yes, pretty much Commented Sep 1, 2020 at 10:55
  • @juanpa.arrivillaga yes, you are correct. Working with both Java and Python at my job makes me mix them up sometimes. Commented Sep 1, 2020 at 10:57
  • @juanpa.arrivillaga Technically you can have a recarray or an object array with tuples. Commented Sep 1, 2020 at 12:09
  • @MadPhysicist sure, but technically those are still storing a homogeneous data type, the structured dtype, or object :) (see the sketch after these comments) Commented Sep 1, 2020 at 12:15
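
A minimal sketch of that last point, assuming the test_data from the question: an object array is still homogeneous (its dtype is object), but each element is only a reference to one of the original tuples, so the mixed Python types survive. Note that np.unique(..., axis=0) generally cannot consolidate object arrays, so this alone does not answer the question.

import numpy as np

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# A homogeneous container (dtype=object) holding references to the tuples
obj_arr = np.empty(len(test_data), dtype=object)
obj_arr[:] = test_data

print(obj_arr[0])           # ('A', 1, 2, 3)
print(type(obj_arr[0][1]))  # <class 'int'>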

3 Answers


If you have Python objects and you want to retain them as Python objects, use Python functions:

unique_rows = set(test_data)

Or better yet:

from collections import Counter

rows_and_counts = Counter(test_data)

These solutions do not copy the data: they retain references to the original tuples just as they are. The numpy solution copies the data multiple times: once when converting to numpy, at least once when sorting, and possibly more when converting back.

These solutions have O(N) algorithmic complexity because they both use a hash table. The numpy unique solution uses sorting, and is therefore of O(N log N) complexity.
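
For example, a minimal sketch (reusing the summary/repeat names from the question) of how a Counter yields both the unique rows and their counts:

from collections import Counter

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

counted = Counter(test_data)
# keys() and values() iterate in the same order, so these stay aligned
summary = list(counted)          # [('A', 1, 2, 3), ('B', 4, 5, 6)]
repeat = list(counted.values())  # [2, 1]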


2 Comments

Based on your answer, I want to get the lists of keys and values separately, like: counted = Counter(test_data); summary, repeat = list(counted.keys()), list(counted.values()). Is there any possibility the sequences of the two lists can differ?
@kurag No. The keys and values of a dictionary are always returned in the same order; it's part of the contract. Counter is a subclass of dict. If you are really worried, do summary = list(counted); repeat = [counted[k] for k in summary]

You could explicitly specify your dtype in the np.array call preceding np.unique:

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

test_data = np.array(test_data, dtype=[('letter', '<U1'),
                                       ('x', np.int32),
                                       ('y', np.int32),
                                       ('z', np.int32)])

summary, repeat = np.unique(test_data, return_counts=True, axis=0)

The summary then looks as follows:

array([('A', 1, 2, 3), ('B', 4, 5, 6)],
      dtype=[('letter', '<U1'), ('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
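
A short usage sketch, assuming the summary and repeat arrays above: each named field of the structured array keeps its own dtype, so the integers stay integers.

print(summary['letter'])  # ['A' 'B']  (dtype '<U1')
print(summary['x'])       # [1 4]      (dtype int32)
print(repeat)             # [2 1]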

5 Comments

This data comes from an SQL table, therefore I need to take the data types from the SQL table and add them to the data.
@kurag why are you using numpy here? In any case, I'm not sure I understand the problem with this answer; why does it matter that it comes from a SQL table?
@wprazuch The data is taken from SQL and some operations are made on it; at one stage I need the unique rows together with how many times each repeats. I can use your answer, but first I have to check the original data types. If I do that I can also use Aratz's answer with modifications, but your answer is shorter. Still, I am looking for an answer within numpy itself, if there is one. And as I said to Aratz, numpy should present the original structure after operations, not an edited version.
@kurag If that is the case, then nothing comes to my mind for now.
@kurag You do understand that np.unique is actually not that efficient? A Python set or Counter has better scaling performance.

I think this has to do with the fact that in a numpy array all items have to have the same type. What you could do instead is parse your result back when it comes out of numpy, e.g.:

result = []
for row in summary.tolist():
    new_row = []
    for value in row:
        try:
            # Strings that parse as integers go back to int
            new_row.append(int(value))
        except ValueError:
            # Everything else stays a string
            new_row.append(value)
    result.append(tuple(new_row))
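
A hypothetical usage sketch, assuming the repeat counts from the question's np.unique call are still in scope: the parsed rows line up with their counts because the loop preserves the order of summary.

for row, count in zip(result, repeat):
    print(row, count)
# ('A', 1, 2, 3) 2
# ('B', 4, 5, 6) 1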

1 Comment

Thank you for the answer, but I think this should be in numpy itself. The data table is big and the string column can be any column, so if there is no shorter way I will write a detailed parsing function.
