Starting off with a structured numpy array that has 4 fields, I am trying to return an array with just the latest dates, by ID, containing the same 4 fields. I found a solution using itertools.groupby that almost works here: Numpy Mean Structured Array

The problem is I don't understand how to adapt this when you have 4 fields instead of 2. I want to get the whole 'row' back, but only the rows for the latest dates for each ID. I understand that this kind of thing is simpler using pandas, but this is just a small piece of a larger process, and I can't add pandas as a dependency.

import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

For this example array, my desired output would be:

np.array([(datetime.date(2005, 2, 2), 1, 4, 9),
          (datetime.date(2005, 2, 3), 2, 7, 12)],
         dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])

This is what I've tried:

latest = np.array([(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
              ['dt'].argmax()) for k, g in 
              groupby(np.sort(data, order='ID').view(np.recarray),
              itemgetter('ID'))], dtype=data.dtype)

I get this error:

ValueError: size of tuple must match number of fields.

I think this is because the tuple has 2 fields but the array has 4. When I drop 'f3' and 'f4' from the array it works correctly.

How can I get it to return all 4 fields?

  • I would strongly recommend using pandas for this. It would be much easier. Commented Apr 14, 2015 at 23:05
  • What exactly is your desired output for the example array above? Commented Apr 14, 2015 at 23:41
  • @ali_m I'm looking for an array like below Commented Apr 14, 2015 at 23:47
  • Is it correct that you only want to keep the 'dt' and 'ID' fields in the result? Commented Apr 14, 2015 at 23:54
  • array([(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)], dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')]) Commented Apr 14, 2015 at 23:56

1 Answer

Let's figure out where your error is by peeling off one layer:

In [38]: from operator import itemgetter
In [39]: from itertools import groupby
In [41]: [(k, np.array(list(g), dtype=data.dtype).view(np.recarray)
          ['dt'].argmax()) for k, g in 
          groupby(np.sort(data, order='ID').view(np.recarray),
          itemgetter('ID'))]
Out[41]: [(1, 1), (2, 2)]

What is this list of tuples supposed to represent? It clearly isn't rows from data. And since each tuple has only 2 items, it can't be mapped onto an array with data.dtype, which expects 4 fields per row. Hence the ValueError.


After playing around with this a bit, I think [(1, 1), (2, 2)] is a list of (ID, argmax) pairs: for ID==1, use item [1] of its group; for ID==2, use item [2] of its group. Those are the rows you want:

[(datetime.date(2005, 2, 2), 1, 4, 9),
 (datetime.date(2005, 2, 3), 2, 7, 12)]

You have found the maximum dates, but you have to translate those to either indexes in data, or select those items from the groups.

In [91]: groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))
# don't need recarray

In [92]: G = [(k,list(g)) for k,g in groups]

In [93]: G
Out[93]: 
[(1,
  [(datetime.date(2005, 2, 1), 1, 3, 8),
   (datetime.date(2005, 2, 2), 1, 4, 9)]),
 (2,
  [(datetime.date(2005, 2, 1), 2, 5, 10),
   (datetime.date(2005, 2, 2), 2, 6, 11),
   (datetime.date(2005, 2, 3), 2, 7, 12)])]
In [107]: I=[(1,1), (2,2)]

In [108]: [g[1][i[1]] for g,i in zip(G,I)]
Out[108]: [(datetime.date(2005, 2, 2), 1, 4, 9), (datetime.date(2005, 2, 3), 2, 7, 12)]

OK, this selection from G is clumsy, but it is a start.
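
The other route mentioned above, translating to indexes in data, can be sketched without groupby at all. This is a swapped-in technique rather than the approach used in the rest of this answer: it assumes that sorting by ID and then by date puts the latest row of each ID last in its group. The names order, is_last and latest_idx are only illustrative.

```python
import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

# Indexes into data that sort it by ID, then by date within each ID.
order = np.argsort(data, order=['ID', 'dt'])
ids = data['ID'][order]

# A sorted row is the latest for its ID when the next row has a different ID
# (the final row always closes the last group).
is_last = np.append(ids[1:] != ids[:-1], True)

latest_idx = order[is_last]   # indexes into the original, unsorted data
latest = data[latest_idx]     # full rows, all 4 fields, same dtype as data
print(latest)
```

Because the result comes from indexing data directly, every field is kept and no dtype has to be passed anywhere.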


If I define a simple function to pull the record with the latest date from a group, the processing is a lot simpler.

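from itertools import groupby
from operator import itemgetter

def maxdate_record(agroup):
    # agroup yields records; np.array rebuilds a structured array from them
    an_array = np.array(list(agroup))
    i = np.argmax(an_array['dt'])   # position of the latest date in this group
    return an_array[i]

# groupby only groups consecutive items, so sort by ID first
groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))
np.array([maxdate_record(g) for k, g in groups])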

producing:

array([(datetime.date(2005, 2, 2), 1, 4, 9),
       (datetime.date(2005, 2, 3), 2, 7, 12)], 
      dtype=[('dt', '<M8[D]'), ('ID', '<i4'), ('f3', '<i4'), ('f4', '<i4')])

I don't need to specify dtype when I convert a list of records to an array, since the records have their own dtype.
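
A small check of that last point, as a sketch: the two rows are picked by hand here (indexes 1 and 4 of the example data) purely for illustration, and records and result are made-up names.

```python
import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

# Indexing a structured array with an integer gives a record (numpy.void)
# that carries the array's dtype with it.
records = [data[1], data[4]]

# No dtype argument: numpy takes the dtype from the records themselves.
result = np.array(records)

print(result.dtype == data.dtype)
print(result.dtype.names)
```

Plain Python tuples carry no such information, which is why the list of (ID, argmax) tuples in the question could not be turned into a data.dtype array.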

1 Comment

Thanks, the last function is exactly what I was looking for. This is the first time I've used the itertools library, is there a way to 'look under the hood' of a groupby object? For example, when I input the groups object, all I get back is <itertools.groupby object at 0x0235F3F0>, which makes it hard to tell if it's doing what I want it to.
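
On inspecting a groupby object: it is a lazy iterator, so one way to see what it holds is to turn every group into a list, much like the G list in the answer. The sketch below assumes the example data from the question; materialized and leftover are illustrative names.

```python
from itertools import groupby
from operator import itemgetter

import numpy as np

data = np.array([('2005-02-01', 1, 3, 8),
                 ('2005-02-02', 1, 4, 9),
                 ('2005-02-01', 2, 5, 10),
                 ('2005-02-02', 2, 6, 11),
                 ('2005-02-03', 2, 7, 12)],
                dtype=[('dt', 'datetime64[D]'), ('ID', '<i4'),
                       ('f3', '<i4'), ('f4', '<i4')])

groups = groupby(np.sort(data, order='ID'), itemgetter('ID'))

# Each group is itself a lazy iterator, so copy it into a list to look at it.
materialized = [(key, list(group)) for key, group in groups]

for key, rows in materialized:
    print(key, len(rows), rows)

# groupby can only be consumed once: after the comprehension above,
# nothing is left in it.
leftover = list(groups)
print(leftover)
```

Since inspecting the object uses it up, build a fresh groupby (or work from the materialized list) before doing the real processing.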
