CSV data - max values for segments of columns using numpy

Question

Let's say I have a CSV file with data like this:

'time'  'speed'
0       2.3
0       3.4
0       4.1
0       2.1
1       1.3
1       3.5
1       5.1
1       1.1
2       2.3
2       2.4
2       4.4
2       3.9

For each distinct value under the header 'time', I want to find the maximum value in the 'speed' column and return that maximum next to its time value in an array. The actual CSV file I'm using is much larger, so the solution has to work over a large amount of data and not just for 'time' values of 0, 1, or 2.

So basically I want this to return:

array([[0, 4.1], [1, 5.1], [2, 4.4]])

Using numpy specifically.

And what have you tried so far?

– sshashank124, Commented Mar 30, 2014 at 14:56
Related: stackoverflow.com/q/8623047/279627

– Sven Marnach, Commented Mar 30, 2014 at 15:07

Sven Marnach · Accepted Answer · 2014-03-30 15:03:01Z

This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:

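import numpy

a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()  # sort by time, then by speed within each time
unique_times = numpy.unique(a["time"])
# index of the last (largest-speed) row in each time group
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]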

This loads the data into a one-dimensional array with two fields and sorts it, by time first and then by speed. The sorted array has its data grouped by time, with the biggest speed value always being the last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.
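As a way to check the steps described above, here is a minimal sketch that runs the same approach on the sample data from the question. It assumes the data is read from an in-memory string (the name `csv_text` is made up for this illustration) rather than from "a.csv", and it adds one optional step that is not part of the answer: stacking the two fields into a plain 2-D array like the one the question asks for.

```python
import io

import numpy as np

# Sample data from the question, held in memory instead of "a.csv"
# (csv_text is a hypothetical name used only for this sketch).
csv_text = """time speed
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
"""

a = np.genfromtxt(io.StringIO(csv_text), names=["time", "speed"], skip_header=1)
a.sort()  # structured arrays sort by field order: time, then speed
unique_times = np.unique(a["time"])
# Last row of each time group, which holds that group's largest speed.
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]

# Optional extra step: turn the structured result into a plain 2-D float array.
result_2d = np.column_stack([result["time"], result["speed"]])
print(result_2d)
```

Note that `result` itself is a structured array with named fields; the `column_stack` call is only needed if an ordinary two-column array is wanted.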

edited Mar 30, 2014 at 15:03

answered Mar 30, 2014 at 14:56

Sven Marnach


behzad.nouri · Accepted Answer · 2014-03-30 15:19:57Z

pandas fits nicely for this kind of stuff:

>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time  speed
... 0       2.3
... 0       3.4
... 0       4.1
... 0       2.1
... 1       1.3
... 1       3.5
... 1       5.1
... 1       1.1
... 2       2.3
... 2       2.4
... 2       4.4
... 2       3.9
... """), sep=r"\s+")
>>> df
    time  speed
0      0    2.3
1      0    3.4
2      0    4.1
3      0    2.1
4      1    1.3
5      1    3.5
6      1    5.1
7      1    1.1
8      2    2.3
9      2    2.4
10     2    4.4
11     2    3.9

[12 rows x 2 columns]

Once you have the data frame, all you need to do is group by time and aggregate by the maximum of the speeds:

>>> df.groupby('time')['speed'].aggregate('max')
time
0       4.1
1       5.1
2       4.4
Name: speed, dtype: float64
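Since the question asks for an array of [time, max speed] pairs, the grouped result can be converted back to NumPy. The following is only a sketch under a few assumptions: the sample data is read from an in-memory string (`csv_text` is a made-up name), the columns are whitespace-separated as in the question, and `reset_index().to_numpy()` is one of several possible ways to do the conversion.

```python
import io

import pandas as pd

# Sample data from the question (csv_text is a hypothetical name for this sketch).
csv_text = """time speed
0 2.3
0 3.4
0 4.1
0 2.1
1 1.3
1 3.5
1 5.1
1 1.1
2 2.3
2 2.4
2 4.4
2 3.9
"""

# Columns are separated by runs of whitespace, so use a regex separator.
df = pd.read_csv(io.StringIO(csv_text), sep=r"\s+")

# Maximum speed per time value, indexed by time.
max_speed = df.groupby("time")["speed"].max()

# Move "time" back into a column and convert to a 2-D NumPy array.
result = max_speed.reset_index().to_numpy()
print(result)
```

Because the two columns have different dtypes (integer time, float speed), the resulting array is upcast to float.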

edited Mar 30, 2014 at 15:19

answered Mar 30, 2014 at 14:59

behzad.nouri

DSM Over a year ago

Even though numpy is the wrong choice for this problem, and pandas a much better one, the OP did say "Using numpy specifically".
