
I have a pandas DataFrame with the following columns:

    order_id latitude
0       519  19.119677
1       519  19.119677
2       520  19.042117
3       520  19.042117
4       520  19.042117
5       521  19.138245
6       523  19.117662
7       523  19.117662
8       523  19.117662
9       523  19.117662
10      523  19.117662
11      524  19.137793
12      525  19.119372
13      526   0.000000
14      526   0.000000
15      526   0.000000
16      527  19.133430
17      528   0.000000
18      529  19.118284
19      530   0.000000
20      531  19.114269
21      531  19.114269
22      532  19.136292
23      533  19.119075
24      533  19.119075
25      533  19.119075
26      534  19.119677
27      535  19.119677
28      535  19.119677
29      535  19.119677

order_id is repeated; I want the unique order_id values, which I can get with:

    unique_order_id = pd.unique(tsp_data['order_id'])

    array(['519', '520', '521', '523', '524', '525', '526', '527', '528',
           '529', '530', '531', '532', '533', '534', '535'], dtype=object)

This returns the correct unique values, which I store in the unique_order_id variable. Now I want only the corresponding latitude value for each unique order_id.

I am doing something like this:

    tsp_data['latitude'][tsp_data['order_id'].isin(unique_order_id)]

But it returns all 30 rows. Where am I going wrong? Please help.

  • Why don't you just drop the duplicates? df.drop_duplicates()? Commented Jan 11, 2016 at 10:29
  • Alternatively, you can just do df.groupby('order_id').first().reset_index(). Commented Jan 11, 2016 at 10:31
  • As to why your attempt failed: by passing isin you're testing for membership, so it returns essentially all the rows anyway, since there exist rows for each order_id. Commented Jan 11, 2016 at 10:39
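
To see why the last comment holds, a minimal sketch (using tsp_data and unique_order_id from the question): every order_id in the frame is by construction a member of its own set of unique values, so the isin mask is True for every row and the filter keeps all 30 of them.

    # Every order_id is trivially contained in the array of its own unique values,
    # so the boolean mask is True everywhere and nothing gets filtered out.
    mask = tsp_data['order_id'].isin(unique_order_id)
    print(mask.all())   # True
    print(mask.sum())   # 30 -- same as the total number of rows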

1 Answer

You could use pd.pivot_table, which will return one latitude value per order_id:

In [184]: tsp_data.pivot_table(index='order_id', values='latitude')
Out[184]:
order_id
519    19.119677
520    19.042117
521    19.138245
523    19.117662
524    19.137793
525    19.119372
526     0.000000
527    19.133430
528     0.000000
529    19.118284
530     0.000000
531    19.114269
532    19.136292
533    19.119075
534    19.119677
535    19.119677
Name: latitude, dtype: float64
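
Note that pivot_table aggregates the duplicated rows (its default aggfunc is the mean); because the duplicated order_ids here all carry the same latitude, the mean equals the first value. If you want "first value per order_id" to be explicit, a small sketch passing aggfunc='first':

    # Take the first latitude seen for each order_id rather than the (default) mean
    tsp_data.pivot_table(index='order_id', values='latitude', aggfunc='first')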

Or you could use drop_duplicates:

In [185]: tsp_data.drop_duplicates(subset=['order_id'])
Out[185]:
    order_id   latitude
0        519  19.119677
2        520  19.042117
5        521  19.138245
6        523  19.117662
11       524  19.137793
12       525  19.119372
13       526   0.000000
16       527  19.133430
17       528   0.000000
18       529  19.118284
19       530   0.000000
20       531  19.114269
22       532  19.136292
23       533  19.119075
26       534  19.119677
27       535  19.119677
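
If what you ultimately need is just the latitude for each unique order_id, you can select it after dropping duplicates; a sketch using the question's data (the pd.unique output above suggests order_id is stored as strings, hence the quoted key):

    # One latitude per order_id, keeping the first occurrence of each
    unique_rows = tsp_data.drop_duplicates(subset=['order_id'])
    lat_per_order = unique_rows.set_index('order_id')['latitude']
    lat_per_order.loc['523']   # 19.117662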

Or use groupby, as @EdChum suggested:
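
    # groupby + first gives one latitude per order_id; reset_index restores it as a column
    tsp_data.groupby('order_id')['latitude'].first().reset_index()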
