Selecting data using pandas

Question

I have a large catalog that I am selecting data from according to the following criteria:

import pandas as pd

columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep=r'\s+', names=columns)

# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)

new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))

With the above cuts, the code works fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (which means the value is not available). Luckily, I can calculate logg from the other available parameters. So here are my updated cuts:

# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)

However, I am receiving an error on the line if catalog.logg[i] == -1:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Can someone please explain what I am doing wrong and how I can fix it? Thank you.

Edit 1

My dataframe looks like the following:

Data columns:
System           477  non-null values
rp               477  non-null values
mp               477  non-null values
logg             477  non-null values
dtypes: float64(37), int64(3), object(3)None

Edit 2

      System    rp   mp  logg   FeH  FeHu  FeHl  Mstar  Mstaru  Mstarl
0  target-01  5196   24    24  0.31  0.04  0.04  0.905   0.015   0.015
1  target-02  5950  150   150 -0.30  0.25  0.25  0.950   0.110   0.110
2  target-03  5598   50    50  0.04  0.05  0.05  0.997   0.049   0.049
3  target-04  6558   44    -1  0.14  0.04  0.04  1.403   0.061   0.061
4  target-05  6190   60    60  0.05  0.07  0.07  1.194   0.049   0.050

....

[5 rows x 43 columns]

Edit 3

Written as a loop, which is the form I understand, my code would be:

selected_rows = []
for row in range(len(catalog)):
    parameter = catalog['logg'][row]
    if parameter == -1:
        # logg is not available, so compute it from mp and rp
        parameter = catalog['mp'][row] / catalog['rp'][row]
    if 4.0 < parameter < 5.0:
        selected_rows.append(row)  # select this row for further analysis

However, I am trying to write my code in a simpler and more professional way. I don't want to use the for loop. How can I do it?

EDIT 4

Consider the following small example:

System      rp    mp    logg
target-01   2     -1    2      # will NOT be selected since mp = -1
target-02   -1    3     4      # will NOT be selected since rp = -1
target-03   7     6     4.3    # will be selected since mp != -1, rp != -1, and 4 < logg < 5
target-04   3.2   15    -1     # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)

@AntonProtopopov, I edited the question. Please check it out. My df has more columns than the one I posted. I removed them for simplicity.
– aloha, Commented Nov 23, 2015 at 10:31
BTW, what are mp[i] and rp[i]? Should they be catalog.mp[i] and catalog.rp[i]?
– Anton Protopopov, Commented Nov 23, 2015 at 10:42
Yes, you are right! But the error still persists. @AntonProtopopov
– aloha, Commented Nov 23, 2015 at 10:44
AFAIU you attached the describe output, but could you show the actual data, e.g. df.head(10)?
– Anton Protopopov, Commented Nov 23, 2015 at 10:58

Samuelliyi · Accepted Answer · 2015-11-23 12:17:15Z

1

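You get the error because catalog.logg[i] is not a scalar but a Series, so an if statement cannot decide its truth value. Use a vectorized assignment restricted to the rows where logg is -1 instead:

missing = i & (catalog.logg == -1)
catalog.loc[missing, 'logg'] = catalog.loc[missing, 'mp'] / catalog.loc[missing, 'rp']

This modifies the logg column in place.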
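To see where the error comes from, here is a minimal sketch on a made-up three-element Series (the values are illustrative, not taken from the catalog). Comparing a Series with -1 gives one boolean per row, and Python's if cannot reduce that to a single True or False:

```python
import pandas as pd

# Hypothetical logg values, for illustration only
logg = pd.Series([4.3, -1.0, 2.0])

# Element-wise comparison: the result is a boolean Series, not a single bool
is_missing = logg == -1
print(is_missing.tolist())

message = None
try:
    if is_missing:  # a multi-element Series has no single truth value
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Reductions such as .any() or .all() do return a single bool
any_missing = bool(is_missing.any())
print(any_missing)
```

Note that .any() only answers whether at least one value is missing. To change just the affected rows, the boolean Series itself is used as a mask, as in the .loc assignment.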

As for edit 3:

rows = catalog.loc[i & (catalog.logg > 4) & (catalog.logg < 5)]

This selects the rows that pass the rp and mp cuts (the mask i) and have 4 < logg < 5.
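Putting the steps together, the following is a sketch on the four-row example from Edit 4 of the question. The DataFrame is built by hand here instead of being read from data.txt, and the names valid, missing and selected are only illustrative:

```python
import pandas as pd

# The small example from Edit 4 of the question, built by hand
catalog = pd.DataFrame({
    "System": ["target-01", "target-02", "target-03", "target-04"],
    "rp": [2.0, -1.0, 7.0, 3.2],
    "mp": [-1.0, 3.0, 6.0, 15.0],
    "logg": [2.0, 4.0, 4.3, -1.0],
})

# Step 1: rows where both rp and mp are available
valid = (catalog.rp != -1) & (catalog.mp != -1)

# Step 2: among those, fill in logg only where it is missing (-1)
missing = valid & (catalog.logg == -1)
catalog.loc[missing, "logg"] = catalog.loc[missing, "mp"] / catalog.loc[missing, "rp"]

# Step 3: keep valid rows with 4 < logg < 5
selected = catalog[valid & (catalog.logg > 4) & (catalog.logg < 5)]
print(selected)
```

Keeping valid and missing as separate masks means the rp and mp cuts can be reused in the final selection.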

edited Nov 23, 2015 at 12:17

answered Nov 23, 2015 at 10:38

Samuelliyi


7 Comments

aloha Over a year ago

The error is from the for loop. I also tried if catalog.loc[i, 'logg'] == -1: and now the error I get is AttributeError: 'DataFrame' object has no attribute 'loc'

Samuelliyi Over a year ago

This same code works in my case. .loc is for slicing pandas DataFrames. catalog.loc[i, 'logg'] == -1 is still a Series, so you would still get the ambiguous truth value error. You should either iterate through the Series one element at a time or use vectorized operations.

aloha Over a year ago

Can you give me more details? I am new to pandas and I don't want to use a for loop.

Samuelliyi Over a year ago

catalog.loc[i, 'logg'] returns the sub-series of the logg column where the condition i is True, so you can modify it directly by assigning the corresponding sub-series of mp divided by rp. The / operator applied to two Series does an element-wise division.

aloha Over a year ago

I understand yet I do not know how to implement it. How did you fix the for loop?


Anton Protopopov · Accepted Answer · 2015-11-23 13:11:57Z

0

Instead of that code:

if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]

You could use the following:

missing = i & (catalog.logg == -1)
catalog.loc[missing, 'logg'] = catalog.loc[missing, 'mp'] / catalog.loc[missing, 'rp']

For your edit 3, you need to add this line:

your_rows = catalog[i & (catalog.logg > 4) & (catalog.logg < 5)]

Full code:

i = (catalog.rp != -1) & (catalog.mp != -1)
missing = i & (catalog.logg == -1)
catalog.loc[missing, 'logg'] = catalog.loc[missing, 'mp'] / catalog.loc[missing, 'rp']
your_rows = catalog[i & (catalog.logg > 4) & (catalog.logg < 5)]
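If you would rather not overwrite the logg column, one possible alternative (not part of the original answer) is Series.where, which keeps values where a condition holds and takes them from another Series elsewhere. The sketch below assumes the four-row example from the question's Edit 4; logg_filled is a hypothetical name:

```python
import pandas as pd

# The small example from Edit 4 of the question, built by hand
catalog = pd.DataFrame({
    "System": ["target-01", "target-02", "target-03", "target-04"],
    "rp": [2.0, -1.0, 7.0, 3.2],
    "mp": [-1.0, 3.0, 6.0, 15.0],
    "logg": [2.0, 4.0, 4.3, -1.0],
})

valid = (catalog.rp != -1) & (catalog.mp != -1)

# Keep logg where it is available; otherwise use mp / rp.
# The result is a new Series, so catalog.logg itself is left unchanged.
logg_filled = catalog.logg.where(catalog.logg != -1, catalog.mp / catalog.rp)

selected = catalog[valid & (logg_filled > 4) & (logg_filled < 5)]
print(selected)
```

Here the selected rows still show logg = -1 for target-04, because only the temporary logg_filled Series carries the computed value.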

EDIT

Probably I still don't understand what you want, but I get your desired output:

import pandas as pd
from io import StringIO

data = """
System     rp   mp    logg
target-01  2    -1     2     
target-02  -1    3     4     
target-03  7     6     4.3   
target-04  3.2    15    -1   
"""

catalog = pd.read_csv(StringIO(data), sep=r'\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
missing = i & (catalog.logg == -1)
catalog.loc[missing, 'logg'] = catalog.loc[missing, 'mp'] / catalog.loc[missing, 'rp']
your_rows = catalog[i & (catalog.logg > 4) & (catalog.logg < 5)]

In [7]: your_rows
Out[7]:
      System   rp  mp    logg
2  target-03  7.0   6  4.3000
3  target-04  3.2  15  4.6875

Am I still wrong?

edited Nov 23, 2015 at 13:11

answered Nov 23, 2015 at 11:30

Anton Protopopov


4 Comments

aloha Over a year ago

It is not working. With the full code you provided, the script selects all the targets that have logg = -1. That is not what I want.

Anton Protopopov Over a year ago

So I didn't understand what you want. AFAIU, you need to select all rows where (catalog.rp != -1) & (catalog.mp != -1), then replace logg in all rows where df.logg == -1 with df.ix[i, 'mp'] / df.ix[i, 'rp'], and then choose all rows from the modified df where (df.logg > 4) & (df.logg < 5). What exactly do you want?

aloha Over a year ago

What I want is the following: First, I select all (catalog.rp != -1) & (catalog.mp != -1). Second, IF catalog.logg == -1, replace -1 by catalog.mp / catalog.rp. Third, select all the entries (from the newly modified AND original df) where (catalog.logg > 4) & (catalog.logg < 5).

aloha Over a year ago

In edit 4 I provided an example. Thank you for your patience.
