I have a large catalog that I am selecting data from according to the following criteria:

columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep=r'\s+', names=columns)

# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)

new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))

With the above cuts the code works fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (meaning the value is not available). Luckily, I can calculate logg from the other available parameters. So here are my updated cuts:

# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)

However, I am receiving an error on the line if catalog.logg[i] == -1:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Can someone please explain what I am doing wrong and how I can fix it? Thank you.

Edit 1

My dataframe looks like the following:

Data columns:
System           477  non-null values
rp               477  non-null values
mp               477  non-null values
logg             477  non-null values
dtypes: float64(37), int64(3), object(3)None

Edit 2

 System  rp  mp  logg   FeH  FeHu  FeHl  Mstar  Mstaru  Mstarl  
0  target-01  5196     24     24  0.31  0.04  0.04  0.905   0.015   0.015   
1  target-02  5950    150    150 -0.30  0.25  0.25  0.950   0.110   0.110   
2  target-03  5598     50     50  0.04  0.05  0.05  0.997   0.049   0.049   
3  target-04  6558     44     -1  0.14  0.04  0.04  1.403   0.061   0.061   
4  target-05  6190     60     60  0.05  0.07  0.07  1.194   0.049   0.050   

....

[5 rows x 43 columns]

Edit 3

Written as a loop, which is the form I understand, my code would be:

selected = []  # row positions kept for further analysis
for row in range(len(catalog)):
    parameter = catalog['logg'][row]
    if parameter == -1:
        parameter = catalog['mp'][row] / catalog['rp'][row]
    if 4.0 < parameter < 5.0:
        selected.append(row)  # select this row for further analysis

However, I am trying to write my code in a simpler, more idiomatic way. I don't want to use the for loop. How can I do it?

Edit 4

Consider the following small example:

System     rp   mp    logg
target-01  2    -1     2     # will NOT be selected since mp = -1
target-02  -1    3     4     # will NOT be selected since rp = -1
target-03  7     6     4.3   # will be selected since mp != -1, rp != -1, and 4 < logg <5
target-04  3.2    15    -1   # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)
  • Could you show what your dataframe looks like? Commented Nov 23, 2015 at 10:25
  • @AntonProtopopov, I edited the question. Please check it out. My df has more columns than the one I posted. I removed them for simplicity. Commented Nov 23, 2015 at 10:31
  • BTW, what are mp[i] and rp[i]? Shouldn't they be catalog.mp[i] and catalog.rp[i]? Commented Nov 23, 2015 at 10:42
  • Yes, you are right! But the error still persists. @AntonProtopopov Commented Nov 23, 2015 at 10:44
  • AFAIU you attached the describe output, but could you show the actual data? Like df.head(10)? Commented Nov 23, 2015 at 10:58

2 Answers

1

You get the error because catalog.logg[i] is not a scalar but a Series, so comparing it with -1 yields one boolean per row and if cannot reduce that to a single truth value. Use a vectorized operation instead:

catalog.loc[i,'logg'] = catalog.loc[i,'mp']/catalog.loc[i,'rp']

which modifies the logg column in place for every row selected by i
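
To make the cause of the error concrete, here is a minimal sketch, assuming a small hypothetical catalog built from the four rows in Edit 4 of the question. The names valid, missing and raised are illustrative and do not come from the original post. The sketch also narrows the assignment to rows whose logg is -1, which is an assumption about the intended behaviour based on the question, not something this answer's one-liner does.

```python
import pandas as pd

# Hypothetical stand-in for the real catalog (values taken from Edit 4).
catalog = pd.DataFrame({
    "System": ["target-01", "target-02", "target-03", "target-04"],
    "rp": [2.0, -1.0, 7.0, 3.2],
    "mp": [-1.0, 3.0, 6.0, 15.0],
    "logg": [2.0, 4.0, 4.3, -1.0],
})

valid = (catalog["rp"] != -1) & (catalog["mp"] != -1)

# Comparing a Series with a scalar gives one boolean per row, so `if`
# has no single True/False to test and pandas raises ValueError.
try:
    if catalog.loc[valid, "logg"] == -1:
        pass
    raised = False
except ValueError:
    raised = True

# Vectorized replacement: restrict the assignment to rows that are valid
# AND have a missing logg, so measured logg values are left untouched.
missing = valid & (catalog["logg"] == -1)
catalog.loc[missing, "logg"] = (
    catalog.loc[missing, "mp"] / catalog.loc[missing, "rp"]
)

print(raised)
print(catalog)
```

The boolean mask plays the role of the if statement: the condition is evaluated for all rows at once, and the assignment only touches the rows where it holds.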

As for edit 3:

rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]

which will select rows that satisfy the condition

7 Comments

The error is from the for loop. I also tried if catalog.loc[i, 'logg'] == -1: and now the error I get is AttributeError: 'DataFrame' object has no attribute 'loc'
This same code works in my case. .loc is for slicing pandas DataFrames; catalog.loc[i, 'logg'] == -1 is still a Series, so you would still get the ambiguous truth value error. You should either iterate through the Series one element at a time or use vectorized operations.
Can you give me more details? I am new to pandas and I don't want to use a for loop.
catalog.loc[i, 'logg'] returns the sub-series of the logg column where the condition i is True, so you can modify it directly by dividing the corresponding sub-series of mp by that of rp; the / operator applied to two Series performs an element-wise division.
I understand yet I do not know how to implement it. How did you fix the for loop?
0

Instead of that code:

if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]

You could use following:

i &= df.logg == -1
df.loc[i, 'logg'] = df.loc[i, 'mp'] / df.loc[i, 'rp']
# or, on old pandas versions only (.ix was deprecated in 0.20 and removed in 1.0)
df.ix[i, 'logg'] = df.ix[i, 'mp'] / df.ix[i, 'rp']

For your Edit 3, you need to add this line:

your_rows = df[(df.logg > 4) & (df.logg < 5)]

Full code:

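# .loc needs pandas >= 0.11; on older versions use .ix (removed in pandas 1.0)
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]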

EDIT

Probably I still don't understand what you want, but I get your desired output:

import pandas as pd
from io import StringIO

data = """
System     rp   mp    logg
target-01  2    -1     2     
target-02  -1    3     4     
target-03  7     6     4.3   
target-04  3.2    15    -1   
"""

catalog = pd.read_csv(StringIO(data), sep=r'\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]

In [7]: your_rows
Out[7]:
      System   rp  mp    logg
2  target-03  7.0   6  4.3000
3  target-04  3.2  15  4.6875

Am I still wrong?

4 Comments

It is not working. With the full code you provided, the script selects all the targets that have logg = -1. That is not what I want.
So I didn't understand what you want. AFAIU, you need to select all rows where (catalog.rp != -1) & (catalog.mp != -1), then replace logg in all rows where df.logg == -1 with df.ix[i, 'mp'] / df.ix[i, 'rp'], and then choose all rows from the modified df where (df.logg > 4) & (df.logg < 5). What exactly do you want?
What I want is the following: First, I select all rows where (catalog.rp != -1) & (catalog.mp != -1). Second, IF catalog.logg == -1, replace -1 with catalog.mp / catalog.rp. Third, select all the entries (from the newly modified AND original df) where (catalog.logg > 4) & (catalog.logg < 5).
In edit 4 I provided an example. Thank you for your patience.
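
The three steps described in the comments above could be sketched as follows. This is one possible approach, not the answerer's code: it uses Series.where to build a filled-in logg without modifying the catalog, then combines the rp/mp validity mask with the range cut. The function name select_targets and its lo/hi parameters are hypothetical, and the sample data are the four rows from Edit 4. The range is treated as strict (4 < logg < 5), following Edit 3 and Edit 4 rather than the <= used in the first attempt.

```python
import pandas as pd


def select_targets(catalog, lo=4.0, hi=5.0):
    """Return rows with valid rp/mp whose (possibly recomputed) logg lies in (lo, hi)."""
    # Step 1: rows where both rp and mp are available.
    valid = (catalog["rp"] != -1) & (catalog["mp"] != -1)
    # Step 2: keep logg where it is available, otherwise fall back to mp / rp.
    logg = catalog["logg"].where(catalog["logg"] != -1,
                                 catalog["mp"] / catalog["rp"])
    # Step 3: strict range cut, combined with the validity mask.
    in_range = (logg > lo) & (logg < hi)
    # assign() aligns on the index, so each kept row gets its own filled-in logg.
    return catalog.loc[valid & in_range].assign(logg=logg)


# Hypothetical sample data: the four rows from Edit 4.
catalog = pd.DataFrame({
    "System": ["target-01", "target-02", "target-03", "target-04"],
    "rp": [2.0, -1.0, 7.0, 3.2],
    "mp": [-1.0, 3.0, 6.0, 15.0],
    "logg": [2.0, 4.0, 4.3, -1.0],
})

selected = select_targets(catalog)
print(selected)
```

Because the fallback is computed on a copy of the column, the -1 placeholders in the original catalog stay intact, which may be useful if later cuts need to know which logg values were measured and which were derived.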
