subsetting pandas dataframe

Question

I have found an inconsistency (at least to me) in the following two approaches:

For a dataframe defined as:

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

I would like to access the element in the 1st row, 4th column (counting from 0). I either do this:

df[4][1]
Out[94]: 5.0

Or this:

df.iloc[1,4]
Out[95]: 5.0

Am I correctly understanding that in the first approach I need to use the column first and then the rows, and vice versa when using iloc? I just want to make sure that I use both approaches correctly going forward.

EDIT: Some of the answers below have pointed out that the first approach is not as reliable, and I see now that this is why:

df.index = ['7','88']
df[4][1]
Out[101]: 5.0

I still get the correct result. But with an integer index, the same call raises an exception if the corresponding label no longer exists:

df.index = [7,88]
df[4][1]   
KeyError: 1

Also, changing the column names:

df.columns = ['4','5','6','1','5']
df['4'][1]
Out[108]: 8

Gives me a different result. So overall, I should stick to iloc or loc to avoid these issues.

Yes, but with the first case you can't always guarantee it'll work. However, with positional indexing, the indices are interpreted consistently. I'd stick to using loc or iloc or at or iat almost always unless there's no possibility of ambiguity.
– cs95, Commented Jan 4, 2018 at 5:06
You mean, the first approach won't work if I change the names of my rows and columns, right?
– Niccola Tartaglia, Commented Jan 4, 2018 at 5:26
Yes, it will work only if you don't have any column names, or if your column names are [0,1,2,3,4]. Else, it will either fail or give you a wrong result.
– FatihAkici, Commented Jan 4, 2018 at 5:33

FatihAkici · Accepted Answer · 2018-01-04 05:35:57Z

2

You should think of a DataFrame as a collection of columns. When you do df[4], you get the column of df labeled 4, which is a pandas Series. When you then do df[4][1], you get the element of that Series labeled 1, which here corresponds to the entry in row 1, column 4 of the DataFrame (counting from 0). That is exactly what df.iloc[1,4] returns.

Therefore, there is no inconsistency, but beware: this works only if you haven't assigned any column names, or if your column names are [0,1,2,3,4]. Otherwise, it will either fail or give you a wrong result. Hence, stick with iloc for positional indexing and loc for label-based indexing.
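As a minimal sketch of the point above, the following reuses the DataFrame from the question and then renames the columns. The string names "a" through "e" are an assumption made only for illustration; they do not appear in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

# With default labels, df[4] selects the column labeled 4 as a Series.
col = df[4]
is_series = isinstance(col, pd.Series)

# Chained lookup (column label, then row label) agrees with iloc here.
chained = df[4][1]
positional = df.iloc[1, 4]

# Hypothetical string column names, chosen only for illustration.
renamed = df.copy()
renamed.columns = ["a", "b", "c", "d", "e"]

# The label 4 no longer exists, so the chained form fails...
try:
    renamed[4][1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# ...while iloc still addresses the same position.
still_positional = renamed.iloc[1, 4]
```

The chained form depends on the labels happening to match the positions, whereas iloc does not look at labels at all.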

edited Jan 4, 2018 at 5:35

answered Jan 4, 2018 at 5:21

FatihAkici


1 Comment

Niccola Tartaglia Over a year ago

I see, makes sense.

gepcel · Accepted Answer · 2018-01-04 05:55:15Z

2

Unfortunately, you are not using them correctly. It's just coincidence you get the same result.

df.loc[i, j] means the element in df with the row named i and the column named j.

Among many other differences, df[j] means the column named j, and df[j][i] means the element named i within the column named j (here, a row label).

df.iloc[i, j] means the element in the i-th row and the j-th column, counting from 0.

So df.loc selects data by label (a string, an int, or any other type; int in this case), while df.iloc selects data by position. It's just a coincidence that in your example the i-th row is also named i.

For more details, you should read the documentation.
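To make the label versus position distinction concrete, here is a small sketch on a frame whose labels do not match its positions. The row labels 10 and 20 and the column labels "x", "y", "z" are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical labels: rows 10 and 20, columns "x", "y", "z".
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=[10, 20],
    columns=["x", "y", "z"],
)

by_label = df.loc[20, "z"]     # row labeled 20, column labeled "z"
by_position = df.iloc[1, 2]    # second row, third column (positions from 0)
chained = df["z"][20]          # column labeled "z", then row labeled 20
```

All three expressions address the same cell, but only iloc would keep working unchanged if the labels were different.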

Update:

Think of df[4][1] as a convenient shortcut. There is some logic behind it, so under most circumstances you'll get what you want.

In fact

df.index = ['7', '88']
df[4][1]

works because the dtype of the index is str and you pass the int 1, so it falls back to positional indexing. If you run:

df.index = [7, 88]
df[4][1]

it will raise an error. And

df.index = [1, 0]
df[4][1]

still won't return the element you expect, because it is not the row at position 1 (counting from 0); it is the row with the label 1.
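The two integer-index cases above can be checked with a short sketch. It rebuilds the DataFrame from the question and only uses calls shown in this thread; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

# Row labels reversed: label 1 now sits at position 0.
df.index = [1, 0]
by_label = df[4][1]          # row labeled 1, i.e. the first row (NaN)
by_position = df.iloc[1, 4]  # row at position 1, i.e. the second row

# Integer row labels that do not include 1: the label lookup fails.
df.index = [7, 88]
try:
    df[4][1]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
```

With an integer index, the integer passed to the chained form is treated as a label, so it either picks a different row than iloc or fails outright.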

edited Jan 4, 2018 at 5:55

answered Jan 4, 2018 at 5:17

gepcel


5 Comments

Niccola Tartaglia Over a year ago

I see, so as long as my rows are ordered 0 to n, this will work, but if they are named differently (say I happen to start ordering at 1 instead of 0) then this won't be consistent. Overall, I should probably stick to using iloc.

gepcel Over a year ago

Yes, to select data by position, I strongly suggest you use df.iloc[i, j]

Niccola Tartaglia Over a year ago

Actually I noticed that the rows are not the problem, even if they are named differently this still selects the correct row. The columns are what can cause a problem. See my edit above.

gepcel Over a year ago

Not exactly. If the dtype of the index (rows) is str and you use df[4][1], you may still get the result. Try df.index = [0, 2], with no int 1 among the row labels: df[4][1] will raise an exception.

Niccola Tartaglia Over a year ago

I see, yes, that makes sense. Thank you so much for taking the time to explain that. I was using str and that makes a difference of course. I will change my edit in the original post above. Highly appreciate it!!!
