
I have found what looks like an inconsistency (at least to me) between the following two approaches:

For a dataframe defined as:

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

I would like to access the element in row 1, column 4 (counting from 0). I either do this:

df[4][1]
Out[94]: 5.0

Or this:

df.iloc[1,4]
Out[95]: 5.0

Am I correctly understanding that in the first approach I need to use the column first and then the rows, and vice versa when using iloc? I just want to make sure that I use both approaches correctly going forward.

EDIT: Some of the answers below have pointed out that the first approach is not as reliable, and I see now that this is why:

df.index = ['7','88']
df[4][1]
Out[101]: 5.0

I still get the correct result. But using int labels instead will raise an exception if the corresponding number is no longer in the index:

df.index = [7,88]
df[4][1]   
KeyError: 1

Also, changing the column names:

df.columns = ['4','5','6','1','5']
df['4'][1]
Out[108]: 8

Gives me a different result. So overall, I should stick to iloc or loc to avoid these issues.

6
  • Yes, but with the first case you can't always guarantee it'll work. However, with positional indexing, the indices are interpreted consistently. I'd stick to using loc or iloc or at or iat almost always unless there's no possibility of ambiguity. Commented Jan 4, 2018 at 5:06
  • You mean, the first approach won't work if I change the names of my rows and columns right? Commented Jan 4, 2018 at 5:26
  • Yes, it will work only if you don't have any column names, or if your column names are [0,1,2,3,4]. Else, it will either fail or give you a wrong result. Commented Jan 4, 2018 at 5:33
  • That makes sense. Thank you so much for the help. Commented Jan 4, 2018 at 5:38
  • And you should also test df.index = [0, 2]; df[4][1] Commented Jan 4, 2018 at 5:44

2 Answers

2

You should think of a DataFrame as a collection of columns. When you write df[4], you get the column of df labeled 4, which is a pandas Series. When you then write df[4][1], you get the element of that Series labeled 1. With the default labels, this is the entry in row 1, column 4 (counting from 0), which is exactly what df.iloc[1,4] returns.

So there is no inconsistency at all, but beware: this works only if you have not set any column names, or if your column names are [0,1,2,3,4]. Otherwise it will either fail or give you a wrong result. For positional indexing, stick with iloc; for label-based indexing, use loc.
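To make the two steps concrete, here is a minimal sketch using the frame from the question. The variable names (col, chained, positional, by_label) are only for illustration, and the three lookups agree here only because the default labels 0..n coincide with the positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

# Step 1: df[4] looks up the column *labeled* 4 and returns a Series.
col = df[4]

# Step 2: [1] looks up the element *labeled* 1 in that Series.
chained = col[1]

# Positional lookup: row position 1, column position 4.
positional = df.iloc[1, 4]

# Label lookup: row label 1, column label 4.
by_label = df.loc[1, 4]

# All three agree only because labels and positions coincide here.
print(chained, positional, by_label)
```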


1 Comment

I see, makes sense.
2

Unfortunately, you are not using them correctly. It's just coincidence you get the same result.

df.loc[i, j] means the element in df with the row named i and the column named j

Among many other differences, df[j] means the column named j, and df[j][i] means the element named i (a row label here) within the column named j.

df.iloc[i, j] means the element in the i-th row and the j-th column, counting from 0.

So df.loc selects data by label (string, int, or any other type; int in this case), while df.iloc selects data by position. It is just a coincidence that in your example the i-th row is also named i.
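The difference is easier to see when labels and positions cannot be confused. A small sketch with a made-up frame (df2, its labels "a"/"b" and "x"/"y", and its values are hypothetical and not from the question):

```python
import pandas as pd

# Hypothetical frame with string labels, so a label can never be mistaken for a position.
df2 = pd.DataFrame(
    [[10, 20], [30, 40]],
    index=["a", "b"],
    columns=["x", "y"],
)

by_label = df2.loc["b", "x"]     # row named "b", column named "x"
by_position = df2.iloc[1, 0]     # second row, first column

# at / iat are the scalar-only counterparts of loc / iloc.
scalar_by_label = df2.at["b", "x"]
scalar_by_position = df2.iat[1, 0]

print(by_label, by_position, scalar_by_label, scalar_by_position)
```

All four expressions address the same cell; only the way the cell is addressed differs.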

For more details, you should read the documentation.

Update:

Think of df[4][1] as a convenience. There is some logic behind it, so under most circumstances you'll get what you want.

In fact

df.index = ['7', '88']
df[4][1]

works because the dtype of the index is str. You pass the int 1, so it falls back to positional indexing. If you run:

df.index = [7, 88]
df[4][1]

it will raise an error. And

df.index = [1, 0]
df[4][1]

still won't give you the element you expect, because it is not the row at position 1 counting from 0. It is the row named 1.
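The two integer-index cases above can be checked with a short sketch like the following. The variable names are only for illustration. The string-index case is deliberately left out: as far as I know, newer pandas versions deprecate that positional fallback, so its behavior depends on the version you run.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, np.nan], [8, 2, 0, 4, 5]])

# Integer row labels that do not contain 1: df[4][1] is a label lookup and fails.
df.index = [7, 88]
try:
    df[4][1]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# iloc ignores the labels entirely, so it still finds the second row.
value_by_position = df.iloc[1, 4]

# Reordered integer labels: the row *named* 1 is now the first row.
df.index = [1, 0]
value_by_label = df[4][1]          # first row, column 4 (NaN in this frame)
value_after_reorder = df.iloc[1, 4]  # still the second row

print(raised_key_error, value_by_position, value_by_label, value_after_reorder)
```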

5 Comments

I see, so as long as my rows are ordered 0 to n, this will work, but if they are named differently (say I happen to start numbering at 1 instead of 0) then this won't be consistent. Overall, I should probably stick to using iloc.
Yes, to select data by position, I strongly suggest you use df.iloc[i, j]
Actually I noticed that the rows are not the problem, even if they are named differently this still selects the correct row. The columns are what can cause a problem. See my edit above.
Not exactly. If the dtype of the index (rows) is str and you use df[4][1], you may still get the result. Try df.index = [0, 2], which has no int 1 among the row names: df[4][1] will raise an exception.
I see, yes, that makes sense. Thank you so much for taking the time to explain that. I was using str and that makes a difference of course. I will change my edit in the original post above. Highly appreciate it!!!
