Pandas read_excel wrong output

Question

Update: so I'm messing around with the two files, I copied the data from DOB column from the 2nd file to the 1st file to make the files look visually identical. However I notice some really interesting behaviour when using Ctrl+F in Microsoft Excel. When I leave the search box blank in the first file it finds no matches. However when I repeat the same operation with the 2nd file it finds 21 matches from each cell between E1 to G7. I suppose somehow there are some blank/invisible cells in the 2nd file - and that's what's causing read_excel to behave differently.

My goal is to simply execute the pandas read_excel function. However, I'm running into a very strange situation in which I am trying to run pandas.read_excel on two very similar excel files but am getting substantially different results.

Code

import pandas

data1 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_1.xlsx')
print(data1)
data2 = pandas.read_excel(r'D:\User Files\Downloads\Programming\stackoverflow\test_2.xlsx')
print(data2)

Output

        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
                             Name  DOB  Class Year  GPA
Redeye438 2011-09-20 Fresh      1  NaN         NaN  NaN
Redeye439 2010-09-22 Soph       2  NaN         NaN  NaN
Redeye440 2009-09-20 Junior     3  NaN         NaN  NaN
Redeye441 2008-09-22 Senior     4  NaN         NaN  NaN
Redeye442 2011-09-22 Fresh      4  NaN         NaN  NaN
Redeye443 2010-09-20 Soph       3  NaN         NaN  NaN

Why are the columns mapped incorrectly for data2?

The excel files in question (the only difference is the data in the DOB column): Excel file downloads

try to instal openpyxl and xlrd, these are pandas dependencies helping to read excel files — NoobVB
– NoobVB, Commented Sep 21, 2022 at 18:55
Thanks for the suggestion @NoobVB but this didn't work. I think it's something with the excel files. — holts-shoe
– holts-shoe, Commented Sep 21, 2022 at 19:00
I installed/updated the dependencies incorrectly. You were correct, thank you. — holts-shoe
– holts-shoe, Commented Sep 21, 2022 at 19:29

Timeless · Accepted Answer · 2022-09-21 19:00:46Z

It looks like you're using an older version of Pandas because I can't reproduce the issue on the latest version.

import pandas as pd 

pd.show_versions()

INSTALLED VERSIONS
------------------
python           : 3.10.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
pandas           : 1.5.0

As you can see below, the columns of data2 are mapped correctly.

import pandas as pd

data1 = pd.read_excel(r"C:\Users\abokey\Downloads\test_1.xlsx")
print(data1)
data2 = pd.read_excel(r"C:\Users\abokey\Downloads\test_2.xlsx")
print(data2)

        Name        DOB Class Year  GPA
0  Redeye438 2008-09-22      Fresh    1
1  Redeye439 2009-09-20       Soph    2
2  Redeye440 2010-09-22     Junior    3
3  Redeye441 2011-09-20     Senior    4
4  Redeye442 2008-09-20      Fresh    4
5  Redeye443 2009-09-22       Soph    3
        Name        DOB Class Year  GPA
0  Redeye438 2011-09-20      Fresh    1
1  Redeye439 2010-09-22       Soph    2
2  Redeye440 2009-09-20     Junior    3
3  Redeye441 2008-09-22     Senior    4
4  Redeye442 2011-09-22      Fresh    4
5  Redeye443 2010-09-20       Soph    3

However, you're right because the format of the two Excel files are not the same. In fact, when looking at test2.xlsx more carefully, this one seems to carry 7x3 blank cells (rows/cols). The latest version of pandas seems to handle this kind of Excel file since the empty cells are ignored when calling pandas.read_excel.

So in principle, upgrading your pandas version should fix the problem :

pip install --upgrade pandas

If the problem persists, clean your Excel file test2.xlsx by doing this :

Open up the file in Excel
Click on Find & Select
Choose Go To Special and Blanks
Click on Clear All
Save your changes

Thanks @abokey, upgrading pandas from version 1.2.1 to 1.5.0 solved the issue. Additionally removing the blank cells was a good second solution. Thanks very much for your thorough help.

Collectives™ on Stack Overflow

Pandas read_excel wrong output

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related