0

Problem:

I got an error when I tried to open a product.xls with pd.read_excel ("NDC database file - Excel version (zip format)" downloaded from https://www.fda.gov/drugs/drug-approvals-and-databases/national-drug-code-directory)

df_product = pd.read_excel("tmp/Presentaciones.xls")

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
    io = ExcelFile(io, storage_options=storage_options, engine=engine)
  File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1071, in __init__
    ext = inspect_excel_format(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 965, in inspect_excel_format
    raise ValueError("File is not a recognized excel file")
ValueError: File is not a recognized excel file

MY env:

pandas version: 1.2.4
xlrd version: 2.0.1
openpyxl version: 3.0.7

Thanks in advance.

3
  • I downloaded file and tried to read it with different engines and it gave error every time. If there is no solution, you can open the file in MS Excel and save as CSV, then you can read it with read_csv Commented Jun 17, 2021 at 12:03
  • 2
    The problem is that the files are not Excel (neither BIFF nor OOXML) at all but CSV (character separated, here using tabs as separators). Just rename the files and use read_csv with the relevant options. And report the problem to the website owner so that they can at least correct the labels! Commented Jun 17, 2021 at 12:24
  • @CharlieClark I tested it after renaming to csv directly, it works now with read csv. Thank you Commented Jun 17, 2021 at 12:33

1 Answer 1

3

I had a similar issue, where I had to read and combine a bunch of .xls files in a folder into one single dataframe. Turns out the error arose because .txt files were forcibly saved as .xls files. This also generated an error in excel upon attempting to open the file, which said

"The file format and extension of 'filename.xls' don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it. Do you want to open it anyway?"

Doing the following resolved it for me:

import glob 
import os 
import pandas as pd

path = r'C:\tmp' ## use your path

all_files = glob.glob(os.path.join(path, "*.xls"))
df_from_each_file = (pd.read_csv(f, delimiter = "\t") for f in all_files) ## reading the files using csv reader with tab delimiter
df1   = pd.concat(df_from_each_file, ignore_index=True)  ## concatenating all the individual files

If pd.read_csv doesn't work, you can also experimentally attempt to check which file reader on python is able to read your original file format.

P.S: Edited based on the comment from Yona

Sign up to request clarification or add additional context in comments.

1 Comment

Alakananda Giridhar, your solution it working for me too. But if you simply read the xls file with read_csv it work too. You don't really need to change the extension to txt before it.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.