
I have about 5,400 Excel files in multiple (sub)folders and want to load them into a single DataFrame. Each file has only one sheet and can have 2,000+ rows. The total number of rows is expected to be 2 million or more.

My computer has an SSD and 8 GB of memory, and is pretty fast. Still, it takes hours to complete. Is there anything wrong with my code? I'd appreciate any tips.

%%time
import glob
import pandas as pd

files = glob.glob('asyncDatas/**/*.xlsx', recursive=True)

df = pd.DataFrame()

for num, fname in enumerate(files, start=1):
    print("File #{} | {}".format(num, fname))
    if len(fname) > 0:
        data = pd.read_excel(fname, sheet_name='Sheet0', index_col='Time', skiprows=3)
        df = df.append(data)

df.head()

My hunch is that the .append method takes too much time because it likely reallocates memory on every call. Would .concat() maybe be the better approach?

2 Answers


First append each DataFrame to a list, then call concat only once at the end. I'm still not sure whether 8 GB of RAM is enough, but I hope so:

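# Collect the per-file DataFrames in a plain list (cheap to grow).
dfs = []

for num, fname in enumerate(files, start=1):
    print("File #{} | {}".format(num, fname))
    if len(fname) > 0:
        data = pd.read_excel(fname, sheet_name='Sheet0', index_col='Time', skiprows=3)
        dfs.append(data)

# Concatenate once. Note: ignore_index=True replaces the 'Time' index
# with 0..n-1; drop the argument if the 'Time' index should be kept.
df = pd.concat(dfs, ignore_index=True)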
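
As a rough illustration of why a single concat helps: each DataFrame.append call returns a new DataFrame, so everything accumulated so far gets copied again on every iteration. The sketch below only counts copied rows and does not use pandas at all. The function names are made up for this example, and the figure of 370 rows per file is an assumption (roughly 2 million rows spread over 5,400 files).

```python
def rows_copied_incremental(n_files, rows_per_file):
    """Rows copied when the result is rebuilt after every file."""
    total_copied = 0
    current_size = 0
    for _ in range(n_files):
        current_size += rows_per_file   # the new frame holds old rows + new rows
        total_copied += current_size    # all of them are written into a fresh object
    return total_copied


def rows_copied_single_concat(n_files, rows_per_file):
    """Rows copied when all frames are concatenated once at the end."""
    return n_files * rows_per_file


if __name__ == "__main__":
    n_files, rows_per_file = 5400, 370  # assumed average, see text above
    slow = rows_copied_incremental(n_files, rows_per_file)
    fast = rows_copied_single_concat(n_files, rows_per_file)
    print("incremental:", slow, "rows copied")
    print("single concat:", fast, "rows copied")
    print("ratio:", slow / fast)
```

The ratio works out to (n_files + 1) / 2, so the cost of the loop grows quadratically with the number of files while the single concat grows linearly. The actual time is dominated by other things too (parsing Excel is slow in itself), so treat this as an explanation of the trend, not a benchmark.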

1 Comment

That did the trick, thanks! FYI: the DataFrame uses about 2.4 GB, and all files loaded in 20 minutes.

Loading Excel data into Pandas is notoriously slow. Your first option is to use pd.concat once on a list of dataframes as described by jezrael.

Otherwise, you have a couple of options:

  1. Convert your Excel files to CSV efficiently outside of Python. For example, see this answer. Pandas performs better reading CSV files. You may see an extra improvement if you convert to csv.gz (gzipped).
  2. Consider categorical data to reduce memory usage, reading the data in chunks, or lazy operations via a library. See this answer for more details.
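
To make the categorical suggestion a bit more concrete, here is a minimal sketch. The helper name to_categorical and the 'sensor' and 'value' columns are hypothetical and not taken from the question's files; the idea is only that a text column with few distinct values is stored far more compactly as a category.

```python
import pandas as pd


def to_categorical(df, columns):
    """Return a copy of df with the given columns stored as 'category'."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


if __name__ == "__main__":
    # Hypothetical data: a repetitive text column next to a numeric one.
    df = pd.DataFrame({"sensor": ["left", "right"] * 1000,
                       "value": range(2000)})
    small = to_categorical(df, ["sensor"])

    # deep=True also counts the memory held by the strings themselves.
    print("before:", df.memory_usage(deep=True).sum(), "bytes")
    print("after: ", small.memory_usage(deep=True).sum(), "bytes")
```

Whether this helps depends on the data: a column where almost every value is unique will not shrink and may even grow slightly.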

If your workflow involves "read many times" I strongly advise you convert from Excel to a format more Pandas-friendly, such as CSV, HDF5, or Pickle.
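
For the "read many times" case, one possible pattern is to pay the Excel cost once and cache the combined DataFrame as a pickle. This is a sketch under assumptions: load_cached, the cache file name, and the build callback are all hypothetical names, and build stands for whatever slow loading loop you already have.

```python
from pathlib import Path

import pandas as pd


def load_cached(cache_path, build):
    """Load a DataFrame from cache_path, or build it once and cache it.

    build is a zero-argument callable that returns the DataFrame,
    for example a function wrapping the slow Excel loading loop.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result of an earlier run.
        return pd.read_pickle(cache_path)
    df = build()
    df.to_pickle(cache_path)
    return df
```

A call might look like `df = load_cached('all_files.pkl', load_all_excel_files)`, where load_all_excel_files is your own function. The cache is never refreshed automatically, so delete the pickle whenever the source files change, and only unpickle files you created yourself.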

1 Comment

Thanks, jpp! Appreciate your response.
