Python: loop over the sheets of an Excel file and change the header row number for the first sheet

Question

I have a loop that counts the rows in each sheet of an .xls file. When I open the file itself, the counts do not match what Python returns.

This is because the header of the first sheet is in row 3. How can I alter my code to read ONLY the first sheet starting at row 3, ignoring the first two lines? The rest of my sheets ALWAYS start at the top row and contain no header. I would like to count the length of my first sheet without the header included.

However, when I open my Excel file and count the rows in each sheet, I get:

65522, header starts in row 3, expecting a count of 65520
65520
65520
65520
65520
65520
65520
65520
65520
65520
65520
25427

My full code:

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import os
import pandas as pd
from os import walk


def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        #data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)
        
        archive = ZipFile(file)

        # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
        files_in_archive = archive.namelist()
        excel_files_in_archive = [
            f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
        ]
        # ensure we only have one file (otherwise, loop or choose one somehow)
        assert len(excel_files_in_archive) == 1

        # read in data
        data_mapping = pd.read_excel(
            BytesIO(archive.read(excel_files_in_archive[0])),
            sheet_name=None, header=None,
        )

        
        
        # count the rows of every sheet, trimming the top of the first sheet
        row_counts = []
        for sheet in list(data_mapping.keys()):
            if sheet == 'Sheet1':
                df = data_mapping.get(sheet)[3:]
            else:
                df = data_mapping.get(sheet)
            row_counts.append(len(df))
            print(len(data_mapping.get(sheet)))

        file_mapping.update({file: sum(row_counts)})

    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]

    return frame



dir_path = r'D:\test\2022 - 10'

# collect every .zip file under dir_path, then count the rows per archive
zip_files = []
for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.zip'):
            zip_files.append(os.path.join(root, file))
df = process_files(zip_files)

Does anyone have an idea what I'm doing wrong?

grantr · Accepted Answer · 2023-01-09 15:00:33Z

You just need to use the skiprows argument: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

# read in data; with sheet_name=None, skiprows is applied to every sheet
data_mapping = pd.read_excel(
    BytesIO(archive.read(excel_files_in_archive[0])),
    sheet_name=None, header=None, skiprows=2,
)

or don't use skiprows and then slice the first sheet's dataframe directly:

row_counts = []
for sheet in data_mapping.keys():
    if sheet == 'name of first sheet':
        # [3:] drops the first three rows (two leading lines plus the header row)
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))  # print the trimmed length, not the length of the original sheet

## or based on the position in the dict; you don't need to call list() on .keys()
# enumerate() yields (index, key), so the index comes first
for i, sheet in enumerate(data_mapping.keys()):
    if i == 0:
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))
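
The position-based variant can also be wrapped in a small helper. The following is only a sketch under assumptions: count_rows and its leading_rows parameter are hypothetical names, and the in-memory DataFrames stand in for the dict that pd.read_excel(..., sheet_name=None, header=None) would return.

```python
import pandas as pd


def count_rows(sheets: dict, leading_rows: int = 3) -> list:
    """Return one row count per sheet, trimming only the first sheet.

    `sheets` is assumed to be the dict returned by
    pd.read_excel(..., sheet_name=None, header=None).
    `leading_rows` (hypothetical) is the number of rows at the top of
    the first sheet that should not be counted.
    """
    counts = []
    for i, frame in enumerate(sheets.values()):
        if i == 0:
            # positional slice: drop the first `leading_rows` rows
            frame = frame.iloc[leading_rows:]
        counts.append(len(frame))
    return counts


# In-memory stand-in for a workbook: the first sheet has two title rows
# and a header row above 5 data rows; the second sheet has 4 data rows.
sheets = {
    "Sheet1": pd.DataFrame([["title"], [None], ["header"], [1], [2], [3], [4], [5]]),
    "Sheet2": pd.DataFrame([[1], [2], [3], [4]]),
}
print(count_rows(sheets))  # [5, 4]
```

Whether leading_rows should be 2 or 3 depends on whether the header row itself should be counted, so it is left as a parameter here.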

edited Jan 9, 2023 at 15:00

answered Jan 9, 2023 at 1:31

grantr

6 Comments

Jonnyboi Over a year ago

Thanks, but this will then apply to all my sheets, right? How can I get skiprows to apply to only my first sheet?

grantr Over a year ago

Ahh, yeah, you're right. I just updated the answer with a potential solution based on the sheet name or its position in the list (you wouldn't use skiprows with it).
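
If skiprows itself should apply to the first sheet only, one option (not the approach taken in the answer) is to parse the sheets one at a time through pandas.ExcelFile, which exposes sheet_names and a parse() method that accepts skiprows. A minimal sketch, where count_rows_per_sheet and first_sheet_skiprows are hypothetical names:

```python
def count_rows_per_sheet(workbook, first_sheet_skiprows=2):
    """Count rows sheet by sheet, skipping rows on the first sheet only.

    `workbook` is assumed to be an open pandas.ExcelFile, for example:
        import pandas as pd
        from io import BytesIO
        workbook = pd.ExcelFile(BytesIO(archive.read(excel_files_in_archive[0])))
    """
    counts = {}
    for i, name in enumerate(workbook.sheet_names):
        # only the first sheet has extra lines at the top
        skip = first_sheet_skiprows if i == 0 else 0
        frame = workbook.parse(name, header=None, skiprows=skip)
        counts[name] = len(frame)
    return counts
```

With header=None the header row is still counted as a row, so first_sheet_skiprows would need to be 3 rather than 2 to exclude it as well.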

Jonnyboi Over a year ago

Sorry for the late reply. Here is my error on your line row_counts.append(len(df)): I get TypeError: object of type 'NoneType' has no len(). I'll update my code in the question.

grantr Over a year ago

In your full code example, I see you need to change the line if sheet == 'name of first sheet': to use the actual sheet name, provided the first sheet always has a consistent name and is the one you want to trim. Otherwise, you could use the other block I gave you, which uses enumerate() and the index to identify the first sheet.

Jonnyboi Over a year ago

OK, I updated my code in the question. You're right, I missed that part. For some reason I am still getting 65522 for the first sheet, which is very annoying as it should be 65520, and it looks like it starts at row 3. Any ideas? I'm going to debug further right now.
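
One possible explanation, going only by the code shown in the question: the print call reports len(data_mapping.get(sheet)), which is the untrimmed sheet, while the sliced frame df is what goes into row_counts. Slicing returns a new object and leaves the dict entry unchanged. A small sketch with a made-up 7-row sheet (the data is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for one sheet: 2 leading rows, 1 header row, 4 data rows.
sheet = pd.DataFrame([["title"], [None], ["header"], [1], [2], [3], [4]])
data_mapping = {"Sheet1": sheet}

# Slicing produces a trimmed frame; the entry stored in the dict is untouched.
df = data_mapping.get("Sheet1")[3:]

print(len(data_mapping.get("Sheet1")))  # 7: the original sheet, as printed in the question
print(len(df))                          # 4: the trimmed frame that is appended to row_counts
```

Printing len(df) instead would show the value that is actually summed.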
