
I have a loop that counts the rows in each sheet of an xls file. When I open the xls file itself, the counts do not match what Python returns.

This is because the header of the first sheet is in row 3. How can I alter my code to read ONLY the first sheet starting at row 3, ignoring the first two lines? The rest of my sheets ALWAYS start at the top row and contain no header. I would like to count the length of my first sheet without the header.

However, when I open the Excel file and count the rows in each sheet, I get:

65522 (header starts in row 3; expecting a count of 65520)
65520
65520
65520
65520
65520
65520
65520
65520
65520
65520
25427

My full code:

from io import BytesIO
from pathlib import Path
from zipfile import ZipFile
import os
import pandas as pd
from os import walk


def process_files(files: list) -> pd.DataFrame:
    file_mapping = {}
    for file in files:
        #data_mapping = pd.read_excel(BytesIO(ZipFile(file).read(Path(file).stem)), sheet_name=None)
        
        archive = ZipFile(file)

        # find file names in the archive which end in `.xls`, `.xlsx`, `.xlsb`, ...
        files_in_archive = archive.namelist()
        excel_files_in_archive = [
            f for f in files_in_archive if Path(f).suffix[:4] == ".xls"
        ]
        # ensure we only have one file (otherwise, loop or choose one somehow)
        assert len(excel_files_in_archive) == 1

        # read in data
        data_mapping = pd.read_excel(
            BytesIO(archive.read(excel_files_in_archive[0])),
            sheet_name=None, header=None,
        )

        
        
        row_counts = []
        for sheet in list(data_mapping.keys()):
            if sheet == 'Sheet1':
                df = data_mapping.get(sheet)[3:]
            else:
                df = data_mapping.get(sheet)
            row_counts.append(len(df))
            print(len(data_mapping.get(sheet)))

        file_mapping.update({file: sum(row_counts)})

    frame = pd.DataFrame([file_mapping]).transpose().reset_index()
    frame.columns = ["file_name", "row_counts"]

    return frame



dir_path = r'D:\test\2022 - 10'

zip_files = []
for root, dirs, files in os.walk(dir_path):
    for file in files:
        if file.endswith('.zip'):
            zip_files.append(os.path.join(root, file))
df = process_files(zip_files)   #function

Does anyone have an idea what I'm doing wrong?

1 Answer


You can use the skiprows argument of read_excel (https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html). Note that with sheet_name=None it is applied to every sheet, not just the first:

# read in data
data_mapping = pd.read_excel(
     BytesIO(archive.read(excel_files_in_archive[0])),
     sheet_name=None, header=None, skiprows=2
)

Or leave out skiprows and slice the first sheet's DataFrame directly:

row_counts = []
for sheet in list(data_mapping.keys()):
    if sheet == 'name of first sheet':
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))  # length after trimming, not of the unsliced sheet

## or based on the position in the dict; you don't need to call list() on .keys()
for i, sheet in enumerate(data_mapping.keys()):  # enumerate yields (index, key)
    if i == 0:  # first sheet only
        df = data_mapping.get(sheet)[3:]
    else:
        df = data_mapping.get(sheet)
    row_counts.append(len(df))
    print(len(df))  # length after trimming, not of the unsliced sheet
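The position-based approach can also be wrapped in a small helper. This is only a sketch: count_rows and first_sheet_skip are hypothetical names, not part of pandas, and the number of rows to drop (2 here) is an assumption that depends on whether the header row itself should be counted. The helper only needs len(), so it works on the dict of DataFrames returned by pd.read_excel(..., sheet_name=None); plain lists stand in for the sheets below so the example runs without an Excel file.

```python
def count_rows(sheets, first_sheet_skip):
    """Return {sheet_name: row_count}, ignoring leading rows on the first sheet only.

    `sheets` is any ordered mapping of sheet name to an object supporting len(),
    e.g. the dict returned by pd.read_excel(..., sheet_name=None, header=None).
    `first_sheet_skip` is the number of leading rows to ignore on the first sheet.
    """
    counts = {}
    for i, (name, frame) in enumerate(sheets.items()):
        # enumerate yields (index, item); only index 0 is the first sheet
        skip = first_sheet_skip if i == 0 else 0
        counts[name] = max(len(frame) - skip, 0)
    return counts


# Stand-in data: lists of rows instead of DataFrames (assumption for illustration).
demo_sheets = {
    "Sheet1": [["title"], [], ["x"], ["y"]],  # two leading rows to ignore
    "Sheet2": [["a"], ["b"]],
}
demo_counts = count_rows(demo_sheets, first_sheet_skip=2)
print(demo_counts)
print(sum(demo_counts.values()))
```

Because the helper reports the count after trimming, it avoids the trap of printing the length of the unsliced sheet, which would still show the original row count for the first sheet.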

6 Comments

Thanks, but this will then apply to all my sheets, right? How can I get skiprows to apply to only my first sheet?
Ahh, yeah, you're right. I just updated with a potential solution based on sheet name or location in the list (you wouldn't use skiprows with it).
Sorry for the late reply. Here is my error on your line row_counts.append(len(df)): I get TypeError: object of type 'NoneType' has no len(). I'll update my code in the question.
In your full code example, I see you need to change the line if sheet == 'name of first sheet': to use the actual name, if the first sheet always has a consistent name and is the one you want to trim. Otherwise you could use the other block I gave you, which uses enumerate() and the index to identify the first sheet.
OK, I updated my code in the question. You're right, I missed that part. For some reason I'm still getting 65522 for the first sheet, which is very annoying as it should be 65520, and it looks like it starts at row 3. Any ideas? I'm going to debug further right now.