Loading Excel file chunk by chunk with Python instead of loading full file into memory

Question

I want to read just 10 lines from Excel files (xlsx) without loading the whole file at once, as it can't be done on one of my machines (low memory).

I tried using

import xlrd
import pandas as pd
def open_file(path):
    xl = pd.ExcelFile(path)
    reader = xl.parse(chunksize=1000)
    for chunk in reader:
        print(chunk)

It seems like the file is loaded first then divided into parts.

How to read only first lines?

you're clearly using pandas, but haven't specified it - are you showing the full code? — Barnabus
– Barnabus, Commented Nov 23, 2017 at 12:30
although not strictly adhering to pandas, my advise would be to use SFrame which automatically does out of core computation and is fast. Install using pip install --U sframe — Vivek Kalyanarangan
– Vivek Kalyanarangan, Commented Nov 23, 2017 at 12:35
Barnabus there is no need for showing more code. I just want to load ONLY n rows. Edit: @VivekKalyanarangan Thanks for answer. I will check it out. — Kornel
– Kornel, Commented Nov 23, 2017 at 12:38
xslx is a zip file. It doesn't make it easier to read just a few lines.Could you read a CSV file instead? — Eric Duminil
– Eric Duminil, Commented Nov 23, 2017 at 12:56

DeepSpace · Accepted Answer · 2017-11-23 15:07:49Z

7

Due to the nature of xlsx files (which are essentially a bunch of xml files zipped together) you can't poke the file at an arbitrary byte and hope for it to be the beginning of Nth row of the table in the sheet you are interested in.

The best you can do is use pandas.read_excel with the skiprows (skips rows from the top of the file) and skip_footer (skips rows from the bottom) arguments. This however will load the whole file to memory first and then parse the required rows only.

# if the file contains 300 rows, this will read the middle 100
df = pd.read_excel('/path/excel.xlsx', skiprows=100, skip_footer=100,
                   names=['col_a', 'col_b'])

Note that you have to set the headers manually with the names argument otherwise the column names will be the last skipped row.

If you wish to use csv instead then it is a straightforward task since csv files are plain-text files.

But, and it's a big but, if you are really desperate you can extract the relevant sheet's xml file from the xlsx archive and parse that. It's not going to be an easy task though.

An example xml file that represents a sheet with a single 2 X 3 table. The <v> tags represent the cells' value.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
    <dimension ref="A1:B3"/>
    <sheetViews>
        <sheetView tabSelected="1" workbookViewId="0">
            <selection activeCell="C10" sqref="C10"/>
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultColWidth="11" defaultRowHeight="14.25" x14ac:dyDescent="0.2"/>
    <sheetData>
        <row r="1" spans="1:2" ht="15.75" x14ac:dyDescent="0.2">
            <c r="A1" t="s">
                <v>1</v>
            </c><c r="B1" s="1" t="s">
                <v>0</v>
            </c>
        </row>
        <row r="2" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A2" s="2">
                <v>1</v>
            </c><c r="B2" s="2">
                <v>4</v>
            </c>
        </row>
        <row r="3" spans="1:2" ht="15" x14ac:dyDescent="0.2">
            <c r="A3" s="2">
                <v>2</v>
            </c><c r="B3" s="2">
                <v>5</v>
            </c>
        </row>
    </sheetData>
    <pageMargins left="0.75" right="0.75" top="1" bottom="1" header="0.5" footer="0.5"/>
</worksheet>

edited Nov 23, 2017 at 15:07

answered Nov 23, 2017 at 12:49

DeepSpace

82.1k12 gold badges119 silver badges166 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Kornel Over a year ago

Problem is, that it loads whole file, then splits it to chunks. Thanks for answer, but it also is not what I am looking for.

DeepSpace Over a year ago

@Kornel See my updated answer, I've added some info.

Kornel Over a year ago

thank you for your answer. Nice bunch of information worth knowing. I will mark this as an answer to my question.

Viswanath Lekshmanan Over a year ago

@DeepSpace How could you decode the xlsx file ? It seems like its binary

DeepSpace Over a year ago

@ViswanathLekshmanan I didn't decode anything. I merely extracted the relevant sheet's XML file from the XLSX file, which, as explained in my answer, is "just" a fancy archive file. Try to open one with 7zip for example

|

Collectives™ on Stack Overflow

Loading Excel file chunk by chunk with Python instead of loading full file into memory

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related