Read in .xlsx with csv module in python

Question

I'm trying to read an Excel file in .xlsx format with the csv module, but I'm not having any luck, even with the dialect and encoding specified. Below are my attempts and the errors I get with each encoding I tried. If anyone could point me to the correct code, syntax, or module for reading a .xlsx file in Python, I'd appreciate it.

With the below code, I get the following error: _csv.Error: line contains NULL byte

#!/usr/bin/python

import sys, csv

with open('filelocation.xlsx', "r+", encoding="Latin1")  as inputFile:
    csvReader = csv.reader(inputFile, dialect='excel')
    for row in csvReader:
        print(row)

With the below code, I get the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 16: invalid continuation byte

#!/usr/bin/python

import sys, csv

with open('filelocation.xlsx', "r+", encoding="utf-8")  as inputFile:
    csvReader = csv.reader(inputFile, dialect='excel')
    for row in csvReader:
        print(row)

When I use utf-16 in the encoding, I get the following error: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 570-571: illegal UTF-16 surrogate

You cannot read an xlsx file using the csv module. The excel dialect only means that you can read CSV files that were created using Excel.
– Selcuk, Commented Mar 2, 2016 at 10:32
There are lots of modules you can find using a Google search, but you should test them to see if they fit your use case.
– Selcuk, Commented Mar 2, 2016 at 10:57
The wording of the question is impeccable; it really helped me, even without reading any answer. Thanks.
– Moisés Briseño Estrello, Commented Mar 8, 2021 at 9:10
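
Some background on why every encoding fails: an .xlsx file is a ZIP archive of XML parts, not delimited text, so the csv module only ever sees compressed binary data. The sketch below is one way to check this with the standard library. The helper name describe_xlsx_container is made up for illustration and is not part of any library.

```python
import zipfile


def describe_xlsx_container(path):
    """Return the archive member names if `path` is a ZIP container, else None.

    Hypothetical helper for illustration only.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Called on a real workbook, this typically lists members such as xl/workbook.xml and xl/worksheets/sheet1.xml, while a genuine CSV file returns None.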

Martin Evans · Accepted Answer · 2021-01-29 12:31:52Z


You cannot use Python's csv library for reading xlsx formatted files. You need to install and use a different library. For example, you could use openpyxl as follows:

import openpyxl

wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active

for row in ws.iter_rows(values_only=True):
    print(row)

This would display each row in the file as a tuple of cell values. The Python Excel website (python-excel.org) gives other possible examples.

Alternatively you could create a list of rows:

import openpyxl

wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active

data = list(ws.iter_rows(values_only=True))

print(data)
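
If the end goal is CSV output, the row tuples produced above can be handed to csv.writer. This is only a sketch under that assumption; write_rows_to_csv is a hypothetical helper name, and it accepts any iterable of rows so it does not depend on openpyxl itself.

```python
import csv


def write_rows_to_csv(rows, csv_path):
    """Write an iterable of row tuples to `csv_path` and return the row count.

    Hypothetical helper for illustration. csv.writer writes None as an
    empty string, so empty cells become empty fields.
    """
    count = 0
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

With the worksheet from the snippets above, a call might look like write_rows_to_csv(ws.iter_rows(values_only=True), "filelocation.csv"), after which the csv module can read the result as usual.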

Note: If you are using the older Excel format .xls, you could instead use the xlrd library. Since version 2.0, xlrd no longer supports the .xlsx format, so the example below expects an .xls file.

import xlrd

workbook = xlrd.open_workbook("filelocation.xls")
sheet = workbook.sheet_by_index(0)
data = [sheet.row_values(rowx) for rowx in range(sheet.nrows)]

print(data)

edited Jan 29, 2021 at 12:31

answered Mar 2, 2016 at 10:59


2 Comments

John Y Over a year ago

I like that you recommended xlrd, as I believe it's the best Excel reader. But if you want a list of the values in a row, that is more easily (and probably more efficiently) accomplished by simply cols = sheet.row_values(row) instead of your list comprehension. Also, I recommend using some other name for the row index (I'm partial to rx; you'll see a lot of examples use rowx) because the name row so often refers to a row object.

Shep Sims Over a year ago

You can no longer use xlrd on xlsx files. To read xlsx files, either (1, not recommended unless you have an outstanding reason to stick with xlrd) convert the file into xls manually before processing, (2, recommended) use the newer openpyxl Python module, as recommended at python-excel.org, or (3, also recommended) use pandas (pd.read_excel or pd.ExcelFile).

Collin Anderson · Accepted Answer · 2020-01-29 19:37:46Z

Here's a very very rough implementation using just the standard library.

def xlsx(fname, sheet=1):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet%s.xml' % sheet)):
        if el.tag.endswith('}v'):  # <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            column_name = ''.join(x for x in el.attrib['r'] if not x.isdigit())  # AZ22
            row[column_name] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

(This is copied from a deleted question: https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python )
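
The function above returns each row as a dict keyed by column letters, so empty cells are simply missing. A rough sketch of turning such a dict into an ordered list (for example, to pass to csv.writer) follows. Both helper names are invented for illustration, and the code assumes keys contain only the letters A to Z.

```python
def column_index(letters):
    """Convert Excel column letters to a 1-based index ('A' is 1, 'AA' is 27)."""
    index = 0
    for ch in letters.upper():
        # Treat the letters as base-26 digits where A=1 ... Z=26
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def row_to_list(row):
    """Turn a {'A': ..., 'C': ...} row dict into a list, padding gaps with ''."""
    if not row:
        return []
    width = max(column_index(letters) for letters in row)
    cells = [""] * width
    for letters, value in row.items():
        cells[column_index(letters) - 1] = value
    return cells
```

For example, [row_to_list(r) for r in xlsx("filelocation.xlsx")] would give positional rows, with '' wherever a cell was absent.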

Collin Anderson · Accepted Answer · 2020-01-29 18:45:30Z

Here's a very very rough implementation using just the standard library.

def xlsx(fname):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
        if el.tag.endswith('}v'):  # <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            letter = el.attrib['r'] # AZ22
            while letter[-1].isdigit():
                letter = letter[:-1]
            row[letter] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

This answer is copied from a deleted question: https://stackoverflow.com/a/22067980/131881

unorichardson · Accepted Answer · 2021-05-21 16:16:29Z


You cannot use Python's csv library for reading .xlsx formatted files. You also can't use "pd.read_excel" when it relies on xlrd as its engine, which is a travesty (xlrd 2.0 and later only supports .xls). Below is a function I created to import .xlsx files. It uses the first row of the file as the column names. Pretty straightforward.

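import openpyxl
import pandas as pd

def import_xlsx(filepath):
    # data_only=True returns cached formula results rather than formula text
    wb = openpyxl.load_workbook(filename=filepath, data_only=True)
    ws = wb.active
    rows = list(ws.iter_rows(values_only=True))
    # The first row supplies the column names; the remaining rows are the data
    return pd.DataFrame(rows[1:], columns=list(rows[0]))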

Example:

new_df = import_xlsx('C:\\Users\\big_boi\\documents\\my_file.xlsx')

answered May 21, 2021 at 16:16


1 Comment

brec Over a year ago

As of pandas 1.3.0, read_excel supports .xlsx via the default openpyxl engine.
