21

I'm trying to read an Excel file in .xlsx format with the csv module, but I can't get it to work, even with the dialect and encoding specified. Below are my attempts and the errors I get with each encoding I tried. If anyone could point me to the correct code, syntax, or module for reading an .xlsx file in Python, I'd appreciate it.

With the below code, I get the following error: _csv.Error: line contains NULL byte

#!/usr/bin/python

import sys, csv

with open('filelocation.xlsx', "r+", encoding="Latin1")  as inputFile:
    csvReader = csv.reader(inputFile, dialect='excel')
    for row in csvReader:
        print(row)

With the below code, I get the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 16: invalid continuation byte

#!/usr/bin/python

import sys, csv

with open('filelocation.xlsx', "r+", encoding="utf-8")  as inputFile:
    csvReader = csv.reader(inputFile, dialect='excel')
    for row in csvReader:
        print(row)

When I use utf-16 in the encoding, I get the following error: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 570-571: illegal UTF-16 surrogate

  • You cannot read an xlsx file using the csv module. The excel dialect only means that you can read CSV files that were created using Excel. Commented Mar 2, 2016 at 10:32
  • do you know of any modules that can read .xlsx files? Commented Mar 2, 2016 at 10:57
  • There are lots of modules you can find using a Google search, but you should test them to see if they fit your use case. Commented Mar 2, 2016 at 10:57
  • Redaction of the question is impeccable, it really helped me, even without reading any answer. Thanks. Commented Mar 8, 2021 at 9:10

4 Answers

34

You cannot use Python's csv library for reading xlsx formatted files. You need to install and use a different library. For example, you could use openpyxl as follows:

import openpyxl

wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active

for row in ws.iter_rows(values_only=True):
    print(row)

This would display all of the rows in the file as tuples of cell values. The Python Excel website gives other possible examples.


Alternatively you could create a list of rows:

import openpyxl

wb = openpyxl.load_workbook("filelocation.xlsx")
ws = wb.active

data = list(ws.iter_rows(values_only=True))

print(data)

Note: If you are using the older Excel format .xls, you could instead use the xlrd library. As of version 2.0, xlrd no longer supports the .xlsx format, so the example below reads an .xls file.

import xlrd

workbook = xlrd.open_workbook("filelocation.xls")
sheet = workbook.sheet_by_index(0)
data = [sheet.row_values(rowx) for rowx in range(sheet.nrows)]

print(data)
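
As background on why the csv module fails in the question: an .xlsx file is a ZIP archive of XML parts rather than delimited text, which is where the NULL bytes and decode errors come from. The following is a minimal standard-library sketch of a pre-check you could run before choosing a reader. The helper name `looks_like_xlsx` is hypothetical, and checking for `[Content_Types].xml` is a heuristic that matches any Office Open XML package (including .docx), not a full validation.

```python
import zipfile


def looks_like_xlsx(path):
    """Return True if `path` is a ZIP archive containing a content-types part."""
    # A CSV or other plain-text file is not a ZIP archive at all.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Office Open XML packages carry a [Content_Types].xml part at the root.
        return "[Content_Types].xml" in archive.namelist()
```

If this returns False for a file named .xlsx, the file may actually be CSV or the older binary .xls format with the wrong extension.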

2 Comments

I like that you recommended xlrd, as I believe it's the best Excel reader. But if you want a list of the values in a row, that is more easily (and probably more efficiently) accomplished by simply cols = sheet.row_values(row) instead of your list comprehension. Also, I recommend using some other name for the row index (I'm partial to rx; you'll see a lot of examples use rowx) because the name row so often refers to a row object.
You can no longer use xlrd on xlsx files. To read xlsx files, either:
  • (not recommended unless you have an outstanding reason to stick with xlrd) convert the file into xls manually before processing
  • (recommended) use the newer openpyxl Python module, as recommended at python-excel.org
  • (also recommended) use pandas (read_excel or ExcelFile)
5

Here's a very very rough implementation using just the standard library.

def xlsx(fname, sheet=1):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet%s.xml' % sheet)):
        if el.tag.endswith('}v'):  # <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            column_name = ''.join(x for x in el.attrib['r'] if not x.isdigit())  # AZ22
            row[column_name] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

(This is copied from a deleted question: https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python )
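
The function above returns each row as a dict keyed by column letters, and empty cells are simply absent. If you would rather have positional lists, something like the following could be layered on top. This is only a sketch: `column_index` and `row_to_list` are hypothetical helper names, not part of the original answer, and missing cells are filled with None by assumption.

```python
def column_index(letters):
    """Convert a column name such as 'A' or 'AZ' to a 0-based index."""
    index = 0
    for ch in letters.upper():
        # Column names are a base-26 number with digits A=1 .. Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def row_to_list(row):
    """Turn a {column_letters: value} dict into a list; gaps become None."""
    if not row:
        return []
    width = max(column_index(name) for name in row) + 1
    values = [None] * width
    for name, value in row.items():
        values[column_index(name)] = value
    return values
```

For example, `[row_to_list(r) for r in xlsx("filelocation.xlsx")]` would give a list of lists, assuming the `xlsx` function above.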


2

Here's a very very rough implementation using just the standard library.

def xlsx(fname):
    import zipfile
    from xml.etree.ElementTree import iterparse
    z = zipfile.ZipFile(fname)
    strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
    rows = []
    row = {}
    value = ''
    for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
        if el.tag.endswith('}v'):  # <v>84</v>
            value = el.text
        if el.tag.endswith('}c'):  # <c r="A3" t="s"><v>84</v></c>
            if el.attrib.get('t') == 's':
                value = strings[int(value)]
            letter = el.attrib['r'] # AZ22
            while letter[-1].isdigit():
                letter = letter[:-1]
            row[letter] = value
            value = ''
        if el.tag.endswith('}row'):
            rows.append(row)
            row = {}
    return rows

This answer is copied from a deleted question: https://stackoverflow.com/a/22067980/131881
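
The loop that strips trailing digits keeps only the column letters and discards the row number. If both parts are needed, a regular expression can split the reference in one step. A small sketch, where `split_cell_ref` is a hypothetical helper name and the pattern assumes plain references such as AZ22 (as found in the `r` attribute):

```python
import re

_CELL_REF = re.compile(r"([A-Za-z]+)(\d+)")


def split_cell_ref(ref):
    """Split a cell reference like 'AZ22' into its column letters and row number."""
    match = _CELL_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    return match.group(1).upper(), int(match.group(2))
```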


-2

You cannot use Python's csv library to read .xlsx files. pd.read_excel can also fail on .xlsx in setups that fall back to xlrd, which only supports .xls. Below is a function I wrote to import .xlsx files with openpyxl. It uses the first row of the file as the column names.

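import openpyxl
import pandas as pd

def import_xlsx(filepath):
    # data_only=True returns cached formula results instead of formula text
    wb = openpyxl.load_workbook(filename=filepath, data_only=True)
    ws = wb.active
    rows = list(ws.iter_rows(values_only=True))
    # The first row supplies the column names; the remaining rows are the data
    return pd.DataFrame(rows[1:], columns=list(rows[0]))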

Example:

new_df = import_xlsx(r'C:\Users\big_boi\documents\my_file.xlsx')

1 Comment

As of pandas 1.3.0, read_excel supports .xlsx via the default openpyxl engine.
