2

I am trying to read an Excel file in Python without using pandas or xlrd, and I have been trying to convert the results from bytes to UTF-8 without any success.

data from xls file

colA    colB    colC
spc     1D0     20190705
spd     1D0     20190705
spe     1D0     20190705
... (goes on for 500k lines)

code

with open(file, 'rb') as f:
    data = f.readlines(1)  # Just to check the first line that is printed out
    print(data[0].decode('utf-8'))

The error I receive is UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

If I were to print data without decoding it, the result is: [b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00>\x00\x03\x00\xfe\xff\t\x00\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x9e\x00\x00\x00\x9dN\x00\x00\x00\x00\x00\x00\x00\x10\x00\x00\xfe\xff\xff\xff\x00\x00\x00\x00\xfeM\x00\x00\x01\x00\x00\x00\xffM\x00\x00\x00N\x00\x00\x01N\x00\x00\x02N\x00\x00\x03N\x00\x00\x04N\x00\x00\x05N\x00\x00\x06N\x00\x00\x07N\x00\x00\x08N\x00\x00\tN\x00\x00\n']

There is no particular reason why I want to avoid pandas or xlrd; I am just trying to parse the data with only the standard library, if that is possible.

Any thoughts?

3
  • The error says there is a specific byte in the Excel file that cannot be decoded as 'utf-8'. Try using a different encoding, but it's still not known what sort of characters may be lurking in the document. Perhaps you should give pandas a try: pd.read_excel(file) and see what you get. Commented Jul 8, 2019 at 8:10
  • 3
    Excel is a binary format, not plain-text. If you don't want to use xlrd or pd.read_excel, you'll have to reimplement what those libraries do. Commented Jul 8, 2019 at 8:11
  • 1
    Even if you want to parse .xlsx files, which are considerably easier than .xls, you still have quite a bit of work to do. I guess you are doing this as a learning exercise? If so, then I think you should take a look at this question to find out where to read about the .xlsx specifications. If you are truly trying to learn about .xls files, I urge you to reconsider. There are plenty of other things you could be learning about that are more useful and less painful. Commented Jul 12, 2019 at 21:30
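
The point made in the comments above (Excel files are binary containers, not text) can be checked directly from the first few bytes of the file. The following is a minimal sketch, not a full format detector: the function name sniff_container is hypothetical, and the signatures only identify the container type. An OLE2 compound file is what legacy .xls uses (but so do other old Office formats), and a ZIP archive is what .xlsx uses (but so does any other ZIP file).

```python
# Well-known container signatures (magic numbers).
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # OLE2 compound file, used by .xls
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header, used by .xlsx


def sniff_container(header):
    """Guess the container type from the leading bytes of a file.

    Hypothetical helper for illustration; it does not prove the file
    is actually a spreadsheet, only which container it is wrapped in.
    """
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


# Typical use: read only the first 8 bytes of the file.
# with open(path, "rb") as f:
#     kind = sniff_container(f.read(8))
```

The bytes printed in the question start with \xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1, which is the OLE2 signature, so the file appears to be a legacy .xls rather than an .xlsx.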

2 Answers

2

You need to unzip the xlsx file first, before you can read its contents (assuming that is the format you are using).


3 Comments

Ideally, you should show some code for how to do this (e.g. using the std-lib zipfile module) and then how to proceed once the xlsx archive is unpacked (which file to process, how to access the data of a cell, etc.)
It would probably be wise to wait for confirmation that xlsx is indeed the format the OP is trying to read before embarking on such an enterprise...
See also this comment in another thread, presenting a solution for reading an *.xlsx Excel file using just standard library functionality.
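
What the comments above ask for could look roughly like the following, assuming the file really is an .xlsx (a ZIP archive of XML parts). This is a simplified sketch using only zipfile and xml.etree.ElementTree, not a complete reader. The function name read_xlsx_rows is hypothetical, and the default sheet path xl/worksheets/sheet1.xml is an assumption: it is the usual location of the first sheet, but a robust reader would look it up via the workbook parts. Empty cells are skipped, so columns are not aligned, and dates and inline strings are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main SpreadsheetML parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_xlsx_rows(source, sheet_path="xl/worksheets/sheet1.xml"):
    """Yield each row of one worksheet as a list of strings.

    source may be a file path or a binary file-like object.
    sheet_path is an assumed default, not guaranteed for every workbook.
    """
    with zipfile.ZipFile(source) as zf:
        # Text cells usually store an index into a shared string table.
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.findall("m:si", NS):
                parts = si.iter("{%s}t" % NS["m"])
                shared.append("".join(t.text or "" for t in parts))

        sheet = ET.fromstring(zf.read(sheet_path))
        for row in sheet.iterfind("m:sheetData/m:row", NS):
            values = []
            for cell in row.findall("m:c", NS):
                v = cell.find("m:v", NS)
                text = v.text if v is not None and v.text else ""
                if cell.get("t") == "s" and text:
                    # t="s" marks a shared-string reference.
                    text = shared[int(text)]
                values.append(text)
            yield values
```

Calling list(read_xlsx_rows("book.xlsx")) on a small workbook returns the rows as lists of strings; numbers come back as the text stored in the XML. None of this applies to a legacy .xls file, which is not a ZIP archive.
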
-4

Try this

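import csv

# Raw string so the backslash in the Windows path is taken literally
with open(r'D:\dew.csv', 'rt', newline='') as f:
    # This will print every row one by one
    data = csv.reader(f)
    for r in data:
        print(r)
# No f.close() needed: the with block closes the file automatically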

1 Comment

From the description the OP has given (though they have not been specific), this does not appear to be answering the question posed. Your solution is for a text based file, the OP appears to be struggling with an (assumed) .xls or .xlsx file.
