Python docx - AttributeError: 'bytes' object has no attribute 'seek'

Question

What I have as input: the raw bytes of a docx document, base64-encoded.
What I am trying to achieve: extract text from this document for further processing.
I tried to follow this answer: extracting text from MS word files in python

My code fragment:

base64_bytes = input.encode('utf-8')
decoded_data = base64.decodebytes(base64_bytes)
document = Document(decoded_data)
docText = '\n\n'.join([paragraph.text.encode('utf-8') for paragraph in document.paragraphs])

The document = Document(decoded_data) line gives me the following error: AttributeError: 'bytes' object has no attribute 'seek'
The decoded_data is in the following format: b'PK\\x03\\x04\\x14\\x00\\x08\\x08\\x08\\x00\\x87@CP\\x00...

How should I format the raw data to extract text from docx?

input.encode('utf-8'). Is this your actual code? Because this is trying to encode the function object input as UTF-8
– clubby789, Commented Feb 6, 2020 at 11:10
1) Your title says "seek", your question says "code". Which is it? 2) What exactly is Document and what kind of argument does it expect?
– deceze ♦, Commented Feb 6, 2020 at 11:11
You say you are following the advice "Use the native Python docx module..." and then you do not follow it. You do not need to encode, decode, or even explicitly load the file 'manually'.
– Jongware, Commented Feb 6, 2020 at 11:13
@usr2564301 they only diverge where they have to: their input is in-memory base64 content rather than a file on disk.
– Masklinn, Commented Feb 6, 2020 at 11:17

Community · Accepted Answer · 2020-06-20 09:12:55Z

19

From the official documentation, emphasis mine:

docx.Document(docx=None)

Return a Document object loaded from docx, where docx can be either a path to a .docx file (a string) or a file-like object. If docx is missing or None, the built-in default document “template” is loaded.

So if you provide a string or string-like parameter, it is interpreted as the path to a docx file. To provide the contents from memory, you need to pass in a file-like object, such as a BytesIO instance (the entire point of StringIO and BytesIO being to "convert" strings and bytes to file-like objects):

document = Document(io.BytesIO(decoded_data))

Side note: you probably want to remove the .encode call in the list comprehension. In Python 3, text (str) and bytes (bytes) are not compatible at all, so the line is going to blow up when it tries to join bytes (encoded text) with a textual separator.
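
Putting both points together, here is a minimal sketch of the whole pipeline, assuming python-docx is installed and the input is a base64 string. The helper names decode_to_file_like and extract_docx_text are made up for illustration and are not part of any library:

```python
import base64
import io


def decode_to_file_like(b64_text: str) -> io.BytesIO:
    """Decode base64 text and wrap the raw bytes in a seekable file-like object."""
    raw = base64.b64decode(b64_text)
    return io.BytesIO(raw)


def extract_docx_text(b64_text: str) -> str:
    """Return the document's paragraph text, separated by blank lines."""
    # Imported lazily so the helper above also works without python-docx installed.
    from docx import Document

    # Pass the BytesIO wrapper, not the bare bytes object.
    document = Document(decode_to_file_like(b64_text))
    # paragraph.text is already str, so no .encode() before joining.
    return "\n\n".join(paragraph.text for paragraph in document.paragraphs)
```

Calling extract_docx_text(some_base64_string) should then return the text as a str; encode it only at the point where bytes are actually needed.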

edited Jun 20, 2020 at 9:12

answered Feb 6, 2020 at 11:13

Masklinn


3 Comments

Sean Richards Over a year ago

Doing this, I get an Exception: BadZipFile: File is not a zip file. I've tried creating a ZipFile class instance with the BytesIO(decoded_data) object but I get a different error doing so. Any thoughts?

Masklinn Over a year ago

That your file is not an actual docx? Possibly a "legacy" doc file? A docx is a zipfile, so internally docx.Document will open the zipfile and start parsing its content.

Sean Richards Over a year ago

The tough part here is that I turned a known .docx file into base64 to test passing it via a JSON request body. Ingesting the base64 string and decoding it into a BytesIO instance results in that error. Womp womp. I think I'll probably post a question. Thanks for the response!
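
For the BadZipFile case discussed in these comments, one way to narrow things down is to check the decoded payload with the standard library alone before handing it to python-docx. The sketch below is a heuristic, not a validator: the function name looks_like_docx is hypothetical, and it assumes the main document part lives at word/document.xml, which is the usual layout but is only a convention.

```python
import base64
import io
import zipfile


def looks_like_docx(b64_text: str) -> bool:
    """Heuristic: does the decoded payload open as a zip holding a docx body part?"""
    raw = base64.b64decode(b64_text)
    buffer = io.BytesIO(raw)
    # A legacy .doc file or a corrupted payload fails here.
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as archive:
        # Assumption: the main part sits at the conventional path.
        return "word/document.xml" in archive.namelist()
```

If this returns False for a file known to be a valid .docx, the bytes were probably altered somewhere between encoding and decoding, so comparing the decoded bytes against the original file is a reasonable next step.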
