Convert DOCX to HTML Programmatically Using Python

Question

I have already implemented HTML-to-DOCX conversion in Python: I parse the HTML with BeautifulSoup, traverse every tag recursively, and build the DOCX document with the python-docx library.

Now I want to do the reverse and convert a DOCX file to an HTML string. I have read about opening an existing document with python-docx (https://python-docx.readthedocs.io/en/latest/user/documents.html), but I could not find a way to traverse each document object and convert it to HTML.

Is there a way to do this kind of reverse parsing? I have tried https://pypi.org/project/docx2html/ and https://pypi.org/project/mammoth/, but both ignore some styles, so I would prefer to write the code myself rather than rely on a library.

Any help is greatly appreciated.

A possible solution is described here: stackoverflow.com/questions/125222/… and here: github.com/mwilliamson/python-mammoth
– Rufat, Commented Sep 6, 2019 at 7:14
It is also possible to use the MS Office SaveAs HTML function through the Windows COM (OLE) interface.
– Rufat, Commented Sep 6, 2019 at 7:44

Rufat · Accepted Answer · 2023-04-08 04:16:38Z

Here is a solution for converting DOCX to HTML through the Windows COM (OLE) interface of MS Office. It requires Windows with Microsoft Word installed and the pywin32 package:

import win32com.client
import win32com.client.dynamic


class WordSaveFormat:
    wdFormatNone = None
    wdFormatHTML = 8


class WordOle:
    def __init__(self, filename):
        self.filename = filename
        self.word_app = win32com.client.dynamic.Dispatch("Word.Application")
        self.word_doc = self.word_app.Documents.Open(filename)

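    def save(self, new_filename=None, word_save_format=WordSaveFormat.wdFormatNone):
        if new_filename:
            self.filename = new_filename
            if word_save_format is None:
                # No explicit format given: let Word choose its default.
                self.word_doc.SaveAs(new_filename)
            else:
                self.word_doc.SaveAs(new_filename, word_save_format)
        else:
            self.word_doc.Save()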

    def close(self):
        self.word_doc.Close(SaveChanges=0)
        # Quit Word so that no WINWORD.EXE process is left running.
        # Note: this ends the whole Word instance, not just this document.
        self.word_app.Quit()
        del self.word_app

    def show(self):
        self.word_app.Visible = 1

    def hide(self):
        self.word_app.Visible = 0


word_ole = WordOle("D:\\TestDoc.docx")
word_ole.show()
word_ole.save("D:\\TestDoc.html", WordSaveFormat.wdFormatHTML)
# word_ole.save( "D:\\TestDoc2.docx", WordSaveFormat.wdFormatNone )
word_ole.close()
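
Since the question asks for a hand-written traversal rather than a library or COM, here is a minimal sketch of that idea using only the standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The function names (docx_to_html, paragraph_to_html, run_to_html) are hypothetical, and the mapping is an assumption kept deliberately small: only bold, italic, and paragraph style ids of the form "Heading1" to "Heading6" are handled. Tables are flattened into plain paragraphs, and lists, images, and inherited styles are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# Main WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _flag(rpr, tag):
    """True if the run properties switch on a toggle such as w:b or w:i."""
    if rpr is None:
        return False
    el = rpr.find(W + tag)
    return el is not None and el.get(W + "val", "true") not in ("0", "false")


def run_to_html(run):
    """Convert one w:r element, honouring only direct bold and italic."""
    text = escape("".join(t.text or "" for t in run.iter(W + "t")))
    rpr = run.find(W + "rPr")
    if _flag(rpr, "i"):
        text = f"<em>{text}</em>"
    if _flag(rpr, "b"):
        text = f"<strong>{text}</strong>"
    return text


def paragraph_to_html(par):
    """Convert one w:p element; style ids like 'Heading2' become <h2>."""
    style = par.find(f"{W}pPr/{W}pStyle")
    style_id = style.get(W + "val", "") if style is not None else ""
    tag = "p"
    if style_id.startswith("Heading") and style_id[7:].isdigit():
        level = min(max(int(style_id[7:]), 1), 6)
        tag = f"h{level}"
    body = "".join(run_to_html(r) for r in par.iter(W + "r"))
    return f"<{tag}>{body}</{tag}>"


def docx_to_html(source):
    """Return an HTML fragment for a .docx given as a path or file object."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    # iter() also visits paragraphs nested in tables, so tables are flattened.
    return "\n".join(paragraph_to_html(p) for p in body.iter(W + "p"))
```

Calling docx_to_html("D:\\TestDoc.docx") would return the fragment as a string. Extending paragraph_to_html and run_to_html is where additional styles would be mapped, which is the part the existing libraries were reported to skip.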

edited Apr 8, 2023 at 4:16

answered Sep 6, 2019 at 8:28

3 Comments

sugarakis Over a year ago

Great solution, very helpful and clean code. Worked like a charm :)

sugarakis Over a year ago

Is it possible to convert to HTML while keeping the images embedded in the HTML file itself, instead of linked from an external folder?

gtomer Over a year ago

@sugarakis - did you find a solution to your question?
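
On the images question above, one possible approach is to post-process the HTML that Word saved and replace each linked image with a base64 data URI. The sketch below is an assumption-laden illustration, not something confirmed in this thread: inline_images is a hypothetical helper, it assumes the img tags use quoted, relative src paths into the folder Word writes next to the HTML file, and it uses a regular expression rather than a real HTML parser. It leaves absolute URLs and missing files untouched, and it does not handle the VML image markup Word may also emit.

```python
import base64
import mimetypes
import re
from pathlib import Path
from urllib.parse import unquote

# Matches a quoted src attribute inside an <img> tag (a simplification).
IMG_SRC = re.compile(
    r'(<img\b[^>]*?\bsrc\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)
# Anything with a scheme (http:, data:, C:) or starting with // is skipped.
HAS_SCHEME = re.compile(r"^(?:[a-z][a-z0-9+.-]*:|//)", re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace relative <img src> paths with base64 data URIs."""
    base_dir = Path(base_dir)

    def replace(match):
        prefix, quote, src = match.groups()
        if HAS_SCHEME.match(src):
            return match.group(0)  # not a relative path, keep as is
        path = base_dir / unquote(src)
        if not path.is_file():
            return match.group(0)  # unresolved link, keep as is
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{prefix}{quote}data:{mime};base64,{data}{quote}"

    return IMG_SRC.sub(replace, html)
```

To use it, read the saved HTML file into a string with the encoding declared in its meta tag, call inline_images(html, "D:\\") with the folder that contains the HTML file, and write the result back out.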
