I have already implemented HTML-to-DOCX conversion in Python: I parse the HTML with BeautifulSoup, traverse every tag recursively, and build the DOCX document with the python-docx library.

Now I want to do the reverse and convert a DOCX file to an HTML string. I have read about opening existing documents with python-docx (https://python-docx.readthedocs.io/en/latest/user/documents.html). However, I could not find a way to traverse the document's objects and convert them to an HTML string.

Is there a way to do this kind of reverse parsing? I have tried the docx2html (https://pypi.org/project/docx2html/) and mammoth (https://pypi.org/project/mammoth/) libraries, but both ignore some styles, so I would rather write the code myself than rely on a library.

Any help is greatly appreciated.

1 Answer

Here is a solution for converting DOCX to HTML through the Windows COM (OLE) interface to MS Office. It requires Windows, an installed copy of Word, and the pywin32 package:

import win32com.client
import win32com.client.dynamic


class WordSaveFormat:
    wdFormatNone = None
    wdFormatHTML = 8


class WordOle:
    def __init__(self, filename):
        self.filename = filename
        self.word_app = win32com.client.dynamic.Dispatch("Word.Application")
        self.word_doc = self.word_app.Documents.Open(filename)

    def save(self, new_filename=None, word_save_format=WordSaveFormat.wdFormatNone):
        if new_filename:
            self.filename = new_filename
            self.word_doc.SaveAs(new_filename, word_save_format)
        else:
            self.word_doc.Save()

    def close(self):
        # Close the document without saving changes.
        self.word_doc.Close(SaveChanges=0)
        # Quit Word so that no WINWORD.EXE process is left running in the background.
        self.word_app.Quit()
        del self.word_app

    def show(self):
        self.word_app.Visible = 1

    def hide(self):
        self.word_app.Visible = 0


word_ole = WordOle("D:\\TestDoc.docx")
word_ole.show()
word_ole.save("D:\\TestDoc.html", WordSaveFormat.wdFormatHTML)
# word_ole.save( "D:\\TestDoc2.docx", WordSaveFormat.wdFormatNone )
word_ole.close()
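
If you would rather write the traversal yourself, as the question asks, here is a minimal sketch of one possible approach. It is not a complete converter. It assumes only that a .docx file is a ZIP archive whose main content lives in word/document.xml, and it handles only paragraphs plus bold and italic set directly on a run. Named styles, tables (their paragraphs are simply flattened), lists and images are ignored. The function names docx_to_html, run_to_html and _is_on are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """Return True if a toggle property such as w:b or w:i is enabled."""
    if run_props is None:
        return False
    el = run_props.find(W + tag)
    if el is None:
        return False
    # <w:b/> means "on"; <w:b w:val="0"/> or "false" means "off".
    return el.get(W + "val", "true") not in ("0", "false")


def run_to_html(run):
    """Convert one w:r element to HTML, honouring direct bold/italic only."""
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    html = escape(text)
    props = run.find(W + "rPr")
    if _is_on(props, "i"):
        html = f"<em>{html}</em>"
    if _is_on(props, "b"):
        html = f"<strong>{html}</strong>"
    return html


def docx_to_html(path):
    """Return an HTML string with one <p> per paragraph in the body."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    parts = []
    for para in body.iter(W + "p"):
        runs = "".join(run_to_html(r) for r in para.iter(W + "r"))
        parts.append(f"<p>{runs}</p>")
    return "\n".join(parts)
```

Calling docx_to_html("D:\\TestDoc.docx") would return the HTML as a string. Style handling could be extended by adding more checks on each run's w:rPr and each paragraph's w:pPr.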

3 Comments

Great solution, very helpful and clean code. Worked like a charm :)
Is it possible to convert to HTML but also keeping any images embedded into the HTML file instead of being linked with an external folder?
@sugarakis - did you find a solution to your question?
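
On the embedded-images question, one possible approach (a sketch, not something confirmed in this thread) is to post-process the HTML file that Word saved and replace each locally linked image with a base64 data URI. The helper name inline_images is hypothetical, and the regular expression assumes plain <img src="..."> tags with double-quoted attributes. Other image markup that Word may emit is left untouched.

```python
import base64
import mimetypes
import re
from pathlib import Path
from urllib.parse import unquote


def inline_images(html, base_dir):
    """Replace local <img src="..."> references with base64 data URIs."""
    base_dir = Path(base_dir)

    def to_data_uri(match):
        src = match.group(2)
        img_path = base_dir / unquote(src)
        if src.startswith(("data:", "http:", "https:")) or not img_path.is_file():
            return match.group(0)  # leave remote, inline or missing images as they are
        mime = mimetypes.guess_type(img_path.name)[0] or "application/octet-stream"
        data = base64.b64encode(img_path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{data}{match.group(3)}"

    # Group 1: everything up to the opening quote, group 2: the path, group 3: closing quote.
    pattern = r'(<img\b[^>]*?\bsrc=")([^"]*)(")'
    return re.sub(pattern, to_data_uri, html, flags=re.IGNORECASE)
```

Here base_dir is the folder containing the saved HTML file, so that relative image paths resolve correctly.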
