I'm working on a Python project where I need to extract text from DOCX files, preserving the formatted numbering. I've encountered a peculiar issue that I'm hoping someone can help me solve.
The Problem
The extracted text needs to include the automatic list numbering (e.g., 1., 1.1, 1.1.1) as it appears in Word. However, none of the methods I've tried so far captures this numbering.
Interestingly, if I manually copy the entire text from the DOCX file and paste it into a new document as plain text, the formatted numbering is preserved. This suggests that the information is there, but I'm not able to access it programmatically.
What I've Tried
Using the python-docx library:

    from docx import Document

    doc = Document('my_file.docx')
    for para in doc.paragraphs:
        print(para.text)

This extracts the text but loses the numbering format.
Converting DOCX to HTML and then parsing:

    import mammoth
    from bs4 import BeautifulSoup

    with open("my_file.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)

    html = result.value
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    print(text)

This approach also fails to preserve the numbering.
Using docx2txt:

    import docx2txt

    text = docx2txt.process("my_file.docx")
    print(text)

Again, the numbering is lost.
What I'm Looking For
I'm seeking a method to programmatically extract text from a DOCX file that preserves the formatted numbering, similar to what happens when I manually copy and paste the text.
Ideally, I'm looking for a Python solution, but I'm open to other approaches if they can be integrated into a Python workflow.
Questions
- Is there a way to access the underlying numbering information in DOCX files using Python?
- Are there any libraries or tools that can handle this specific issue?
- If there's no direct way to do this in Python, are there any intermediate steps or file conversions that might help preserve this information?
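On the first question, the numbering does appear to be reachable at the XML level. A DOCX file is a ZIP archive, and list paragraphs in word/document.xml carry a w:numPr element whose w:numId and w:ilvl children identify the list and the nesting level. The sketch below reads those with only the standard library and rebuilds labels with simple counters. It is an approximation under loud assumptions: every level is treated as plain decimal joined with dots, and the real formats, start values, and restarts defined in word/numbering.xml are ignored, as is numbering applied indirectly through a paragraph style. The names `read_document_xml` and `numbered_paragraphs` are hypothetical helpers, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # fully qualified name of the w:val attribute


def read_document_xml(path):
    """Return the raw bytes of word/document.xml from a DOCX (ZIP) file."""
    with zipfile.ZipFile(path) as zf:
        return zf.read("word/document.xml")


def numbered_paragraphs(document_xml):
    """Yield (label, text) for each paragraph; label is "" if unnumbered.

    ASSUMPTION: all levels are decimal and rendered as "1", "1.1", "1.1.1".
    The actual format lives in word/numbering.xml and is not consulted here.
    """
    root = ET.fromstring(document_xml)
    counters = {}  # numId -> list of running counters, one per level
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_id_el = p.find("w:pPr/w:numPr/w:numId", NS)
        num_id = num_id_el.get(VAL) if num_id_el is not None else None
        if num_id in (None, "0"):  # numId 0 means "numbering removed"
            yield "", text
            continue
        ilvl_el = p.find("w:pPr/w:numPr/w:ilvl", NS)
        ilvl = int(ilvl_el.get(VAL, "0")) if ilvl_el is not None else 0

        levels = counters.setdefault(num_id, [])
        del levels[ilvl + 1:]          # deeper levels restart
        while len(levels) < ilvl:      # parents never seen: assume they start at 1
            levels.append(1)
        if len(levels) == ilvl:
            levels.append(1)           # first item at this level
        else:
            levels[ilvl] += 1          # next sibling at this level
        yield ".".join(str(n) for n in levels), text
```

Calling `numbered_paragraphs(read_document_xml("my_file.docx"))` and printing each label next to its text would give output closer to the copy-and-paste result, but only for documents that match the assumptions above. Reproducing Word's exact labels would still require interpreting word/numbering.xml.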
Any insights, suggestions, or solutions would be greatly appreciated. Thank you in advance for your help!