Capturing Formatted Numbering from DOCX Files in Python

Ask Question

Asked 1 year, 2 months ago

Modified 1 year, 2 months ago

Viewed 196 times

I'm working on a Python project where I need to extract text from DOCX files, preserving the formatted numbering. I've encountered a peculiar issue that I'm hoping someone can help me solve.

The Problem

I'm trying to extract text from DOCX files, including preserving the formatted numbering (e.g., 1., 1.1, 1.1.1, etc.). However, none of the methods I've tried so far can capture this numbering correctly.

Interestingly, if I manually copy the entire text from the DOCX file and paste it into a new document as plain text, the formatted numbering is preserved. This suggests that the information is there, but I'm not able to access it programmatically.

What I've Tried

Using python-docx library:

from docx import Document

doc = Document('my_file.docx')
for para in doc.paragraphs:
    print(para.text)

This extracts the text but loses the numbering format.

Converting DOCX to HTML and then parsing:

import mammoth
from bs4 import BeautifulSoup

with open("my_file.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
print(text)

This approach also fails to preserve the numbering.

Using docx2txt:

import docx2txt

text = docx2txt.process("my_file.docx")
print(text)

Again, the numbering is lost.

What I'm Looking For

I'm seeking a method to programmatically extract text from a DOCX file that preserves the formatted numbering, similar to what happens when I manually copy and paste the text.

Ideally, I'm looking for a Python solution, but I'm open to other approaches if they can be integrated into a Python workflow.

Questions

Is there a way to access the underlying numbering information in DOCX files using Python?
Are there any libraries or tools that can handle this specific issue?
If there's no direct way to do this in Python, are there any intermediate steps or file conversions that might help preserve this information?

Any insights, suggestions, or solutions would be greatly appreciated. Thank you in advance for your help!

asked Aug 23, 2024 at 23:43

Anshuman Sharma

2

SO is a question and answer site. That means one question per post. In addition, question #2 is a request for a tool or library recommendation, which is expressly indicated as being off-topic in the help center guidelines. You'll find your experiences here will be much better if you take the time to read those help center pages to learn how the site works before you begin posting.

Ken White
– Ken White

2024-08-23 23:54:19 +00:00
Commented Aug 23, 2024 at 23:54
1

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer.

Community
– Community Bot

2024-08-25 13:16:09 +00:00
Commented Aug 25, 2024 at 13:16
When it comes to read numbering from Microsoft Word files most free APIs are not complete and fail doing this task. This is due to the complex way numbering is stored in Word files. In stackoverflow.com/questions/71497815/… I have shown a way to do this using Apache POI. But Python-docx is much more incomplete in this regard. Therefore, there is no way to translate this approach directly to Python docx without requiring page-by-page code to read the numbering XML at a very low level.

Axel Richter
– Axel Richter

2024-08-26 04:20:07 +00:00
Commented Aug 26, 2024 at 4:20

Add a comment |

0 Your Answer

Sign up or log in

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Collectives™ on Stack Overflow

Capturing Formatted Numbering from DOCX Files in Python

The Problem

What I've Tried

What I'm Looking For

Questions

0

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

The Problem

What I've Tried

What I'm Looking For

Questions

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Linked