I'm working on a Python project that extracts text from DOCX files, and I need the extracted text to keep the list numbering that Word displays. I've run into a peculiar issue that I'm hoping someone can help me solve.

The Problem

The documents use multilevel list numbering (e.g., 1., 1.1, 1.1.1), and I need those numbers to appear in the extracted text. None of the methods I've tried so far captures them.

Interestingly, if I manually copy the entire text from the DOCX file and paste it into a new document as plain text, the formatted numbering is preserved. This suggests that the information is there, but I'm not able to access it programmatically.

What I've Tried

  1. Using python-docx library:

    from docx import Document
    
    doc = Document('my_file.docx')
    for para in doc.paragraphs:
        print(para.text)
    

    This extracts the text but loses the numbering format.

  2. Converting DOCX to HTML and then parsing:

    import mammoth
    from bs4 import BeautifulSoup
    
    with open("my_file.docx", "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file)
        html = result.value
    
    soup = BeautifulSoup(html, 'html.parser')
    text = soup.get_text()
    print(text)
    

    This approach also fails to preserve the numbering.

  3. Using docx2txt:

    import docx2txt
    
    text = docx2txt.process("my_file.docx")
    print(text)
    

    Again, the numbering is lost.

What I'm Looking For

I'm seeking a method to programmatically extract text from a DOCX file that preserves the formatted numbering, similar to what happens when I manually copy and paste the text.

Ideally, I'm looking for a Python solution, but I'm open to other approaches if they can be integrated into a Python workflow.

Questions

  1. Is there a way to access the underlying numbering information in DOCX files using Python?
  2. Are there any libraries or tools that can handle this specific issue?
  3. If there's no direct way to do this in Python, are there any intermediate steps or file conversions that might help preserve this information?

Any insights, suggestions, or solutions would be greatly appreciated. Thank you in advance for your help!

  • SO is a question and answer site. That means one question per post. In addition, question #2 is a request for a tool or library recommendation, which is expressly indicated as being off-topic in the help center guidelines. You'll find your experiences here will be much better if you take the time to read those help center pages to learn how the site works before you begin posting. Commented Aug 23, 2024 at 23:54
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Commented Aug 25, 2024 at 13:16
  • When it comes to reading numbering from Microsoft Word files, most free APIs are incomplete and fail at this task. This is due to the complex way numbering is stored in Word files. In stackoverflow.com/questions/71497815/… I have shown a way to do this using Apache POI. But python-docx is much more incomplete in this regard. Therefore, there is no way to translate this approach directly to python-docx without writing code that reads the numbering XML at a very low level. Commented Aug 26, 2024 at 4:20
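To make the "low-level XML" route from the last comment concrete, here is a minimal sketch that uses only the standard library. It is not the Apache POI approach from the linked answer, and it is far from complete. It reads `word/document.xml` from the archive, looks at each paragraph's `w:numPr` element (`w:numId` and `w:ilvl`), and rebuilds a label from running counters. The function name `numbered_paragraphs` is made up for illustration. The sketch assumes every list level is decimal, starts at 1, and is rendered as `1.1.`; it ignores the real formats, start values, and overrides stored in `word/numbering.xml`, as well as numbering inherited from paragraph styles.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element):
    """Return the w:val attribute of an element, or None if it is absent."""
    return None if element is None else element.get(f"{{{W}}}val")


def numbered_paragraphs(docx_file):
    """Return paragraph texts, prefixed with a rebuilt decimal list label.

    Assumption (not read from numbering.xml): every level is decimal,
    starts at 1, and is rendered as "1.", "1.1.", "1.1.1.", ...
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    counters = {}  # numId -> list of running counters, one per level
    lines = []
    for para in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        num_pr = para.find("w:pPr/w:numPr", NS)
        num_id = _val(num_pr.find("w:numId", NS)) if num_pr is not None else None
        if num_id in (None, "0"):  # numId 0 means numbering was removed
            lines.append(text)
            continue
        level = int(_val(num_pr.find("w:ilvl", NS)) or 0)
        levels = counters.setdefault(num_id, [])
        del levels[level + 1:]       # deeper levels restart after this item
        while len(levels) <= level:  # levels that were skipped count as 1
            levels.append(0 if len(levels) == level else 1)
        levels[level] += 1
        label = ".".join(str(n) for n in levels) + "."
        lines.append(f"{label} {text}")
    return lines
```

Calling `numbered_paragraphs("my_file.docx")` returns a list of strings, one per paragraph. For documents whose lists use letters, roman numerals, custom label text, or non-default start values, the abstract numbering definitions in `word/numbering.xml` would also have to be resolved, which is the part the comment describes as complex.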
