Parsing docx files in Python

Question

I’m trying to read headings from multiple docx files. Annoyingly, these headings do not have an identifiable paragraph style. All paragraphs have ‘normal’ paragraph styling so I am using regex. The headings are formatted in bold and are structured as follows:

A. Cat

B. Dog

C. Pig

D. Fox

If there are more than 26 headings in a file then the headings would be preceded with ‘AA.’, ‘BB.’ etc

I have the following code, which kind of works except any heading preceded by ‘D.’ prints twice, e.g. [Cat, Dog, Pig, Fox, Fox]

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)

    print(head1s)

Can anyone tell me if there is something wrong with the code that is causing this to happen? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.

glycoaddict · Accepted Answer · 2018-05-24 11:03:58Z

what's happening is the the loop is continuing past D.Fox, and so in this new loop, even though there is no match, it is printing the last value of head1, which is D.Fox.

I think it is the for run in paragraph.runs: that is somehow running twice, maybe there's a second "run" that is there but invisible?

Perhaps adding a break when the first match is found is enough to prevent the second run triggering?

for file in os.listdir(directory):

document = Document(directory+file)

head1s = []

for paragraph in document.paragraphs:

    heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

    for run in paragraph.runs:

        if run.bold:

            if heading:
                head1 = paragraph.text
                head1 = head1.split('.')[1]
                head1s.append(head1)
                # this break stops the run loop if a match was found.
                break

print(head1s)

user1890239 · Accepted Answer · 2022-06-23 19:01:38Z

0

You can also run use the style.name from the same library

def find_headings(doc_path):
#find headings in doc
doc = docx.Document(doc_path)
headings = []
for i, para in doc.paragraphs:
    if para.style.name == 'Heading 1':
        headings.append(para.text)
return headings

answered Jun 23, 2022 at 19:01

user1890239

794 bronze badges

Collectives™ on Stack Overflow

Parsing docx files in Python

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related