3

I’m trying to read headings from multiple docx files. Annoyingly, these headings do not have an identifiable paragraph style. All paragraphs have ‘normal’ paragraph styling so I am using regex. The headings are formatted in bold and are structured as follows:

A. Cat

B. Dog

C. Pig

D. Fox

If there are more than 26 headings in a file then the headings would be preceded with ‘AA.’, ‘BB.’ etc

I have the following code, which kind of works except any heading preceded by ‘D.’ prints twice, e.g. [Cat, Dog, Pig, Fox, Fox]

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)

    print(head1s)

Can anyone tell me if there is something wrong with the code that is causing this to happen? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.

2 Answers 2

2

what's happening is the the loop is continuing past D.Fox, and so in this new loop, even though there is no match, it is printing the last value of head1, which is D.Fox.

I think it is the for run in paragraph.runs: that is somehow running twice, maybe there's a second "run" that is there but invisible?

Perhaps adding a break when the first match is found is enough to prevent the second run triggering?

for file in os.listdir(directory):

document = Document(directory+file)

head1s = []

for paragraph in document.paragraphs:

    heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

    for run in paragraph.runs:

        if run.bold:

            if heading:
                head1 = paragraph.text
                head1 = head1.split('.')[1]
                head1s.append(head1)
                # this break stops the run loop if a match was found.
                break

print(head1s)
Sign up to request clarification or add additional context in comments.

Comments

0

You can also run use the style.name from the same library

def find_headings(doc_path):
#find headings in doc
doc = docx.Document(doc_path)
headings = []
for i, para in doc.paragraphs:
    if para.style.name == 'Heading 1':
        headings.append(para.text)
return headings

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.