I’m trying to read headings from multiple docx files. Annoyingly, these headings do not have an identifiable paragraph style. All paragraphs have ‘normal’ paragraph styling so I am using regex. The headings are formatted in bold and are structured as follows:
A. Cat
B. Dog
C. Pig
D. Fox
If there are more than 26 headings in a file then the headings would be preceded with ‘AA.’, ‘BB.’ etc
I have the following code, which kind of works except any heading preceded by ‘D.’ prints twice, e.g. [Cat, Dog, Pig, Fox, Fox]
import os
from docx import Document
import re
directory = input("Copy and paste the location of the files.\n").lower()
for file in os.listdir(directory):
document = Document(directory+file)
head1s = []
for paragraph in document.paragraphs:
heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)
for run in paragraph.runs:
if run.bold:
if heading:
head1 = paragraph.text
head1 = head1.split('.')[1]
head1s.append(head1)
print(head1s)
Can anyone tell me if there is something wrong with the code that is causing this to happen? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.