I want to find all occurrences of a specific term (and its variations) in a Word document. These are the steps:
- Extract the text from the Word document
- Try to find the pattern via regex
Extract Text from Word Document
The document variable holds the text extracted by the following function, getText(filename):
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Find Pattern with Regex
The terms I am looking for start with DOC-, and the hyphen is followed by 9 digits.
I have tried the following without success:
With start and end line markers:
pattern = re.compile('^DOC\.\d{9}$')
pattern.findall(document)
Without:
pattern = re.compile('DOC\.\d{9}')
pattern.findall(document)
Can someone help me?
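For reference, here is a minimal sketch of a pattern that fits the description above. It assumes the separator really is a hyphen: in the attempts shown, `\.` matches a literal period rather than a hyphen, and `^`/`$` without `re.MULTILINE` anchor to the start and end of the whole string. The sample text is made up and stands in for the output of getText(filename).

```python
import re

# Hypothetical sample text standing in for the string returned by getText(filename).
document = "See DOC-123456789 and DOC-987654321.\nOld id: DOC-12345 (too short)."

# Literal hyphen after DOC, then nine digits.
# \b avoids matching inside a longer word; (?!\d) rejects runs of more than nine digits.
pattern = re.compile(r"\bDOC-\d{9}(?!\d)")

matches = pattern.findall(document)
print(matches)  # ['DOC-123456789', 'DOC-987654321']
```

If the "variations" include different casing (for example doc-123456789), passing `re.IGNORECASE` as the second argument to `re.compile` is one option, though that is an assumption about what the variations look like.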