I'm reading in a PDF and writing the extracted content to an XML file. The code is based on https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467
However, I changed it to output XML, as I prefer that for the overall project I have in mind. The tags aren't correct yet either, so I will fix those another time.
I'm struggling to understand how to use regex to extract chapter and subtopic numbers such as 1.2.
Note: There aren't really spaces in the tags; they're only there to stop the tags being removed in this question.
elements = "< s1 >2. CURRENT STATE OF THE ART|
< s1 >2.1. Blah blah blah|
< p >This is the main text.This is the main text.This is the main text."
path = "PUT YOUR PATH"
title = filename.rsplit(".", 1)[0]
open((path + title + ".xml"), "w+")
with open((path + title + '.xml'), "a") as f:
    for p in elements:
        f.writelines("\t" + p + "\n\t")
ch = re.search("[%d.%d]", str(elements))
print(ch)
elements can't be split across multiple lines like that. You should use a triple-quoted """ literal. I am also not sure why you needed to add a space in <s1>.
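On the regex itself: "[%d.%d]" is a character class, so it matches a single character that is either %, d or a dot; %d is not regex syntax for a digit (that is \d). Here is a minimal sketch of one way to pull out the numbers, assuming every heading number sits directly after the closing > of its tag and that the sample is stored in a triple-quoted string. The name number_pattern is just for illustration, and the tags are written without the extra spaces.

```python
import re

# Sample data copied from the question, with the spaces inside the tags removed.
elements = """<s1>2. CURRENT STATE OF THE ART|
<s1>2.1. Blah blah blah|
<p>This is the main text.This is the main text.This is the main text."""

# After a closing ">", capture one or more digits, optionally followed by
# further ".digits" groups (2, 2.1, 1.2.3, ...). The trailing dot is not captured.
number_pattern = re.compile(r">\s*(\d+(?:\.\d+)*)")

# findall returns the captured group for every match in the string.
numbers = number_pattern.findall(elements)
print(numbers)  # ['2', '2.1']
```

Anchoring on the > keeps the pattern from picking up numbers that appear inside the paragraph text. If your real headings can have other characters between the tag and the number, the pattern would need adjusting.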