
I'm reading in a PDF and writing the extracted content to an XML file. The code is based on https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

However, I changed it to output XML, which I prefer for the overall project I have in mind. The tags aren't correct yet either; I'll fix that another time.

I'm struggling to understand how to use a regex to extract chapter and sub-topic numbers such as 1.2.

Note: There aren't spaces in the real tags; they're only there to stop the tags being stripped from this question.

elements =  "< s1 >2. CURRENT STATE OF THE ART|
             < s1 >2.1. Blah blah blah|
             < p >This is the main text.This is the main text.This is the main text."

    path = "PUT YOUR PATH"
    title = filename.rsplit(".", 1)[0]
    open((path + title + ".xml"), "w+")

    with open((path + title + '.xml'), "a") as f:
        for p in elements:

            f.writelines("\t"+p + "\n\t")

            ch = re.search("[%d.%d]", str(elements))
            print(ch)
  • in.mathworks.com/matlabcentral/answers/… I am not marking this as an answer because I'm just referring you to it. Commented Dec 23, 2021 at 13:23
  • In addition to the indentation errors, your input string for elements can't be split across multiple lines like that. You should use a triple-quoted """ literal. I am also not sure why you needed to add a space in <s1>. Commented Dec 25, 2021 at 13:40

1 Answer


Use this as your regex:

<[^<>]+>(\d+)\.(\d*)

The chapter number will be in capture group 1 and the sub-topic number, which may be an empty string, will be in capture group 2.

import re

elements =  """<s1>2. CURRENT STATE OF THE ART|
             <s1>2.1. Blah blah blah|
             <p>This is the main text.This is the main text.This is the main text."""

regex = r'''(?x)  # Verbose mode: whitespace and comments in the pattern are ignored
<               # Matches '<'
[^<>]+          # Matches one or more characters that are neither '<' nor '>'
>               # Matches '>'
(\d+)           # Capture group 1: Matches 1 or more digits
\.              # Matches '.'
(\d*)           # Capture group 2: Matches 0 or more digits
'''

chapters = re.findall(regex, elements)
for chapter in chapters:
    print(f'{chapter[0]}.{chapter[1]}')

Prints:

2.
2.1
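
If the headings can nest deeper than one level (for example 2.1.3), or if you would rather not keep the trailing dot, the same idea can be extended. The following is only a sketch under assumptions: the `<s2>` line, the helper name `heading_numbers`, and the extra sentence in the `<p>` text are made up for illustration and are not part of the question's data. It captures the tag name and the whole dotted number, then splits the number into integers.

```python
import re

# Sample input; the <s2> line and the "4.2" mention are hypothetical additions.
elements = """<s1>2. CURRENT STATE OF THE ART|
<s1>2.1. Blah blah blah|
<s2>2.1.3. Deeper heading|
<p>This is the main text. See section 4.2 for details."""

# Group 1: tag name. Group 2: dotted number such as 2, 2.1 or 2.1.3.
# The optional trailing dot is matched but kept out of the capture.
HEADING = re.compile(r"<([^<>]+)>\s*(\d+(?:\.\d+)*)\.?")


def heading_numbers(text):
    """Return (tag, [int, ...]) pairs for each tag directly followed by a number."""
    return [
        (m.group(1), [int(part) for part in m.group(2).split(".")])
        for m in HEADING.finditer(text)
    ]


for tag, parts in heading_numbers(elements):
    print(tag, ".".join(map(str, parts)))
```

Because the number has to come directly after a tag, the 4.2 inside the paragraph text is not picked up. Keeping the parts as a list of integers also makes it easy to tell a chapter (one part) from a sub-topic (two or more parts).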