How can I find a repetitive decimal string with Regex and Python?

Question

I'm reading in a PDF and writing its contents to an XML file. The code is based on https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

However, I changed it to output XML, since that suits my overall project better. The tags aren't correct yet either; I'll fix that another time.

I'm struggling to understand how to use regex to extract chapter and sub-topic numbers such as 1.2.

Note: There aren't any spaces in the real tags; I added them only to stop the tags being removed in this question.

elements =  "< s1 >2. CURRENT STATE OF THE ART|
             < s1 >2.1. Blah blah blah|
             < p >This is the main text.This is the main text.This is the main text."

    path = "PUT YOUR PATH"
    title = filename.rsplit(".", 1)[0]
    open((path + title + ".xml"), "w+")

    with open((path + title + '.xml'), "a") as f:
        for p in elements:

            f.writelines("\t"+p + "\n\t")

            ch = re.search("[%d.%d]", str(elements))
            print(ch)

in.mathworks.com/matlabcentral/answers/… I am not marking this as an answer because I'm just referring you to it.
– Shravya Boggarapu, Commented Dec 23, 2021 at 13:23
In addition to the indentation errors, your input string for elements can't be split across multiple lines like that; you should use a triple-quoted (""") literal. I'm also not sure why you needed to add a space in <s1>.
– Booboo, Commented Dec 25, 2021 at 13:40

Nimantha · Accepted Answer · 2022-01-03 07:00:13Z

Use this as your regex:

<[^<>]+>(\d+)\.(\d*)

The chapter number will be in capture group 1 and the sub-topic number, which may be an empty string, will be in capture group 2.
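As a quick illustration on a single element, here is a minimal sketch of how the two groups are read, alongside what the question's "[%d.%d]" attempt actually does. Inside square brackets, %, d and . are just literal characters in a character class (Python's re has no printf-style %d), so that pattern matches one character only. The sample string and variable names here are assumptions made for the example.

```python
import re

# Hypothetical single element, in the same shape as the question's data.
element = "<s1>2.1. Blah blah blah"

# The question's pattern is a character class: it matches ONE character
# that is '%', 'd' or '.', so here it finds the first '.' in the string.
wrong = re.search("[%d.%d]", element)
print(wrong.group(0))  # .

# The pattern from this answer, applied to the same element.
match = re.search(r"<[^<>]+>(\d+)\.(\d*)", element)
chapter, subtopic = match.group(1), match.group(2)
print(chapter, subtopic)  # 2 1
```

The full version, with the pattern written out in verbose form and applied to the whole input, follows.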

import re

elements =  """<s1>2. CURRENT STATE OF THE ART|
             <s1>2.1. Blah blah blah|
             <p>This is the main text.This is the main text.This is the main text."""

regex = r'''(?x)# Verbose mode: whitespace is ignored and comments are allowed
<               # Matches '<'
[^<>]+          # Matches one or more characters that are neither '<' nor '>'
>               # Matches '>'
(\d+)           # Capture group 1: Matches 1 or more digits
\.              # Matches '.'
(\d*)           # Capture group 2: Matches 0 or more digits
'''

chapters = re.findall(regex, elements)
for chapter in chapters:
    print(f'{chapter[0]}.{chapter[1]}')

Prints:

2.
2.1
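If the headings can nest deeper than two levels (for example 2.1.3), one possible extension is to capture the whole dotted number in a single group. The following is only a sketch under assumptions: the third heading, the s2 tag, and the names NUMBER and section_numbers are made up for illustration, and the elements are assumed to be separated by "|" as in the sample. Unlike the pattern above, it leaves the trailing dot out of the capture, so "2." is reported as "2".

```python
import re

# Hypothetical input: one more nesting level than the question's sample.
elements = """<s1>2. CURRENT STATE OF THE ART|
<s1>2.1. Blah blah blah|
<s2>2.1.3. Deeper heading|
<p>This is the main text."""

# After the tag: digits, then any number of ".digits" groups.
# The trailing dot (as in "2." or "2.1.") is not part of the capture.
NUMBER = re.compile(r"<[^<>]+>(\d+(?:\.\d+)*)")

def section_numbers(text):
    """Return the dotted number of each '|'-separated element that has one."""
    numbers = []
    for element in text.split("|"):
        match = NUMBER.search(element)
        if match:
            numbers.append(match.group(1))
    return numbers

print(section_numbers(elements))  # ['2', '2.1', '2.1.3']
```

Note that any element whose text starts with a digit right after the tag would also match, so it may be worth restricting the tag part of the pattern to the heading tags you actually use.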

Answered by Booboo, Dec 25, 2021 at 13:24; edited by Nimantha, Jan 3, 2022 at 7:00
