I have a very large string. There are many paragraphs inside that string. Each paragraph starts with a title and follows a particular pattern.

Example:

== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............

The title follows this pattern:

1.) New Paragraph title starts with an equal to ( = ) and can be followed by any number of =.

2.) After = , there can be a white space ( not necessary though ) and it is followed by text.

3.) After text completion, again there can be a white space ( not necessary ), followed by again any number of equal to's ( = ).

4.) Now the paragraph starts. I have to extract the text until it encounters a similar pattern.

Can anyone help me with how to do this with regex? TIA

4 Answers

You may use

re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)

See the regex demo

Details

  • (?m)^ - start of a line
  • =+ - 1 or more = chars
  • [^\S\r\n]* - zero or more whitespace chars other than CR and LF
  • (.*?) - Group 1: any zero or more chars, other than line break chars, as few as possible
  • [^\S\r\n]* - zero or more whitespace chars other than CR and LF
  • =+ - 1 or more = chars
  • \s* - 0+ whitespaces
  • (.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
    • .* - any zero or more chars, other than line break chars, as many as possible
    • (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of
      • \r?\n(?!=+.*?=) - a line break (an optional CR followed by LF) that is not immediately followed by a title-like line, i.e. 1+ =s, then any chars other than line break chars, as few as possible, and then another =
      • .* - any zero or more chars, other than line break chars, as many as possible
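
The breakdown above can also be written straight into the pattern with re.VERBOSE, so each piece carries its own comment. This is only a sketch of the same expression laid out differently; the names TITLE_AND_BODY and sample are made up for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# TITLE_AND_BODY is an illustrative name, not part of the original answer.
TITLE_AND_BODY = re.compile(r"""
    ^=+                 # a title line starts with one or more '='
    [^\S\r\n]*          # optional horizontal whitespace
    (.*?)               # group 1: the title text, as short as possible
    [^\S\r\n]*          # optional horizontal whitespace
    =+                  # closing run of '='
    \s*                 # whitespace, including the line break after the title
    (                   # group 2: the paragraph body
      .*                #   the rest of the current line
      (?:\r?\n          #   then each following line ...
         (?!=+.*?=)     #   ... unless it looks like a new title
         .*
      )*
    )
""", re.MULTILINE | re.VERBOSE)

sample = "== Title1 ==\nfirst line\nEnd of Paragraph\n===Title2 ===\nsecond body"

# Each match is a (title, body) tuple.
for title, body in TITLE_AND_BODY.findall(sample):
    print(repr(title), repr(body))
```

Note that in verbose mode literal spaces in the pattern are ignored, which is harmless here because whitespace is only matched through the [^\S\r\n] and \s classes.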

Python demo:

import re

rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))

Output:

[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]

Maybe this helps with finding each paragraph's title and the lines of each paragraph.

text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""
import re

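# A title line: one or more '=', optional spaces, a word, optional spaces, one or more '='
reg = re.compile(r'=+\s*\w+\s*=+')

for line in text.split('\n'):
    if reg.search(line):
        # Drop the '=' markers, then strip surrounding whitespace to leave the title text
        title = line.replace('=', '')
        print('Title:', title.strip())
    else:
        print('line:', line.strip())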

 # Output like this
   Title: Title1  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: End of Paragraph
   Title: Title2  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: 
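
If you need the paragraphs themselves rather than printed lines, the same line-by-line idea can collect the lines under each title. The following is a rough sketch, not part of the code above: group_by_title and TITLE_RE are hypothetical names, it assumes titles are unique (a repeated title would overwrite the earlier one), and it drops anything that follows the closing = run on a title line.

```python
import re

# Hypothetical helper: capture the text between the opening and closing '=' runs.
TITLE_RE = re.compile(r'^=+\s*(.*?)\s*=+')

def group_by_title(text):
    """Return a dict mapping each title to the text of its paragraph."""
    sections = {}
    current = None
    for line in text.splitlines():
        m = TITLE_RE.match(line)
        if m:
            # A new title starts a new (initially empty) paragraph.
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            # Lines before the first title are ignored.
            sections[current].append(line)
    return {title: "\n".join(lines) for title, lines in sections.items()}

sample = "== Title1 ==\na\nb\n===Title2 ===\nc\n"
print(group_by_title(sample))
```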

1 Comment

Thanks a lot! I upvoted your answer, but I found the last answer most suitable, hence i accepted it.

You could try this:

import re

x = "== Title1   ==="
# Raw string, so that \s and \w reach the regex engine unchanged
ptrn = r"[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"
if re.search(ptrn, x):
    x = x.replace('=', '').strip()

This leaves x as 'Title1'. If you want all the matching titles in a list, you could do:

x = '== Title1   ===nansnsk fnasasklsanlkas lkaslkans \n== Title2 ==='
titles = [i.replace('=', '').strip() for i in re.findall(ptrn, x)]
print(titles)  # ['Title1', 'Title2']

The pattern is -

"^[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"

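^[=]{1,} - match at least one equal sign at the start (the ^ anchor is optional; the snippets above omit it so that re.findall can match titles anywhere in the string)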

[\s]{0,} - match zero or more whitespace characters

[\w]+ - match [a-zA-Z0-9_] once or more

After a match, we can extract the title text by replacing = with '' and stripping the surrounding spaces. You could try it at regex101, which is really helpful when testing regex.
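
If you do want to keep the ^ anchor and still find every title in a multi-line string, one option is to pass re.MULTILINE so that ^ matches at the start of each line. A small sketch, with a capture group added so that re.findall returns the title text directly; the sample string and variable names are made up for illustration:

```python
import re

# Anchored variant with a capture group around the title word.
anchored = r"^=+\s*(\w+)\s*=+"
sample = "== Title1   ===\nsome body text\n== Title2 ==="

# Without re.MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(anchored, sample)

# With re.MULTILINE, ^ matches at the start of every line.
all_titles = re.findall(anchored, sample, flags=re.MULTILINE)

print(first_only)
print(all_titles)
```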

1 Comment

Thanks a lot! I upvoted your answer, but I found the last answer most suitable, hence i accepted it.

1.) New Paragraph title starts with an equal to ( = ) and can be followed by any number of =.

This can be represented by =+.

2.) After = , there can be a white space ( not necessary though ) and it is followed by text.

3.) After text completion, again there can be a white space ( not necessary ), followed by again any number of equal to's ( = ).

So the pattern for the title becomes: =+[^=]+=+\n, which means, match at least one =, then some text not including =, then again at least one =.

Capturing everything between those patterns will give you the desired text.

In the pattern below, the whole match includes the title, and the first group contains the text.

So finally, your pattern would be: =+[^=]+=+\n([\w\W]+\n)(?==+[^=]+=+\n)
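
One caveat with the pattern above: the greedy [\w\W]+ runs up to the last title when there are more than two, and the lookahead needs a following title, so the final paragraph is never matched. The sketch below is a hedged variation, not the exact expression from this answer: it makes the quantifier lazy, accepts the end of the string via \Z, and keeps newlines out of the title text. It assumes the text ends with a newline; PARAGRAPH_RE and sample are illustrative names.

```python
import re

# Variation on the pattern above (an assumption, not the original expression):
#   [^=\n]+   keeps the title on a single line
#   +?        stops at the nearest following title instead of the last one
#   |\Z       lets the final paragraph end at the end of the string
PARAGRAPH_RE = re.compile(r"=+[^=\n]+=+\n([\w\W]+?\n)(?==+[^=\n]+=+\n|\Z)")

sample = (
    "== Title1 ==\n"
    "aaa\n"
    "bbb\n"
    "===Title2 ===\n"
    "ccc\n"
    "== Title3 ==\n"
    "ddd\n"
)

# findall returns the first group, i.e. the text of each paragraph.
paragraphs = PARAGRAPH_RE.findall(sample)
print(paragraphs)
```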

Demo

1 Comment

Thanks a lot! I upvoted your answer, but I found the last answer most suitable, hence i accepted it.
