Extracting text from string using Regex

Question

I have a very large string. There are many paragraphs inside that string. Each paragraph starts with a title and follows a particular pattern.

Example:

== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............

The title follows this pattern:

1.) A new paragraph title starts with an equals sign ( = ), which can be followed by any number of further = characters.

2.) After the = characters there can be whitespace (optional), followed by the title text.

3.) After the title text there can again be whitespace (optional), followed by any number of = characters.

4.) Then the paragraph starts. I need to extract its text up to the next line that matches the same title pattern.

Can anyone help me with how to do this with regex? TIA

I was trying this solution: stackoverflow.com/questions/1240504/… – Gopal Chitalia, commented Aug 17, 2018 at 11:06

Wiktor Stribiżew · Accepted Answer · 2018-08-17 11:45:57Z

You may use

re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)

See the regex demo

Details

(?m)^ - start of a line
=+ - 1 or more = chars
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
(.*?) - Group 1: any zero or more chars, other than line break chars, as few as possible
[^\S\r\n]* - zero or more whitespace chars other than CR and LF
=+ - 1 or more = chars
\s* - 0+ whitespaces
(.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
- .* - any zero or more chars, other than line break chars, as many as possible
- (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of
  - \r?\n(?!=+.*?=) - an optional CR and then an LF, provided the line that follows does not look like a title, i.e. it does not start with 1+ =s, then any chars other than line break chars, as few as possible, and then another =
  - .* - any zero or more chars, other than line break chars, as many as possible

Python demo:

import re

rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))

Output:

[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
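
If the goal is to look paragraphs up by title, the same regex can be written in verbose mode and fed into a dictionary. This is only a sketch: the names TITLE_AND_BODY and split_sections and the sample text are made up for illustration, the flags are passed to re.compile instead of using the inline (?m), and duplicate titles would overwrite each other.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
TITLE_AND_BODY = re.compile(
    r"""
    ^=+ [^\S\r\n]*          # opening run of '=' and optional spaces/tabs
    (?P<title>.*?)          # title text, as short as possible
    [^\S\r\n]* =+ \s*       # optional spaces/tabs, closing '=' run, whitespace
    (?P<body>
        .*                          # rest of the current line
        (?:\r?\n (?!=+.*?=) .*)*    # following lines that do not look like a title
    )
    """,
    re.MULTILINE | re.VERBOSE,
)


def split_sections(text):
    """Return a dict mapping each title to its paragraph text."""
    return {m.group("title"): m.group("body") for m in TITLE_AND_BODY.finditer(text)}


sample = "== Title1 ==\nfirst line\nsecond line\n===Title2 ===\nthird line"
sections = split_sections(sample)
print(sections)
```

Named groups keep the calling code readable, since m.group("title") does not depend on the order of the parentheses.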

utks009 · Answer · 2018-08-17 11:13:53Z

2

Maybe this helps with finding each paragraph's title and the lines of each paragraph.

text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""
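import re

# A title line: one or more '=', optional spaces, a single word, optional spaces, one or more '='
reg = re.compile(r'(?:[=]+\s*\w+\s*[=]+)')

for i in text.split('\n'):
    if reg.search(i):
        # Drop every '=' and trim the surrounding whitespace
        t = i.replace('=', '')
        print('Title:', t.strip())
    else:
        print('line:', i.strip())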

Output:

Title: Title1  // Paragraph starts
line: .............
line: ............. // Some texts
line: .............
line: End of Paragraph
Title: Title2  // Paragraph starts
line: .............
line: ............. // Some texts
line: .............
line:
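
To collect the paragraph text instead of printing it line by line, the same loop can group lines under the most recent title. A minimal sketch, assuming every title sits on its own line; TITLE_RE and group_by_title are hypothetical names, and the title pattern is loosened slightly so that titles with several words are also captured.

```python
import re

# '=' run, optional whitespace, title text (lazy), optional whitespace, '=' run
TITLE_RE = re.compile(r'^=+\s*(.*?)\s*=+')


def group_by_title(text):
    """Return a list of (title, paragraph) tuples in document order."""
    sections = []
    for line in text.splitlines():
        m = TITLE_RE.match(line)
        if m:
            # A new title opens a new, still empty, paragraph
            sections.append((m.group(1), []))
        elif sections:
            # Any other line belongs to the most recent title
            sections[-1][1].append(line)
    return [(title, "\n".join(lines)) for title, lines in sections]


sample = "== Title1 ==\naaa\nbbb\n===Title2 ===\nccc\n"
grouped = group_by_title(sample)
print(grouped)
```

Lines that appear before the first title are skipped in this sketch.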

1 Comment

Gopal Chitalia Over a year ago

Thanks a lot! I upvoted your answer, but I found the last answer most suitable, hence i accepted it.

Sushant · Answer · 2018-08-17 11:21:41Z

2

You could try this -

import re

x = "== Title1   ==="
# Raw string, so \s and \w reach the regex engine unchanged
ptrn = r"[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"
if re.search(ptrn, x):
    x = x.replace('=', '').strip()

This gives you Title1. Or, if you want all the matching titles in a list, you could do -

x = '== Title1   ===nansnsk fnasasklsanlkas lkaslkans \n== Title2 ==='
titles = [i.replace('=', '').strip() for i in re.findall(ptrn, x)]
# Output: ['Title1', 'Title2']

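The pattern is -

"^[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"

^[=]{1,} - match at least one equal sign at the start (the code above leaves out the ^; if you keep it, pass re.MULTILINE so that it matches at the start of every line, not only at the start of the string)

[\s]{0,} - match zero or more whitespace characters

[\w]+ - match [a-zA-Z0-9_] once or more

[\s]{0,}[=]{1,} - match optional whitespace again, then at least one closing equal sign

After that, we can extract the title by replacing = with '' and stripping the surrounding spaces. You could try it at regex101, which is really helpful when testing regexes.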
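
As a variation on the replace-and-strip step, a capture group can pull the title out directly. This is a sketch only: title_re and the sample text are hypothetical, and like the pattern above it assumes single-word titles, because \w+ does not match spaces.

```python
import re

# re.MULTILINE lets ^ match at the start of every line, not only at the start of the string
title_re = re.compile(r"^=+\s*(\w+)\s*=+", re.MULTILINE)

text = "== Title1   ===\nsome text\n== Title2 ==="
# With exactly one capture group, findall returns the captured titles themselves
titles = title_re.findall(text)
print(titles)
```

This avoids a second pass over each match.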

answered Aug 17, 2018 at 11:05, edited Aug 17, 2018 at 11:21, by Sushant

1 Comment

Gopal Chitalia Over a year ago

Thanks a lot! I upvoted your answer, but I found the last answer most suitable, hence i accepted it.

Michał Turczyn · Answer · answered 2018-08-17, edited by Community 2020-06-20 09:12:55Z

1

1.) New Paragraph title starts with an equal to ( = ) and can be followed by any number of =.

This can be represented by =+.

2.) After = , there can be a white space ( not necessary though ) and it is followed by text.

3.) After text completion, again there can be a white space ( not necessary ), followed by again any number of equal to's ( = ).

So the pattern for the title becomes: =+[^=]+=+\n, which means: match at least one =, then some text not including =, then again at least one =, followed by a line break.

Capturing everything between two such title patterns gives you the desired text.

In the pattern below, the whole match includes the title, and the first group contains the paragraph text.

So finally, your pattern would be: =+[^=]+=+\n([\w\W]+\n)(?==+[^=]+=+\n)

Demo
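
As written, the lookahead requires another title after each paragraph, so the last paragraph is not matched, and the greedy [\w\W]+ can run across intermediate titles when there are more than two. The sketch below is a variation on the pattern, not the original: it uses a lazy quantifier, also accepts the end of the string as a stopping point, and keeps title text on a single line with [^=\n]. The sample text is made up, and a body line that itself looks like a title (text between runs of =) would still end the paragraph early.

```python
import re

pattern = re.compile(
    r"=+[^=\n]+=+\n"               # title line: '=' run, text without '=', '=' run
    r"([\w\W]+?)"                  # paragraph text, as short as possible
    r"(?==+[^=\n]+=+\n|\Z)"        # stop before the next title or at end of string
)

text = "== Title1 ==\naaa\nbbb\n===Title2 ===\nccc\n"
# findall returns only group 1, i.e. the paragraph bodies
bodies = pattern.findall(text)
print(bodies)
```

Each body keeps its trailing newline; call strip() on it if that is not wanted.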

answered Aug 17, 2018 at 11:04 by Michał Turczyn; edited Jun 20, 2020 at 9:12 by Community Bot

1 Comment

Gopal Chitalia Over a year ago

Thanks a lot! I upvoted your answer, but I found the last answer most suitable, hence i accepted it.
