Regex pattern for string - python

Question

I would like to group a string in this format:

Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5

I would like to group everything from BEGIN to END, like this:

Some_text Some_text 1 2 3
<!-- START -->
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END <!-- END --> Some_text

Some_Text Some_text 1 4 5

<!-- START --> and <!-- END --> are just comments marking the start and end of the grouping. I want to get only the text between BEGIN and END.

I have something like this, but it doesn't work in every case; when there is a lot of data, it fails:

reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")

text is my string; after grouping I convert the result to a list. I don't know how to apply this regex directly to a list, so that I would not have to work on the string text but on a Python list instead.

Any tips or help on how I can solve this would be appreciated.

Sample code:

import re
text="Some_text Some_text 1 2 3\nBEGIN Some_text Some_text\n44 76 1321\nSome_text Some_text\nEND Some_text\nSome_Text Some_text 1 4 5"

begin = "BEGIN"
end = "END"
reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")

print(lines)

It works, but I don't know why it sometimes doesn't when the input is large, e.g. 20k words. I want to get only the text between BEGIN and END.

It would be helpful if you had minimal working code that could be copied and pasted and that reproduces the incorrect results. Not sure why you are using the r' ' raw string format. That could cause problems with backslashes.
– Bobby Ocean, Commented May 10, 2020 at 20:47
@hacker315: [.] is just the literal . -- not the regex metacharacter...
– dawg, Commented May 10, 2020 at 20:55
@BobbyOcean Here is a sample code snippet. I can't upload the whole thing because it is only a fragment of the project, but the main idea is preserved here. I don't fully understand why it doesn't work every time, although it should; I am certainly doing something wrong.

text="Some_text Some_text 1 2 3\nBEGIN Some_text Some_text\n44 76 1321\nSome_text Some_text\nEND Some_text\nSome_Text Some_text 1 4 5"
begin = "BEGIN"
end = "END"
reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")
– DeepSea, Commented May 10, 2020 at 20:59
What do you mean by: "text is my string and then after grouping I exchange it for a list - I don't know how to make this regex directly from the list, then I would not have to do it on string text but on python list text"?
– dawg, Commented May 10, 2020 at 21:23

The fourth bird · Accepted Answer · 2020-05-11 08:57:02Z

1

You might use

^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND

If you want to include BEGIN and END in the result, you can omit the capturing group; re.findall then returns the whole match:

^BEGIN\b.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*\r?\nEND

Code example

import re

regex = r"^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND"

test_str = ("Some_text Some_text 1 2 3\n"
    "BEGIN Some_text Some_text\n"
    "44 76 1321\n"
    "Some_text Some_text\n"
    "END Some_text\n"
    "Some_Text Some_text 1 4 5\n")

print(re.findall(regex, test_str, re.MULTILINE))

Output

[' Some_text Some_text\n44 76 1321\nSome_text Some_text']
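
As a reading aid, here is the same pattern written with re.VERBOSE so that each part carries a comment. This is only a sketch: the name BLOCK and the two-block sample input are made up for illustration and do not come from the question.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each piece can be annotated.
BLOCK = re.compile(
    r"""
    ^BEGIN\b                 # BEGIN at the start of a line, as a whole word
    (                        # group 1: the text between the markers
      .*                     #   the rest of the BEGIN line
      (?:                    #   followed by zero or more further lines:
        \r?\n                #     a line break
        (?!(?:BEGIN|END)\b)  #     the next line must not start with BEGIN or END
        .*                   #     the line itself
      )*
    )
    \r?\nEND                 # a line break, then END at the start of a line
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical input containing two BEGIN ... END blocks
sample = (
    "header 1 2 3\n"
    "BEGIN first\n"
    "44 76 1321\n"
    "END tail\n"
    "filler\n"
    "BEGIN second\n"
    "7 8 9\n"
    "END\n"
)

blocks = BLOCK.findall(sample)
print(blocks)  # [' first\n44 76 1321', ' second\n7 8 9']
```

Because the negative lookahead is checked before every new line, each match stops at the first line that starts with END, so several blocks in one text come back as separate list items.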

answered May 11, 2020 at 8:57


8 Comments

DeepSea Over a year ago

It should work, but how do you do it with a list, where the whole string is list[0]?

The fourth bird Over a year ago

@DeepSea All the found items are in the return value of re.findall, which will "Return all non-overlapping matches of pattern in string, as a list of strings." What do you mean by "do it with a list"?

DeepSea Over a year ago

The point is that I work with long texts that I read from a text file; sometimes it has 5 pages, sometimes 60. I store each page as one element of a list, e.g. the whole first page of the file is list[0], the whole second page is list[1], etc. I want this regex to work across whole pages: if I take the text from 5 pages, I have list[0] to list[4], and I want the regex to extract data that spans list items as well. For example, if BEGIN is in list[1] and END is in list[3], I want to pull out everything from BEGIN in list[1], through all of list[2], up to END in list[3].

The fourth bird Over a year ago

@DeepSea You can run the regex on each item in the list (items 0-4) to get the matches per list item. If you also want to run it across all the list items (so across all the pages), you can first merge all the items into a single string and then run the regex again.

The fourth bird Over a year ago

Then you can take the nth occurrence in the result list.
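
The merge-then-match idea from these comments could look roughly like the sketch below. The pages list is invented for illustration, and joining with "\n" assumes that every page boundary is also a line boundary; if a page can end in the middle of a line, a different separator would be needed.

```python
import re

# The pattern from the answer above, compiled once
BLOCK = re.compile(
    r"^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND",
    re.MULTILINE,
)

# Hypothetical pages: BEGIN is in pages[1], END is in pages[3]
pages = [
    "intro 1 2 3",
    "more intro\nBEGIN Some_text\n44 76 1321",
    "middle page line",
    "last data line\nEND Some_text\ntrailer 1 4 5",
]

# Merge all pages into one string (assumption: pages break at line ends)
text = "\n".join(pages)

# One list of lines per BEGIN ... END block found in the merged text
blocks = [m.group(1).splitlines() for m in BLOCK.finditer(text)]
print(blocks)
# [[' Some_text', '44 76 1321', 'middle page line', 'last data line']]
```

blocks[n] is then the nth group as a list of lines, which covers picking out one specific BEGIN ... END group when there are several.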


dawg · Accepted Answer · 2020-05-10 20:57:49Z

0

This works:

txt='''\
Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5'''

import re

print(re.sub(r'(?=BEGIN )(.*END)',r'<!-- START -->\n\1 <!-- END -->',txt,flags=re.S))

Or,

print(re.sub(r'(?=^BEGIN )([\s\S]*END)',r'<!-- START -->\n\1 <!-- END -->',txt, flags=re.M))

Either prints:

Some_text Some_text 1 2 3
<!-- START -->
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END <!-- END --> Some_text
Some_Text Some_text 1 4 5
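
One caveat, shown as a sketch on made-up input: with a greedy .* (or [\s\S]*) the match runs to the last END in the text, so two separate blocks are treated as one. A lazy quantifier plus a line-start anchor on END is one possible way around that; the variable names and the sample text here are assumptions for illustration.

```python
import re

# Hypothetical input with two BEGIN ... END blocks
txt = (
    "BEGIN a\n"
    "1\n"
    "END x\n"
    "between\n"
    "BEGIN b\n"
    "2\n"
    "END y\n"
)

# Greedy: .* backtracks from the end of the text, so it stops at the LAST END
greedy = re.findall(r"^BEGIN .*END", txt, flags=re.S | re.M)

# Lazy: .*? stops at the FIRST line that starts with END
lazy = re.findall(r"^BEGIN .*?^END", txt, flags=re.S | re.M)

print(len(greedy))  # 1
print(lazy)         # ['BEGIN a\n1\nEND', 'BEGIN b\n2\nEND']

# The same lazy idea applied to the substitution used above
marked = re.sub(
    r"(?=^BEGIN )(.*?^END)",
    r"<!-- START -->\n\1 <!-- END -->",
    txt,
    flags=re.S | re.M,
)
print(marked)
```

Greedy matching over a large input that contains more than one END may be the reason the original pattern in the question appears to fail on long texts.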

edited May 10, 2020 at 20:57

answered May 10, 2020 at 20:50


1 Comment

DeepSea Over a year ago

Sorry I wrote it a little wrong - I added the sample code to the first entry for clarification

RootTwo · Accepted Answer · 2020-05-11 01:24:42Z

0

This uses a non-greedy pattern to capture everything between the beginning marker and the end marker. The markers themselves are not part of the match, because the lookbehind (?<=...) and the lookahead (?=...) only assert that they are present. The \b anchors in the regex pattern make sure that BEGIN and END aren't part of a longer word, e.g., so "BEGIN" won't match "BEGINS" or "BEGINNING". Note: it may not work properly for input with mismatched markers, such as "a b c BEGIN d e BEGIN 1 2 END 3" (two BEGINs).

import re

txt='''\
Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5'''

begin = 'BEGIN'
end = 'END'

regex = re.compile(rf"(?<=\b{begin}\b)(.*?)(?=\b{end}\b)", flags=re.DOTALL)

match = regex.search(txt)

if match:
    print(match[1])
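
To make the note about mismatched markers concrete, here is a small sketch that runs the same pattern on the two-BEGIN example string from the note. The variable names mismatched and captured are made up for illustration.

```python
import re

begin = "BEGIN"
end = "END"

# Same pattern as above
regex = re.compile(rf"(?<=\b{begin}\b)(.*?)(?=\b{end}\b)", flags=re.DOTALL)

# The mismatched input from the note: two BEGINs, one END
mismatched = "a b c BEGIN d e BEGIN 1 2 END 3"

match = regex.search(mismatched)
captured = match[1] if match else None

# The capture starts after the FIRST BEGIN, so the second BEGIN ends up inside it
print(repr(captured))  # ' d e BEGIN 1 2 '
```

Whether the capture should start at the first or at the last BEGIN in such input is a design decision; this pattern starts at the first one.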

edited May 11, 2020 at 1:24

answered May 10, 2020 at 21:22


3 Comments

DeepSea Over a year ago

<!-- START --> and <!-- END --> are just comments marking the start and end of the grouping. I want to get only the text between BEGIN and END.

RootTwo Over a year ago

@DeepSea, I misunderstood what you wanted. I think the revised answer is what you are after.

DeepSea Over a year ago

How can I do it on the list? In my case this narrows it down a little better, but it still breaks when there are, for example, two consecutive newlines or other additional whitespace. It seems to me that it is best to work on the list and not convert it to a string, because that is where it breaks. I also don't know how to display specific groups if there is more than one BEGIN <-> END group.
