0

I would like to group a string in this format:

Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5

I would like to group it from BEGIN to END, like this:

Some_text Some_text 1 2 3
<!-- START -->
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END <!-- END --> Some_text

Some_Text Some_text 1 4 5

<!-- START --> and <!-- END --> are just comments marking the start and end of the group. I want to get only the text between BEGIN and END.

I have something like this, but it doesn't work in every case. When there is a lot of data, it fails:

reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")

text is my string; after grouping I convert the result to a list. I don't know how to apply this regex directly to a list, so that I could work on a Python list instead of a string.

Please give me some tips on how I can solve this.

Sample code:

import re
text="Some_text Some_text 1 2 3\nBEGIN Some_text Some_text\n44 76 1321\nSome_text Some_text\nEND Some_text\nSome_Text Some_text 1 4 5"

begin = "BEGIN"
end = "END"
reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL)
core = re.search(reg, text).group(1)
lines = core.split("\n")

print(lines)

It works here, but I don't know why it sometimes fails when the input is large, e.g. 20k words. I want to get only the text between BEGIN and END.

11
  • It would be helpful if you had minimal working code that could be copied and pasted and that produced the results you consider incorrect. Not sure why you are using the r' ' raw string format. That could cause problems with backslashes. Commented May 10, 2020 at 20:47
  • try : rf"^BEGIN[.\n]*\nEND" Commented May 10, 2020 at 20:54
  • @hacker315: [.] is just the literal . -- not the regex metacharacter... Commented May 10, 2020 at 20:55
  • @BobbyOcean Sample code snippet - I'm not able to upload the whole because it is only a fragment of the project, but the main idea is preserved here. I don't fully understand why it doesn't work every time, although it should - I am certainly doing something wrong. text="Some_text Some_text 1 2 3\nBEGIN Some_text Some_text\n44 76 1321\nSome_text Some_text\nEND Some_text\nSome_Text Some_text 1 4 5" begin = "BEGIN" end = "END" reg = re.compile(rf"{begin}[\-\s]+(.*)\n{end}", re.DOTALL) core = re.search(reg, text).group(1) lines = core.split("\n") Commented May 10, 2020 at 20:59
  • What do you mean by: text is my string and then after grouping I exchange it for a list - I don't know how to make this regex directly from the list, then I would not have to do it on string text but on python list text Commented May 10, 2020 at 21:23

3 Answers 3

1

You might use

^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND

If you want to include BEGIN and END, you can omit the capturing group

^BEGIN\b.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*\r?\nEND

Code example

import re

regex = r"^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND"

test_str = ("Some_text Some_text 1 2 3\n"
    "BEGIN Some_text Some_text\n"
    "44 76 1321\n"
    "Some_text Some_text\n"
    "END Some_text\n"
    "Some_Text Some_text 1 4 5\n")

print(re.findall(regex, test_str, re.MULTILINE))

Output

[' Some_text Some_text\n44 76 1321\nSome_text Some_text']

8 Comments

it should work but how do you do it with a list? Where is the whole string letter [0]?
@DeepSea All the found items are in the return value of re.findall, which will "Return all non-overlapping matches of pattern in string, as a list of strings." What do you mean by "do it with a list"?
The point is that I take long texts from a text file; sometimes it has 5 pages, sometimes 60. I store each page as one element of a list, e.g. the whole first page of the file is list[0], the whole second page is list[1], etc. I want this regex to work across whole pages: if I take the text from 5 pages, I have list[0] to list[4], and I want the regex to extract data that spans list elements too. For example, if BEGIN is in list[1] and END is in list[3], I want to pull out everything from BEGIN in list[1], through all of list[2], up to END in list[3].
@DeepSea You can run the regex on each item in the list (items 0-4) to get the matches per list item. If you also want to run it over all the list items (so all the pages), you can first merge all the items into a single string and then run the regex again.
Then you can take the nth occurrence from the result list.
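
The merge-then-match idea from the comment above could look roughly like this. It is only a sketch: the pages list is made-up sample data, and it assumes every page string already ends with a newline (if yours do not, "\n".join(pages) may be the better choice). The pattern is the one from the answer.

```python
import re

# Hypothetical pages: each list item holds the full text of one page.
pages = [
    "Some_text Some_text 1 2 3\n",
    "BEGIN Some_text Some_text\n44 76 1321\n",
    "Some_text Some_text\n",
    "END Some_text\nSome_Text Some_text 1 4 5\n",
]

# Same pattern as in the answer above.
regex = re.compile(
    r"^BEGIN\b(.*(?:\r?\n(?!(?:BEGIN|END)\b).*)*)\r?\nEND",
    re.MULTILINE,
)

# Merge the pages so a block may start on one page and end on another.
full_text = "".join(pages)

# One string per BEGIN/END block found in the merged text.
blocks = regex.findall(full_text)

# Turn the first block back into a list of lines, if there is one.
first_block_lines = blocks[0].splitlines() if blocks else []
print(first_block_lines)
```

Here BEGIN sits in pages[1] and END in pages[3], and the block is still found because the search runs on the merged string. blocks[n] then gives the nth block.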
0

This works:

txt='''\
Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5'''

import re

print(re.sub(r'(?=BEGIN )(.*END)',r'<!-- START -->\n\1 <!-- END -->',txt,flags=re.S))

Or,

print(re.sub(r'(?=^BEGIN )([\s\S]*END)',r'<!-- START -->\n\1 <!-- END -->',txt, flags=re.M))

Either prints:

Some_text Some_text 1 2 3
<!-- START -->
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END <!-- END --> Some_text
Some_Text Some_text 1 4 5
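
One likely reason a greedy .* with DOTALL (as in the question, and in the re.sub calls above) misbehaves on longer input is that it runs to the last END in the text, not the nearest one. The sketch below uses made-up input with two blocks to show the difference between .* and the lazy .*?; whether this is the actual cause in the asker's data is an assumption.

```python
import re

# Hypothetical input with two BEGIN/END blocks.
text = (
    "header\n"
    "BEGIN first\n"
    "1 2 3\n"
    "END\n"
    "middle\n"
    "BEGIN second\n"
    "4 5 6\n"
    "END\n"
    "footer"
)

# Greedy: .* extends to the LAST "\nEND" in the text.
greedy = re.search(r"BEGIN\s+(.*)\nEND", text, re.DOTALL).group(1)

# Lazy: .*? stops at the FIRST "\nEND" after BEGIN.
lazy = re.search(r"BEGIN\s+(.*?)\nEND", text, re.DOTALL).group(1)

print(greedy.split("\n"))
print(lazy.split("\n"))
```

With a single block both variants give the same result, which would explain why a short sample works while a long document with several blocks does not.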

1 Comment

Sorry, I worded it a little wrong. I added the sample code to the question for clarification.
0

This uses a non-greedy pattern to match everything between the beginning marker and the end marker; the lookbehind and lookahead keep the markers themselves out of the match. The \b anchors in the pattern make sure that BEGIN and END aren't part of a longer word, e.g. "BEGIN" won't match "BEGINS" or "BEGINNING". Note: it may not work properly for input with mismatched markers, such as "a b c BEGIN d e BEGIN 1 2 END 3" (two BEGINs).

import re

txt='''\
Some_text Some_text 1 2 3
BEGIN Some_text Some_text
44 76 1321
Some_text Some_text
END Some_text
Some_Text Some_text 1 4 5'''

begin = 'BEGIN'
end = 'END'

regex = re.compile(rf"(?<=\b{begin}\b)(.*?)(?=\b{end}\b)", flags=re.DOTALL)

match = regex.search(txt)

if match:
    print(match[1])
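
If the text contains several BEGIN ... END groups, the same pattern can be applied with finditer instead of search. This is a sketch on made-up input with two groups; the strip() call is an extra step, not part of the answer, added to drop the whitespace left next to the markers.

```python
import re

# Hypothetical input with two groups.
txt = '''\
a b c
BEGIN first 1 2
END x y
BEGIN second 3 4
END z'''

begin = 'BEGIN'
end = 'END'

# Same pattern as above.
regex = re.compile(rf"(?<=\b{begin}\b)(.*?)(?=\b{end}\b)", flags=re.DOTALL)

# One entry per BEGIN/END pair, in order of appearance.
groups = [m[1].strip() for m in regex.finditer(txt)]

print(groups)
print(groups[1])  # pick a specific group by index
```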

3 Comments

<!-- START --> and <!-- END --> are just comments marking the start and end of the group. I want to get only the text between BEGIN and END.
@DeepSea, I misunderstood what you wanted. I think the revised answer is what you are after.
How do I do it on the list? In my case it extracts a little better, but it still breaks when there are, for example, two blank lines or other additional whitespace. It seems to me that it is best to work on the list and not convert it to a string, because that is how it breaks. I also don't know how to display specific groups if there are more BEGIN <-> END groups.
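
For working on a list directly, as asked in the comment above, one option is to skip the regex and scan the lines. This is a minimal sketch, not something proposed in the answers: extract_groups is a hypothetical helper, and it assumes the list holds one line per item and that a marker counts only when it is the first word on its line.

```python
def extract_groups(lines, begin="BEGIN", end="END"):
    """Return one list of lines per begin/end pair found in `lines`."""
    groups = []
    current = None  # None means we are currently outside a group
    for line in lines:
        words = line.split(None, 1)  # first word, then the rest of the line
        first = words[0] if words else ""
        if current is None:
            if first == begin:
                # Keep whatever follows the marker on the same line.
                current = [words[1]] if len(words) > 1 else []
        elif first == end:
            groups.append(current)
            current = None
        else:
            # Blank lines and extra whitespace are kept as they are.
            current.append(line)
    return groups


# Hypothetical input: the sample text as a list of lines, with a blank line added.
lines = [
    "Some_text Some_text 1 2 3",
    "BEGIN Some_text Some_text",
    "44 76 1321",
    "",
    "Some_text Some_text",
    "END Some_text",
    "Some_Text Some_text 1 4 5",
]

result = extract_groups(lines)
print(result)
```

Because each group is its own list, result[n] gives the nth BEGIN <-> END group. An unclosed BEGIN at the end of the input is silently dropped in this sketch.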
