0

I have a text like this

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

I want to extract the Literature section, but I am having trouble matching it.

I am using

re.search("(1) Literature\n\n(.*).\n\n", s).group(1)

but re.search returns None.

The desired output is

(1) Literature

1. a.
2. b.
3. c.

 

What did I do wrong?

2
  • 2
    What is your desired output? Commented Jul 14, 2021 at 15:33
  • 3
    Probably you need r'\(1\)\s+Literature\s+((?:.+\n)+)' Commented Jul 14, 2021 at 15:36

4 Answers

2

You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.

\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)

The pattern matches:

  • \(1\) Literature\n\n Match (1) Literature and 2 newlines
  • ( Capture group 1
    • (?: Non capture group
      • \d+\..*(?:\n|$) Match 1+ digits, a dot and the rest of the line, followed by either a newline or the end of the string
    • )+ Close non capture group and repeat it 1 or more times to match all the lines
  • ) Close group 1
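As a minimal sketch of how this pattern could be used from Python, assuming the sample text from the question is stored in `s` (the names `pattern`, `section` and `items` are only illustrative):

```python
import re

# Sample text from the question.
s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

# Heading, a blank line, then every consecutive line starting with "<digits>."
pattern = re.compile(r"\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)")

m = pattern.search(s)
if m is None:
    print("no Literature section found")
else:
    section = m.group(0).rstrip("\n")  # heading plus the numbered lines
    items = m.group(1).rstrip("\n")    # only the numbered lines
    print(section)
```

Checking for `None` before calling `.group()` avoids an `AttributeError` when the text has no such section.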


Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespace.

\(1\) Literature((?:\n(?!\(\d+\)).*)*)
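A small sketch of the lookahead variant in Python follows. The sample text here is an assumption: it adds a hypothetical "(2) History" heading after the list, since the lookahead only stops at a line that starts with a parenthesised number.

```python
import re

# Hypothetical sample: the "(2) History" section is invented for illustration.
text = (
    "intro\n\n"
    "(1) Literature\n\n"
    "1. a.\n2. b.\n3. c.\n\n"
    "(2) History\n\n"
    "1. d.\n"
)

# Capture every following line that does not start with "(<digits>)".
pattern = re.compile(r"\(1\) Literature((?:\n(?!\(\d+\)).*)*)")

m = pattern.search(text)
body = m.group(1).strip() if m else ""  # trim the surrounding blank lines
print(body)
```

If no later heading exists, this pattern runs to the end of the text, so it fits best when the sections are delimited by headings of the same shape.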



1

Parentheses have a special meaning in regex. They are used to group matches.

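(1) - Matches the character 1 and captures it as the first capturing group.

Since the string contains literal parentheses, which the unescaped pattern does not match, the match fails. In addition, .* stops at the end of the line, because . does not match a newline by default.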
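The two points above can be checked with a short sketch, assuming the sample text from the question is stored in `s`. The variable names are illustrative, and `re.escape` is used here only as one convenient way to escape a literal heading:

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

# The pattern from the question: "(1)" is a group that matches just "1".
original = "(1) Literature\n\n(.*).\n\n"
original_match = re.search(original, s)

# Against text without literal parentheses, the unescaped group does match.
group_demo = re.search("(1) Literature", "1 Literature")

# re.escape() backslash-escapes the metacharacters of a literal string.
heading = re.escape("(1) Literature")
heading_match = re.search(heading, s)

# "." does not match "\n" by default, so ".*" captures a single line only.
one_line = re.search(heading + r"\n\n(.*)", s)

print(original_match)          # None
print(group_demo.group(1))     # 1
print(heading_match.group(0))  # (1) Literature
print(one_line.group(1))       # 1. a.
```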

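Based on your regex, I assumed you wanted to capture the line with the word Literature and the lines below it. Here is a regex to do so; each of the 5 repetitions of .*\n consumes text up to and including one newline, starting with the newline that ends the heading line.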

\(1\) Literature(.*\n){5}

Note the escape characters used on the parentheses around 1.

EDIT

Based on zr0gravity7's comment, I came up with this regex to capture the middle section of the string.

\(1\)\sLiterature\n+((.*\n){3})

This regex will capture the string below in capturing group 1.

1. a.
2. b.
3. c.
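A quick sketch of this fixed-count pattern in Python, assuming the sample text from the question is stored in `s`. The count 3 is hard-coded, so a section with a different number of items would need a different number:

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

m = re.search(r"\(1\)\sLiterature\n+((.*\n){3})", s)
if m:
    print(repr(m.group(1)))  # '1. a.\n2. b.\n3. c.\n'
    # The inner group is repeated, so it keeps only its last repetition.
    print(repr(m.group(2)))  # '3. c.\n'
```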

1 Comment

Most likely they want to capture the Literature part in one group and the choices in another; they do not want to capture the newlines.
1

You may use this regex with a capture group:

r'\(1\)\s+Literature\s+((?:.+\n)+)'

Explanation:

  • \(1\): Match (1) text
  • \s+: Match 1+ whitespaces
  • Literature: Match the literal text Literature
  • \s+: Match 1+ whitespaces (here, the newlines after the heading)
  • (: Start capture group #1
    • (?:.+\n)+: Match a line of 1+ characters followed by a newline. Repeat this 1 or more times to allow it to match multiple such lines
  • ): End capture group #1
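For illustration, a minimal Python sketch of this pattern, assuming the sample text from the question is stored in `s`. Splitting the captured block into a list named `items` is an extra step added here, not part of the pattern itself:

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

m = re.search(r'\(1\)\s+Literature\s+((?:.+\n)+)', s)

# .+ needs at least one character, so the capture stops at the first blank line.
items = m.group(1).splitlines() if m else []
print(items)  # ['1. a.', '2. b.', '3. c.']
```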

0

Regex for capturing the generic question with that structure:

\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)

It will capture the title "Literature", then the choices in another group (for a total of 2 groups).

A repeated capture group keeps only its last match, so in order to get each of your "1. a." items separately you would have to match the second group from above again, with this pattern:

(\d+\.\s+.+) applied globally (for example with re.findall) to get all items.
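The two-step approach could be sketched in Python as follows, assuming the sample text from the question is stored in `s`. The helper name `parse_sections` and the constants `SECTION` and `ITEM` are hypothetical, and the item pattern is a balanced-parentheses form of the one described above:

```python
import re

s = """
...

(1) Literature

1. a.
2. b.
3. c.

...
"""

# Step 1: one match per section, capturing the title and the block of items.
SECTION = re.compile(r"\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)")
# Step 2: applied globally to the captured block, one match per item.
ITEM = re.compile(r"\d+\.\s+.+")


def parse_sections(text):
    """Map each section title to the list of its numbered items."""
    return {m.group(1): ITEM.findall(m.group(2)) for m in SECTION.finditer(text)}


print(parse_sections(s))  # {'Literature': ['1. a.', '2. b.', '3. c.']}
```

Note that `\w+` matches a single-word title only; multi-word titles would need a different sub-pattern.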
