1

I'm trying to match a string with a regular expression in Python, but ignore an optional trailing phrase if it's present.

For example, I have the following lines:

First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]

I'm looking to capture everything before [Ignore This Part]. Notice I also want to exclude the whitespace before [Ignore This Part]. Therefore my results should look like this:

First string
Second string
Third string (1)

I have tried the following regular expression with no luck, because it still captures [Ignore This Part]:

.+(?:\s\[.+\])?
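
A quick check of the attempt above, as a minimal sketch: the greedy .+ appears to consume the whole line first, and the optional group is then satisfied by matching nothing, so the bracketed part stays in the match. The variable names here are illustrative only.

```python
import re

line = "Second string [Ignore This Part]"

# The greedy .+ runs to the end of the line; the optional group then matches empty.
m = re.match(r".+(?:\s\[.+\])?", line)

# Prints the full line, bracketed part included.
print(m.group(0))
```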

Any assistance would be appreciated.

I'm using Python 3.8 on Windows 10.

Edit: The examples are meant to be processed one line at a time.

2
  • Does "processed one line at a time" refer to successive matches against a single string (such as with re.findall), or is the string broken up so that each line is a separate item in a list, yielded by a generator, or similar? Commented Sep 24, 2022 at 4:29
  • What are the criteria for what is matched and what is ignored? Is it merely that a trailing square-bracketed phrase, if present, is ignored? What if there are multiple square-bracketed sections? What if there's an unclosed open square bracket? Commented Sep 24, 2022 at 4:30

4 Answers

2

Use [^[\n] instead of . so the match stops at the first opening square bracket and doesn't run across newlines.

^[^[\n]+(?:\s\[.+\])?
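
One way to apply the character-class idea to a single line might look like the sketch below. It uses only the [^[\n]+ part and strips trailing whitespace afterwards, since that class also matches the space in front of the bracket. The function name keep_before_bracket is a hypothetical helper, not part of the answer.

```python
import re

def keep_before_bracket(line):
    """Return the text before the first '[' with trailing whitespace removed."""
    # [^[\n]+ matches everything up to the first '[' or newline.
    m = re.match(r"[^[\n]+", line)
    # The class also matches the space before '[', so strip it off here.
    return m.group(0).rstrip() if m else ""

for line in ["First string",
             "Second string [Ignore This Part]",
             "Third string (1) [Ignore This Part]"]:
    print(keep_before_bracket(line))
```

Note that this stops at any opening square bracket, not only a trailing one, which may or may not suit the real data.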



1 Comment

This didn't work for me. I tried in Python and on https://regex101.com/r/yh58jO/1
2

Perhaps you can remove the part that you don't want to match:

[^\S\n]*\[[^][\n]*]$

Explanation

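  • [^\S\n]* Match optional whitespace characters other than a newline
  • \[[^][\n]*] Match from [ to the closing ], with no square brackets or newlines in between
  • $ End of string (or end of each line when the multiline flag is set)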


Example

import re

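# Optional horizontal whitespace, then a trailing [...] at the end of a line
pattern = r"[^\S\n]*\[[^][\n]*]$"

s = ("First string\n"
     "Second string [Ignore This Part]\n"
     "Third string (1) [Ignore This Part]")

# re.M lets $ match at the end of every line, not only at the end of the string
result = re.sub(pattern, "", s, flags=re.M)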

if result:
    print(result)

Output

First string
Second string
Third string (1)

If you don't want to be left with an empty string, you can assert a non-whitespace character to the left:

(?<=\S)[^\S\n]*\[[^][\n]*]$
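
For input that arrives one line at a time, as the edit to the question describes, the lookbehind variant could be applied per line roughly as follows. This is a sketch under the assumption that each line contains no newline, so the \n exclusions are dropped; the names TRAILING_BRACKET and strip_trailing_bracket are illustrative.

```python
import re

# Assumes single-line input: a non-whitespace character must precede the
# optional whitespace and the trailing [...] that get removed.
TRAILING_BRACKET = re.compile(r"(?<=\S)\s*\[[^][]*]$")

def strip_trailing_bracket(line):
    """Remove a trailing [...] and the whitespace before it, if present."""
    return TRAILING_BRACKET.sub("", line)

for line in ["First string",
             "Second string [Ignore This Part]",
             "Third string (1) [Ignore This Part]"]:
    print(strip_trailing_bracket(line))
```

A line that consists only of a bracketed phrase is left unchanged, because the lookbehind finds no non-whitespace character in front of it.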


2 Comments

This didn't work for me. The lines are meant to be processed one at a time, not as a whole. There is no \n character.
@howdoicode What exactly did not work? It removes the part from the end of the line. The \n in the negated character classes means that they do not match a newline. If it always processes one line, then you could write it as \s*\[[^][]*]$ (see regex101.com/r/qwcqM3/1). If there can be multiple occurrences, then you can remove the anchor (see regex101.com/r/CUcqi3/1).
2

With your shown samples, please try the following code, written and tested in Python 3.

import re
var="""First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]"""

[x for x in list(map(lambda x:x.strip(),re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var))) if x]

The output is the following list, whose items can be accessed as needed.

['First string', 'Second string', 'Third string (1)']

Here is a detailed explanation of the above Python 3 code:

  • First, re.split is called with the regex (.*?)(?:$|\s\[[^]]*\]) and the inline multiline flag (?m), so that $ matches at the end of each line. The full call is: re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var)
  • Each resulting piece is then passed through a lambda that calls strip to remove surrounding whitespace, including newlines.
  • map applies that lambda to every piece, and list turns the result into a list.
  • Finally, the list comprehension drops the empty strings, leaving only the parts the OP asked for.
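
The same steps can also be sketched for a single line at a time, which is how the OP describes the input. This assumes Python 3.7 or later, where re.split also splits on empty matches; the names SPLIT_RE and first_piece are hypothetical.

```python
import re

# Same pattern as above, without (?m) because each input is a single line.
SPLIT_RE = re.compile(r"(.*?)(?:$|\s\[[^]]*\])")

def first_piece(line):
    """Return the first non-empty piece produced by splitting one line."""
    # split returns the captured group plus empty strings between matches.
    pieces = [p.strip() for p in SPLIT_RE.split(line)]
    return next((p for p in pieces if p), "")

for line in ["First string",
             "Second string [Ignore This Part]",
             "Third string (1) [Ignore This Part]"]:
    print(first_piece(line))
```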

4 Comments

This didn't work for me. The lines are meant to be processed one at a time, not as a whole. There is no \n character.
@howdoicode, OK, if there are no newlines, then how are the lines separated? Or are you running this inside another function?
It's supposed to just check each line one at a time. It could be in a loop, reading one line at a time, for example.
@howdoicode, OK, I'm not sure about your complete backend, but this should work there too. Don't worry about the newlines here; please try it once in your loop and let me know how it goes.
1

You may use this regex:

^.+?(?=$|\s*\[[^]]*]$)


If you want a better-performing regex, then I suggest:

^\S+(?:\s+\S+)*?(?=$|\s*\[[^]]*]$)


RegEx Details:

  • ^: Start
  • .+?: Match 1+ of any characters (lazy match)
  • (?=: Start lookahead
    • $: End
    • |: OR
    • \s*: Match 0 or more whitespaces
    • \[[^]]*]: Match [...] text
    • $: End
  • ): Close lookahead
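
A minimal sketch of using the first regex on one line at a time is shown below. The lookahead only checks for the trailing [...] without consuming it, so the match itself is the part to keep. The names LEADING_TEXT, lines and results are illustrative assumptions.

```python
import re

# Lazy match from the start, stopping where either the end of the line or
# optional whitespace plus a trailing [...] follows.
LEADING_TEXT = re.compile(r"^.+?(?=$|\s*\[[^]]*]$)")

lines = ["First string",
         "Second string [Ignore This Part]",
         "Third string (1) [Ignore This Part]"]

results = []
for line in lines:
    m = LEADING_TEXT.search(line)
    # Fall back to the unchanged line if nothing matches (e.g. an empty line).
    results.append(m.group(0) if m else line)

print(results)
```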
