Ignore an optional word if present in a string - regular expression in python

Question

I'm trying to match a string with regular expression using Python, but ignore an optional word if it's present.

For example, I have the following lines:

First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]

I'm looking to capture everything before [Ignore This Part]. Notice I also want to exclude the whitespace before [Ignore This Part]. Therefore my results should look like this:

First string
Second string
Third string (1)

I have tried the following regular expression with no luck, because it still captures [Ignore This Part]:

.+(?:\s\[.+\])?

Any assistance would be appreciated.

I'm using Python 3.8 on Windows 10.

Edit: The examples are meant to be processed one line at a time.
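
For reference, the behaviour described above can be reproduced with a short sketch that applies the attempted pattern to each line separately. The variable names are illustrative, and using re.match is only one assumed way of applying the pattern.

```python
import re

# The pattern from the question, applied to one line at a time.
attempt = re.compile(r".+(?:\s\[.+\])?")

lines = [
    "First string",
    "Second string [Ignore This Part]",
    "Third string (1) [Ignore This Part]",
]

# The greedy .+ consumes the whole line first, so the optional group
# matches nothing and the bracketed part stays in the result.
matched = [attempt.match(line).group() for line in lines]

for text in matched:
    print(text)
```

Every match equals the full input line, which is why [Ignore This Part] is still captured.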

Does "processed one line at a time" refer to successive matches against a single string (such as with re.findall), or is the string broken up so that each line is a separate item in a list, yielded by a generator, or similar?
– outis, Commented Sep 24, 2022 at 4:29
What are the criteria for what is matched and what is ignored? Is it merely that a trailing square-bracketed phrase, if present, is ignored? What if there are multiple square-bracketed sections? What if there's an unclosed open square bracket?
– outis, Commented Sep 24, 2022 at 4:30

Barmar · Accepted Answer · 2022-07-26 18:02:02Z

2

Use [^[\n] instead of . so the match stops at the first square bracket and doesn't cross newlines.

^[^[\n]+(?:\s\[.+\])?
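
A minimal per-line sketch of this idea, under some assumptions: it uses only the leading character class, and because that class also matches the space before the bracket, it trims trailing whitespace with str.rstrip(). The rstrip() call and the helper name strip_suffix are additions for illustration, not part of the answer.

```python
import re

# The character class suggested above: anything except "[" or a newline.
head = re.compile(r"[^[\n]+")


def strip_suffix(line):
    """Return the text before the first "[", without trailing whitespace.

    Hypothetical helper: rstrip() is added here because the character
    class also consumes the space that precedes the bracket.
    """
    m = head.match(line)
    return m.group().rstrip() if m else line


lines = [
    "First string",
    "Second string [Ignore This Part]",
    "Third string (1) [Ignore This Part]",
]

results = [strip_suffix(line) for line in lines]
print(results)
```

Note that this cuts at the first opening bracket anywhere in the line, so it is only suitable when the text to keep never contains "[".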

edited Jul 26, 2022 at 18:02

answered Jul 26, 2022 at 17:49

Barmar

1 Comment

howdoicode Over a year ago

This didn't work for me. I tried in Python and on https://regex101.com/r/yh58jO/1

The fourth bird · Accepted Answer · 2022-07-26 19:01:19Z

2

Perhaps you can remove the part that you don't want to match:

[^\S\n]*\[[^][\n]*]$

Explanation

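[^\S\n]* Match optional whitespace characters other than a newline
\[[^][\n]*] Match from [ to ] without crossing another bracket or a newline
$ End of string (or end of each line when re.M is used)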

Example

import re

pattern = r"[^\S\n]*\[[^][\n]*]$"

s = ("First string\n"
     "Second string [Ignore This Part]\n"
     "Third string (1) [Ignore This Part]")

# re.M makes $ match at the end of every line, not only at the end of the string.
result = re.sub(pattern, "", s, flags=re.M)

if result:
    print(result)

Output

First string
Second string
Third string (1)

If you don't want to be left with an empty string when a line consists only of the bracketed part, you can assert a non-whitespace character to the left:

(?<=\S)[^\S\n]*\[[^][\n]*]$
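
The difference between the two patterns can be seen in a small sketch that processes single lines. The variable names and the sample line containing only a bracketed part are assumptions made for illustration.

```python
import re

# Pattern without and with the lookbehind assertion.
plain = re.compile(r"[^\S\n]*\[[^][\n]*]$")
guarded = re.compile(r"(?<=\S)[^\S\n]*\[[^][\n]*]$")

only_brackets = "[Ignore This Part]"
normal = "Second string [Ignore This Part]"

# Without the lookbehind, a line that is only the bracketed part becomes empty.
plain_only = plain.sub("", only_brackets)

# With the lookbehind, there is no non-whitespace character to the left,
# so nothing matches and the line is left unchanged.
guarded_only = guarded.sub("", only_brackets)

# Both patterns behave the same when other text precedes the brackets.
plain_normal = plain.sub("", normal)
guarded_normal = guarded.sub("", normal)

print(repr(plain_only), repr(guarded_only))
print(repr(plain_normal), repr(guarded_normal))
```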

answered Jul 26, 2022 at 19:01

The fourth bird

2 Comments

howdoicode Over a year ago

This didn't work for me. Each line is meant to be processed one at a time, not as a whole. There is no \n character.

The fourth bird Over a year ago

@howdoicode What exactly did not work? It removes the part from the end of the line. The [^\n] means that it does not match a newline. If it always processes 1 line, then you could write it as \s*\[[^][]*]$ See regex101.com/r/qwcqM3/1 If there can be multiple occurrences, then you can remove the anchor regex101.com/r/CUcqi3/1

RavinderSingh13 · Accepted Answer · 2022-07-26 19:15:22Z

2

With your shown samples, please try the following code, written and tested in Python 3.

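import re

var = """First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]"""

# Split on the optional bracketed suffix; the capturing group keeps the wanted text.
parts = re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])', var)
# Strip each piece and drop the pieces that end up empty.
result = [x for x in map(lambda x: x.strip(), parts) if x]
print(result)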

The output is a list, which can be accessed as required:

['First string', 'Second string', 'Third string (1)']

Here is a detailed explanation of the Python 3 code above:

First, re.split is called with the regex (.*?)(?:$|\s\[[^]]*\]) and the multiline flag (?m) enabled. The full call is: re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])', var)
Its output is passed through a lambda that calls strip on each item, so items that contain only newlines become empty strings.
map applies that lambda to every item of the split result.
Finally, the list comprehension drops the empty items, leaving only the parts the OP asked for.
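
The steps above can also be written out one at a time, which makes the intermediate values easy to inspect. This is only a sketch: it replaces map and the lambda with list comprehensions that should behave the same way, and the names pieces, stripped and cleaned are illustrative.

```python
import re

var = """First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]"""

# Step 1: split. Because the pattern has a capturing group, the text
# matched by (.*?) is kept in the resulting list.
pieces = re.split(r"(?m)(.*?)(?:$|\s\[[^]]*\])", var)

# Step 2: strip surrounding whitespace, which turns newline-only
# pieces into empty strings.
stripped = [piece.strip() for piece in pieces]

# Step 3: drop the empty strings.
cleaned = [piece for piece in stripped if piece]

print(pieces)
print(cleaned)
```

Printing pieces shows the empty strings and newline items that steps 2 and 3 remove.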

edited Jul 26, 2022 at 19:15

answered Jul 26, 2022 at 18:38

RavinderSingh13

4 Comments

howdoicode Over a year ago

This didn't work for me. Each line is meant to be processed one at a time, not as a whole. There is no \n character.

RavinderSingh13 Over a year ago

@howdoicode, ok if there are no new lines then how are lines separated? Or you are running this over another function etc?

howdoicode Over a year ago

It's supposed to just check each line one at a time. Could be in a loop, reading one line at a time, for example.

RavinderSingh13 Over a year ago

@howdoicode, ok not sure about your complete backend functionality but if you put this then also it should work, don't go with new lines thing here, please try it out once in your loop etc and let me know how it goes

anubhava · Accepted Answer · 2022-07-26 19:01:45Z

1

You may use this regex:

^.+?(?=$|\s*\[[^]]*]$)

If you want a better-performing regex, I suggest:

^\S+(?:\s+\S+)*?(?=$|\s*\[[^]]*]$)

RegEx Details:

^: Start
.+?: Match 1+ of any characters (lazy match)
(?=: Start lookahead
- $: End
- |: OR
- \s*: Match 0 or more whitespaces
- \[[^]]*]: Match [...] text
- $: End
): Close lookahead
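
A minimal sketch of using both patterns from this answer on one line at a time, as the question requires. The helper name extract and the fallback of returning the line unchanged when nothing matches are assumptions for illustration.

```python
import re

# The two patterns given above.
first = re.compile(r"^.+?(?=$|\s*\[[^]]*]$)")
second = re.compile(r"^\S+(?:\s+\S+)*?(?=$|\s*\[[^]]*]$)")


def extract(pattern, line):
    """Return the matched prefix, or the line itself if nothing matches.

    Hypothetical helper; the fallback behaviour is an assumption.
    """
    m = pattern.search(line)
    return m.group() if m else line


lines = [
    "First string",
    "Second string [Ignore This Part]",
    "Third string (1) [Ignore This Part]",
]

first_results = [extract(first, line) for line in lines]
second_results = [extract(second, line) for line in lines]

print(first_results)
print(second_results)
```

Because the lookahead only checks for the trailing whitespace and bracketed text without consuming them, the match itself ends before the space.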

edited Jul 26, 2022 at 19:01

The fourth bird

answered Jul 26, 2022 at 18:06

anubhava
