regex matching whitespace characters and multiple optional patterns before start of text

Question

I am parsing strings containing code like the following. A string can start with empty lines followed by any number of optional patterns. These patterns are either Python-style comments (starting with a hash character, #) or the command "!mycommand", and both must start at the beginning of a line. How can I write a regex that matches everything up to the start of the code?

mystring = """

# catch this comment
!mycommand
# catch this comment
#catch this comment too
!mycommand

# catch this comment
!mycommand
!mycommand

some code. match until the previous line
# do not catch this comment
!mycommand
# do not catch this comment
"""

import re
pattern = r'^\s*^#.*|!mycommand\s*'
m = re.search(pattern, mystring, re.MULTILINE)
mystring[m.start():m.end()]

mystring = 'code. do not match anything' + mystring
m = re.search(pattern, mystring, re.MULTILINE)

I want the regex to match the string up to "some code. match until the previous line". I have tried different things, but I am probably stuck on how to repeat the two alternative patterns.

The fourth bird · Accepted Answer · 2020-07-02 14:02:38Z

2

You do not need re.MULTILINE: you can repeatedly match zero or more whitespace characters before each comment or command, and again after the last one:

^(?:\s*(?:#.*|!mycommand\s*))+\s*


For example

import re
m = re.search(r'^(?:\s*(?:#.*|!mycommand\s*))+\s*', mystring)
print(m.group())
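
As a quick check of the pattern above, the sketch below runs it against a shortened stand-in for the question's string and against the same string with code prepended. The SAMPLE string and the leading_header helper are illustrative assumptions, not part of the original answer. Because re.search returns None when nothing matches, the helper guards against that before calling .group().

```python
import re

# Pattern from the answer above; no re.MULTILINE, so ^ only matches at position 0.
PATTERN = re.compile(r'^(?:\s*(?:#.*|!mycommand\s*))+\s*')

# Shortened stand-in for the question's string (assumed for illustration).
SAMPLE = "\n\n# comment\n!mycommand\n#another\n\nsome code\n# keep this\n!mycommand\n"


def leading_header(text):
    """Return the leading block of blank lines, comments and commands, or ''."""
    m = PATTERN.search(text)
    return m.group() if m else ""


header = leading_header(SAMPLE)
rest = SAMPLE[len(header):]          # the code part that follows the header
print(repr(header))
print(repr(rest))

# A string that begins with code yields no header at all.
print(repr(leading_header("code first" + SAMPLE)))
```

Slicing with len(header) is one way to get at the code part; re.sub with count=1 on the same pattern should give the same result.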

edited Jul 2, 2020 at 14:02

answered Jul 2, 2020 at 13:56

The fourth bird


Sweeper · Accepted Answer · 2020-07-02 13:45:04Z

1

Your pattern matches one instance of # ... or !mycommand. One way to solve this problem is to put all of them into one match, and use re.search to find the first match.

To do this, you need to repeat the part that matches # ... or !mycommand using *:

^\s*^(?:#.*\s*|!mycommand\s*)*

I have also changed #.* to #.*\s* so that it goes all the way to the next line where a non-whitespace is found.


Responding to your comment:

if the string begins with code, this regex should not match anything

You can try:

\A\s*^(?:#.*\s*|!mycommand\s*)+

I changed the leading ^ to \A so that the match can only begin at the absolute start of the string rather than at the start of any line. I also changed the last * to + so that at least one # ... or !mycommand has to be present.
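
A small sketch of how this pattern behaves, assuming it is compiled with re.MULTILINE as in the question's code. The SAMPLE string is a shortened stand-in chosen for illustration. The inner ^ has to match after the leading blank lines, which is why the flag matters here.

```python
import re

RAW = r'\A\s*^(?:#.*\s*|!mycommand\s*)+'
PATTERN = re.compile(RAW, re.MULTILINE)

# Shortened stand-in for the question's string (assumed for illustration).
SAMPLE = "\n\n# comment\n!mycommand\n\nsome code\n# keep this\n"

m = PATTERN.search(SAMPLE)
matched = m.group() if m else ""
print(repr(matched))   # leading blank lines, comment, command and trailing blank line

# \A anchors the match to the very start, so code at the start means no match.
no_match = PATTERN.search("code first" + SAMPLE)
print(no_match)

# Without re.MULTILINE the inner ^ cannot match after the leading newlines.
no_flag = re.search(RAW, SAMPLE)
print(no_flag)
```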

edited Jul 2, 2020 at 13:45

answered Jul 2, 2020 at 13:27

Sweeper


7 Comments

aless80 Over a year ago

two small problems: 1) if the string begins with code, this regex should not match anything; 2) the newline character before the code should be included in the match

Sweeper Over a year ago

When you say "begins with code", is the empty line at the start still there? The newline character before the code is included in the match. @aless80

aless80 Over a year ago

yes, you are right about point 1 and it is what i need. now let me fix point two. adding a \s* does not help

Sweeper Over a year ago

@aless80 See the edit. Did I understand correctly? If I misunderstood, and the answer to my previous comment was actually "no", change the first * to +, rather than the last *.

aless80 Over a year ago

The edit on the absolute start \A is a good one, but I do not want the last +. At that point what is missing is to catch any newline/space characters just before the code line, which in my example is "some code. match until the previous line"


Wiktor Stribiżew · Accepted Answer · 2020-07-02 15:29:16Z

1

Matching and returning the comments at the start of the string

No regex is needed: read the lines one by one, skip blank lines, and append each line to a list until you reach a line that does not start with ! or #:

mystring = "YOUR_STRING_HERE"

results = []
for line in mystring.splitlines():
  if not line.strip():                                      # Skip blank lines
    continue
  if not line.startswith('#') and not line.startswith('!'): # Reject if does not start with ! or #
    break
  else:
    results.append(line)                                    # Append comment

print(results)

Results:

['# catch this comment', '!mycommand', '# catch this comment', '#catch this comment too', '!mycommand', '# catch this comment', '!mycommand', '!mycommand']

Removing the comments at the start of the string

results = []
flag = False
for line in mystring.splitlines():
  if not flag and not line.strip():
    continue
  if not flag and not line.startswith('#') and not line.startswith('!'):
    flag = True
  if flag:
    results.append(line)

print("\n".join(results))

Output:

some code. match until the previous line
# do not catch this comment
!mycommand
# do not catch this comment
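
A variant of the loop above, sketched under the assumption that the original line endings should be preserved: it sums the lengths of the header lines and slices the string once, stopping at the first code line. The strip_header name and the SAMPLE string are hypothetical. Note that splitlines still splits the whole string up front, so only the per-line checks stop early.

```python
def strip_header(text):
    """Drop leading blank lines and lines starting with # or ! from text."""
    offset = 0
    # keepends=True keeps the line terminators, so the lengths add up exactly.
    for line in text.splitlines(keepends=True):
        is_blank = not line.strip()
        if not is_blank and not line.startswith(('#', '!')):
            break                      # first real code line: stop scanning
        offset += len(line)
    return text[offset:]


# Shortened stand-in for the question's string (assumed for illustration).
SAMPLE = "\n\n# comment\n!mycommand\n\nsome code\n# keep this\n!mycommand\n"

code_part = strip_header(SAMPLE)
print(code_part)
```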


Regex approach

import re
print(re.sub(r'^(?:(?:[!#].*)?\n)+', '', mystring))

If lines may start with optional indentation, add [^\S\n]* at the start of the repeated group:

print(re.sub(r'^(?:[^\S\n]*(?:[!#].*)?\n)+', '', mystring, count=1))

count=1 makes sure only the first match is removed (there is no need to check all the other lines).

Regex details

^ - start of string
(?:[^\S\n]*(?:[!#].*)?\n)+ - 1 or more occurrences of
- [^\S\n]* - zero or more horizontal whitespace chars (any whitespace except a newline)
- (?:[!#].*)? - an optional sequence of
  - [!#] - ! or #
  - .* - the rest of the line
- \n - a newline char.
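
The breakdown above can also be written inline with re.VERBOSE, as in the sketch below. The HEADER name and the SAMPLE string are assumptions made for illustration. Inside a character class, # stays literal even in verbose mode, so [!#] needs no escaping.

```python
import re

HEADER = re.compile(r"""
    ^                   # start of string (no re.MULTILINE)
    (?:
        [^\S\n]*        # zero or more horizontal whitespace chars
        (?: [!#] .* )?  # optionally: ! or # followed by the rest of the line
        \n              # the newline that ends the line
    )+
""", re.VERBOSE)

# Stand-in string with an indented comment (assumed for illustration).
SAMPLE = "\n  # indented comment\n!mycommand\n\nsome code\n# keep this\n"

code_part = HEADER.sub('', SAMPLE, count=1)
print(code_part)

# A string that begins with code is returned unchanged.
unchanged = HEADER.sub('', "code\n# c\n", count=1)
print(unchanged)
```

Two things to keep in mind with this pattern: [!#].* accepts any line starting with !, not only !mycommand, and every header line must end in \n, so a header line at the very end of the string without a trailing newline is not consumed.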

edited Jul 2, 2020 at 15:29

answered Jul 2, 2020 at 13:29

Wiktor Stribiżew


6 Comments

aless80 Over a year ago

indeed it is probably a good idea to avoid regex. Let me think about it

aless80 Over a year ago

I did not write that in my question, but my end goal is actually to get the code part of my string. In other words, I want to remove the initial newlines, comments, and !mycommand lines. I edited your code so that result looks like mystring (with \n or spaces), then I "subtract" result from mystring. However, I am not sure how robust this method will be.

Wiktor Stribiżew Over a year ago

@aless80 So do you mean you need to remove all up to some code. match until the previous line?

Wiktor Stribiżew Over a year ago

@aless80 You may still get it without a regex, but it is up to you to choose the path.

aless80 Over a year ago

The downside of your 2nd approach is that the for loop has to go through all the lines. In my approach I get the matching in a way similar to your first approach, then do mystring[len(result):]. However I think I am opting for the regex approach (hopefully more "reliable")
