I am parsing strings containing code like the following. A string can start with empty lines followed by any number of optional patterns. These patterns can be either Python-style comments (starting with a hash # character) or the command "!mycommand", and both must start at the beginning of a line. How can I write a regex that matches everything up to the start of the code?

mystring = """

# catch this comment
!mycommand
# catch this comment
#catch this comment too
!mycommand

# catch this comment
!mycommand
!mycommand

some code. match until the previous line
# do not catch this comment
!mycommand
# do not catch this comment
"""

import re
pattern = r'^\s*^#.*|!mycommand\s*'
m = re.search(pattern, mystring, re.MULTILINE)
mystring[m.start():m.end()]

mystring = 'code. do not match anything' + mystring
m = re.search(pattern, mystring, re.MULTILINE)

I want the regex to match the string up to "some code. match until the previous line". I have tried different things, but I am probably stuck on repeating the two alternative patterns.

3 Answers

2

Without needing re.MULTILINE, you can repeatedly match zero or more whitespace characters before and after each comment or command:

^(?:\s*(?:#.*|!mycommand\s*))+\s*

For example

import re
m = re.search(r'^(?:\s*(?:#.*|!mycommand\s*))+\s*', mystring)
print(m.group())
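
As a quick check of the pattern above, the following sketch runs it against a shortened stand-in for the question's sample and against the same text with code prepended. The `sample` string and the variable names are illustrative assumptions, not part of the original question.

```python
import re

# Pattern from the answer above; no re.MULTILINE flag is passed.
PATTERN = re.compile(r'^(?:\s*(?:#.*|!mycommand\s*))+\s*')

# Shortened, hypothetical stand-in for the question's sample string.
sample = (
    "\n"
    "# catch this comment\n"
    "!mycommand\n"
    "\n"
    "some code\n"
    "# do not catch this comment\n"
)

m = PATTERN.search(sample)
header = m.group() if m else ""   # leading blank lines, comments, commands
code = sample[len(header):]       # everything from the first code line on

# When the text starts with code, the pattern finds nothing.
no_match = PATTERN.search("code first\n" + sample)
```

Slicing with `len(header)` is one way to get the code part that the asker mentions wanting in the comments further down.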

1

Your pattern matches one instance of # ... or !mycommand. One way to solve this problem is to put all of them into one match, and use re.search to find the first match.

To do this, you need to repeat the part that matches # ... or !mycommand using *:

^\s*^(?:#.*\s*|!mycommand\s*)*

I have also changed #.* to #.*\s* so that the match continues up to the next line containing a non-whitespace character.

Responding to your comment:

if the string begins with code, this regex should not match anything

You can try:

\A\s*^(?:#.*\s*|!mycommand\s*)+

I changed the leading ^ to \A so that the pattern only matches at the absolute start of the string instead of at the start of any line (the inner ^ still relies on re.MULTILINE). I also changed the last * to + so that at least one # ... or !mycommand has to be present.
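
A small sketch of how the \A variant behaves, assuming re.MULTILINE is passed as in the question's code. The two sample strings are made up for illustration.

```python
import re

# \A anchors to the absolute start; the inner ^ relies on re.MULTILINE
# so it can match right after the leading blank lines.
pattern = re.compile(r'\A\s*^(?:#.*\s*|!mycommand\s*)+', re.MULTILINE)

# Hypothetical inputs: one with a header, one that begins with code.
with_header = "\n\n# a comment\n!mycommand\n\nsome code\n# later comment\n"
starts_with_code = "some code\n# later comment\n!mycommand\n"

m = pattern.search(with_header)
matched = m.group() if m else ""

# No header at the very start, so search() returns None here.
none_expected = pattern.search(starts_with_code)
```

In this sketch the trailing \s* inside each alternative also consumes the blank line before the code, so the match ends right where the code begins.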

7 Comments

two small problems: 1) if the string begins with code, this regex should not match anything; 2) the newline character before the code should be included in the match
When you say "begins with code", is the empty line at the start still there? The newline character before the code is included in the match. @aless80
yes, you are right about point 1 and it is what i need. now let me fix point two. adding a \s* does not help
@aless80 See the edit. Did I understand correctly? If I misunderstood, and the answer to my previous comment were actually "no", change the first * to +, rather than the last *.
The edit on the absolute start \A is a good one, but I do not want the last +. At that point what is missing is to catch any newline/space characters just before the code line, which in my example is "some code. match until the previous line"
1

Matching and returning the comments at the start of the string

No regex is needed: read the lines and append them to a list until a line that does not start with ! or # occurs, ignoring all blank lines:

mystring = "YOUR_STRING_HERE"

results = []
for line in mystring.splitlines():
  if not line.strip():                                      # Skip blank lines
    continue
  if not line.startswith('#') and not line.startswith('!'): # Reject if does not start with ! or #
    break
  else:
    results.append(line)                                    # Append comment

print(results)

Results:

['# catch this comment', '!mycommand', '# catch this comment', '#catch this comment too', '!mycommand', '# catch this comment', '!mycommand', '!mycommand']

Removing the comments at the start of the string

results = []
flag = False
for line in mystring.splitlines():
  if not flag and not line.strip():
    continue
  if not flag and not line.startswith('#') and not line.startswith('!'):
    flag = True
  if flag:
    results.append(line)

print("\n".join(results))

Output:

some code. match until the previous line
# do not catch this comment
!mycommand
# do not catch this comment
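
If checking every line is a concern, one possible variant stops the per-line checks at the first code line and returns the rest in one slice. This is a sketch rather than the code above rewritten; `strip_header` is a hypothetical helper name, and like the loops above it treats any line starting with # or ! as part of the header. Note that `splitlines` still splits the whole string.

```python
def strip_header(text):
    """Return text without its leading blank, '#', and '!' lines."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        # Blank lines and lines starting with '#' or '!' belong to the header.
        if line.strip() and not line.startswith(('#', '!')):
            # First code line found: keep it and everything after it.
            return ''.join(lines[i:])
    # No code line at all: nothing remains.
    return ''


# Hypothetical sample for illustration.
sample = "\n# comment\n!mycommand\n\nsome code\n# kept comment\n!mycommand\n"
body = strip_header(sample)
```

Using `keepends=True` preserves the original line endings, so the result is an exact suffix of the input.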

Regex approach

import re
print(re.sub(r'^(?:(?:[!#].*)?\n)+', '', mystring))

If there are optional indenting spaces at the start of a line add [^\S\n]*:

print(re.sub(r'^(?:[^\S\n]*(?:[!#].*)?\n)+', '', mystring, count=1))

count=1 makes sure only the first match is removed, so there is no need to check any other lines.

Regex details

  • ^ - start of string
  • (?:[^\S\n]*(?:[!#].*)?\n)+ - 1 or more occurrences of
    • [^\S\n]* - optional horizontal whitespace
    • (?:[!#].*)? - an optional sequence of
      • [!#] - ! or #
      • .* - the rest of the line
    • \n - a newline char.
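
To keep both parts instead of discarding the header, the same pattern can be used with re.match and the end of the match taken as the split point. A minimal sketch; `split_header` and the sample text are hypothetical names chosen for illustration.

```python
import re

# Same pattern as above: leading blank lines and lines starting with ! or #,
# each optionally indented with horizontal whitespace.
HEADER = re.compile(r'^(?:[^\S\n]*(?:[!#].*)?\n)+')


def split_header(text):
    """Split text into (header, code) at the end of the leading header."""
    m = HEADER.match(text)
    end = m.end() if m else 0  # no header: everything is code
    return text[:end], text[end:]


header, code = split_header(
    "\n  # indented comment\n!mycommand\n\nsome code\n# kept\n"
)
```

Because the split is a plain slice, `header + code` always reproduces the input unchanged.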

6 Comments

indeed it is probably a good idea to avoid regex. Let me think about it
I did not write that in my question, but my end goal is actually to have the code part of my string. In other words I want to remove the initial newlines, comments, and !mycommand. I edited your code so that result looks like the mystring (with \n or spaces), then I "subtract" result from mystring. However I am not sure how robust this method will be
@aless80 So do you mean you need to remove all up to some code. match until the previous line?
@aless80 You may still get it without a regex, but it is up to you to choose the path.
The downside of your 2nd approach is that the for loop has to go through all the lines. In my approach I get the matching in a way similar to your first approach, then do mystring[len(result):]. However I think I am opting for the regex approach (hopefully more "reliable")
