I need a regex that captures 2 groups: a movie and the year. Optionally, there could be a 'from ' string between them.

My expected results are:

first_query="matrix 2013" => ('matrix', '2013')
second_query="matrix from 2013" => ('matrix', '2013')
third_query="matrix" => ('matrix', None)

I've run 2 simulations on https://regex101.com/ for python3:

I- r"(.+)(?:from ){0,1}([1-2]\d{3})" Doesn't match first_query and third_query, and it also leaves 'from' in group one, which is what I want to avoid.

II- r"(.+)(?:from ){1}([1-2]\d{3})" Works with second_query, but does not match first_query and third_query.

Is it possible to match all three strings, omitting the 'from ' string from the first group?

Thanks in advance.

3 Answers

3

You may use

^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$

See the regex demo

Details

  • ^ - start of a string
  • (.+?) - Group 1: any 1+ chars other than line break chars, as few as possible
  • (?:\s+(?:from\s+)?([12]\d{3}))? - an optional non-capturing group matching 1 or 0 occurrences of:
    • \s+ - 1+ whitespaces
    • (?:from\s+)? - an optional sequence of from substring followed with 1+ whitespaces
    • ([12]\d{3}) - Group 2: 1 or 2 followed with 3 digits
  • $ - end of string.
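
To check the pattern against the three queries from the question, here is a minimal sketch in Python. The names `PATTERN` and `parse_query` are hypothetical helpers introduced only for illustration; the regex itself is the one given above.

```python
import re

# Pattern from this answer, compiled once for reuse.
PATTERN = re.compile(r"^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$")

def parse_query(query):
    """Return a (title, year) tuple; year is None when no year is present.

    Returns None if the query does not match at all (e.g. an empty string).
    """
    m = PATTERN.match(query)
    return m.groups() if m else None

for q in ("matrix 2013", "matrix from 2013", "matrix"):
    print(q, "=>", parse_query(q))
```

Because `Match.groups()` reports an optional group that did not participate as `None`, the third query yields `('matrix', None)`, as requested in the question.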

1 Comment

Thank you for the clarification!
2

This will match your patterns, but it treats the space in front of the number as optional:

import re

pat = r"^(.+?)(?: from)? ?(\d+)?$"


text = """matrix 2013
matrix from 2013
matrix"""

for t in text.split("\n"):
    print(re.findall(pat,t))

Output:

[('matrix', '2013')]
[('matrix', '2013')]
[('matrix', '')]

Explanation:

 ^           start of string
(.+?)        lazy: any characters, as few as possible
(?: from)?   non-capturing optional ` from`
 ?           optional space
(\d+)?$      optional digits until end of string

Demo: https://regex101.com/r/VD0SZb/1
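
Note that `re.findall` reports an optional group that did not match as `''`, whereas the question asks for `None`. A minimal sketch of one way to get `None`, assuming the same pattern and using `re.match` with `groups()` instead (the helper name `parse` is hypothetical):

```python
import re

pat = r"^(.+?)(?: from)? ?(\d+)?$"

def parse(query):
    """Return (title, year); year is None if the digit group did not match."""
    m = re.match(pat, query)
    return m.groups() if m else None

for t in ("matrix 2013", "matrix from 2013", "matrix"):
    print(parse(t))
```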

1
import re

pattern = re.compile( r"""
    ^\s*              # start of string (optional whitespace)
    (?P<title>\S+)    # one or more non-whitespace characters (title)
    (?:\s+from)?      # optionally, some space followed by the word 'from'
    \s*               # optional whitespace
    (?P<year>[0-9]+)? # optional digit string (year)
    \s*$              # end of string (optional whitespace)
""", re.VERBOSE )

for query in [ 'matrix 2013', 'matrix from 2013', 'matrix' ]:
    m = pattern.match( query )  # pattern is already compiled
    if m:
        print( m.groupdict() )

# Prints:
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': None}

Disclaimer: this regex does not contain the logic necessary to reject the first two matches on the grounds that The Matrix actually came out in 1999.
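
One limitation worth noting: `\S+` only accepts a single-word title. If multi-word titles are needed (an assumption, since the question only shows `matrix`), a possible variant swaps in a lazy `.+?` for the title group. The name `multiword_pattern` is hypothetical.

```python
import re

multiword_pattern = re.compile(r"""
    ^\s*                # start of string (optional whitespace)
    (?P<title>.+?)      # lazy: as few characters as possible, spaces allowed
    (?:\s+from)?        # optionally, some space followed by the word 'from'
    \s*                 # optional whitespace
    (?P<year>[0-9]+)?   # optional digit string (year)
    \s*$                # end of string (optional whitespace)
""", re.VERBOSE)

for query in ["the matrix from 1999", "matrix 2013", "matrix"]:
    m = multiword_pattern.match(query)
    if m:
        print(m.groupdict())
```

With this variant, any trailing digit string is taken as the year, so a title that itself ends in a number would be split there.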

1 Comment

Also a nice solution! It is a good way to document regex code by breaking it in lines.
