2

I am matching the movie titles which usually are in the form

[BLA VLA] The Matrix 1999 bla bla [bla bla]

My regex is

match = re.match("\[?.*?\](.*?)([0-9]{4})(.*)\[?.*\]?", title)

This works fine for most of time but it fails for movies like

[bla bla] 1990 The Bronx Warriors 1982
[ bl bla] 2012 2009 [ bla bla ]

How can i fix that

1
  • match = re.match("\[?.*?\](.*)([0-9]{4})(.*)\[?.*\]?", title) . You were almost there. Now the first group will match movie title, and the second group it's year. Commented Jun 11, 2019 at 4:56

4 Answers 4

2

If we would be having the same uppercase and lowercase patterns similar to those listed in the question, we would be starting with a simple expression, such as:

([A-Z][a-z]+\s)+

Demo

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"([A-Z][a-z]+\s)+"

test_str = ("[bla bla] 1990 The Bronx Warriors 1982\n"
    "[ bl bla] 2012 2009 [ bla bla ]\n"
    "[BLA VLA] The Matrix 1999 bla bla [bla bla]\n")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

RegEx

If this expression wasn't desired or you wish to modify it, please visit regex101.com.

RegEx Circuit

jex.im visualizes regular expressions:

enter image description here

Sign up to request clarification or add additional context in comments.

Comments

1

For you example data, one option could be using 2 capturing groups:

\[[^\]]+\] (.+?) (\d{4})

Explanation

  • \[[^\]]+\] Match part with square brackets
  • (.+?) Capture in group 1matching a space, 1+ times any char non greedy and space
  • (\d{4}) Capture in group 2 matching 4 digits

Regex demo

Comments

0

Try this

re.match( r"\[.*?\]\s([\w\s]+)", title).groups()[0].strip()

Code

Going further, consider reusing your code in a function. Here is equivalent code:

import re


def get_title(s):
    """Return the title from a string."""
    pattern = r"\[.*?\]\s([\w\s]+)"
    p = re.compile(pattern)
    m = p.match(s)
    g = m.groups()
    return g[0].strip()

Demo

get_title("[BLA VLA] The Matrix 1999 bla bla [bla bla]")
# 'The Matrix 1999 bla bla'

get_title("[bla bla] 1990 The Bronx Warriors 1982")
# '1990 The Bronx Warriors 1982'

get_title("[ bl bla] 2012 2009 [ bla bla ]")
# '2012 2009'

Details

See the pattern here:

  • \[.*?\]\s: beyond the leading brackets and whitespace
  • ([\w\s]+): capture optional alpha-numerals and whitespace

2 Comments

Sorry i didn't explain fully , i want to extract title and year in separate groups like i have in my regex
To be clear, can you explicitly add examples of sample inputs and their expected outputs?
0
movies = '''[bla bla] 1990 The Bronx Warriors 1982
[ bl bla] 2012 2009 [ bla bla ]
[ bl bla] Normal movie title 2009 [ bla bla ]'''

import re

for movie, year in re.findall(r']\s+(.*)\s+(\d{4}).*?$', movies, flags=re.MULTILINE):
    print('Movie title: [{}] Movie year: [{}]'.format(movie, year))

Prints:

Movie title: [1990 The Bronx Warriors] Movie year: [1982]
Movie title: [2012] Movie year: [2009]
Movie title: [Normal movie title] Movie year: [2009]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.