Python regex matching digits after text

Question

I am matching movie titles, which are usually in the form:

[BLA VLA] The Matrix 1999 bla bla [bla bla]

My regex is

match = re.match(r"\[?.*?\](.*?)([0-9]{4})(.*)\[?.*\]?", title)

This works fine most of the time, but it fails for movies like:

[bla bla] 1990 The Bronx Warriors 1982
[ bl bla] 2012 2009 [ bla bla ]

How can I fix that?

match = re.match(r"\[?.*?\](.*)([0-9]{4})(.*)\[?.*\]?", title). You were almost there. Now the first group will match the movie title, and the second group its year.
– igrinis, Commented Jun 11, 2019 at 4:56
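
The only difference between the pattern in the question and the one in the comment is that the first capture group is greedy, (.*) instead of (.*?), so the year group ends up on the last four-digit run of the line rather than the first. The following is a minimal sketch of that difference on the three sample titles from the question; the helper name `split_title_year` and the constant names are made up for illustration.

```python
import re

# Pattern from the question (lazy first group) and from the comment (greedy first group).
LAZY = re.compile(r"\[?.*?\](.*?)([0-9]{4})(.*)\[?.*\]?")
GREEDY = re.compile(r"\[?.*?\](.*)([0-9]{4})(.*)\[?.*\]?")


def split_title_year(pattern, title):
    """Return (title, year) from groups 1 and 2, or None if nothing matches."""
    m = pattern.match(title)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2)


samples = [
    "[BLA VLA] The Matrix 1999 bla bla [bla bla]",
    "[bla bla] 1990 The Bronx Warriors 1982",
    "[ bl bla] 2012 2009 [ bla bla ]",
]

for s in samples:
    print(split_title_year(LAZY, s), split_title_year(GREEDY, s))
# ('The Matrix', '1999') ('The Matrix', '1999')
# ('', '1990') ('1990 The Bronx Warriors', '1982')
# ('', '2012') ('2012', '2009')
```

Because the greedy group backtracks from the end of the line, trailing text that itself contains four consecutive digits would be taken as the year.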

Emma Marcier · Accepted Answer · 2019-06-11 04:10:26Z

If the titles follow the same uppercase and lowercase pattern as the examples in the question, we could start with a simple expression such as:

([A-Z][a-z]+\s)+

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"([A-Z][a-z]+\s)+"

test_str = ("[bla bla] 1990 The Bronx Warriors 1982\n"
    "[ bl bla] 2012 2009 [ bla bla ]\n"
    "[BLA VLA] The Matrix 1999 bla bla [bla bla]\n")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(
        matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Group numbers start at 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum),
            group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
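
As a quick check of what this expression covers, the sketch below applies it to each sample line separately. It only matches runs of capitalized words, so a title made of digits, such as 2012, is not matched, and the year is not captured at all. The helper name `capitalized_run` is hypothetical.

```python
import re

pattern = re.compile(r"([A-Z][a-z]+\s)+")


def capitalized_run(line):
    """Return the first run of capitalized words in line, or None if there is none."""
    m = pattern.search(line)
    # group() is the whole run; group(1) would only hold the last repetition.
    return m.group().strip() if m else None


print(capitalized_run("[bla bla] 1990 The Bronx Warriors 1982"))       # The Bronx Warriors
print(capitalized_run("[ bl bla] 2012 2009 [ bla bla ]"))              # None
print(capitalized_run("[BLA VLA] The Matrix 1999 bla bla [bla bla]"))  # The Matrix
```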

RegEx

If this expression is not what you want, or you wish to modify it, you can explore it on regex101.com.


The fourth bird · Accepted Answer · 2019-06-11 07:17:05Z

1

For your example data, one option could be to use 2 capturing groups:

\[[^\]]+\] (.+?) (\d{4})

Explanation

\[[^\]]+\] Match the leading part in square brackets, followed by a space
(.+?) Capture in group 1 any character 1+ times, non-greedy, followed by a space
(\d{4}) Capture in group 2 exactly 4 digits
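
The pattern can be tried on the sample titles with `re.search`. This is a minimal sketch; the function name `parse_title` is an assumption for illustration and is not part of the answer.

```python
import re

PATTERN = re.compile(r"\[[^\]]+\] (.+?) (\d{4})")


def parse_title(line):
    """Return (title, year) as a tuple of strings, or None if the line does not match."""
    m = PATTERN.search(line)
    return m.groups() if m else None


print(parse_title("[BLA VLA] The Matrix 1999 bla bla [bla bla]"))  # ('The Matrix', '1999')
print(parse_title("[bla bla] 1990 The Bronx Warriors 1982"))       # ('1990 The Bronx Warriors', '1982')
print(parse_title("[ bl bla] 2012 2009 [ bla bla ]"))              # ('2012', '2009')
```

Group 1 needs at least one character before the separating space, which is why a leading 1990 or 2012 stays in the title instead of being taken as the year.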

edited Jun 11, 2019 at 7:17

answered Jun 11, 2019 at 7:10


pylang · Accepted Answer · 2019-06-11 04:52:39Z

0

Try this:

re.match(r"\[.*?\]\s([\w\s]+)", title).groups()[0].strip()

Code

Going further, consider reusing your code in a function. Here is equivalent code:

import re


def get_title(s):
    """Return the title from a string."""
    pattern = r"\[.*?\]\s([\w\s]+)"
    p = re.compile(pattern)
    m = p.match(s)
    g = m.groups()
    return g[0].strip()

Demo

get_title("[BLA VLA] The Matrix 1999 bla bla [bla bla]")
# 'The Matrix 1999 bla bla'

get_title("[bla bla] 1990 The Bronx Warriors 1982")
# '1990 The Bronx Warriors 1982'

get_title("[ bl bla] 2012 2009 [ bla bla ]")
# '2012 2009'

Details

See the pattern here:

\[.*?\]\s: match past the leading bracketed part and the whitespace character after it
([\w\s]+): capture one or more word characters (letters, digits, underscore) and whitespace
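
Note that `re.match` returns None when a string does not start with a bracketed part, in which case calling `.groups()` on the result raises an AttributeError. A minimal sketch of a guarded variant follows; the name `get_title_or_none` is hypothetical and not part of the answer.

```python
import re

PATTERN = re.compile(r"\[.*?\]\s([\w\s]+)")


def get_title_or_none(s):
    """Like get_title, but return None instead of raising when s does not match."""
    m = PATTERN.match(s)
    if m is None:
        return None
    return m.group(1).strip()


print(get_title_or_none("[BLA VLA] The Matrix 1999 bla bla [bla bla]"))  # The Matrix 1999 bla bla
print(get_title_or_none("The Matrix 1999"))                              # None
```

The capture stops at the first character that is neither a word character nor whitespace, so a title containing punctuation such as a colon would be cut short.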

edited Jun 11, 2019 at 4:52

answered Jun 11, 2019 at 4:40


2 Comments

rgd Over a year ago

Sorry, I didn't explain fully. I want to extract the title and year in separate groups, like I have in my regex.

pylang Over a year ago

To be clear, can you explicitly add examples of sample inputs and their expected outputs?

Andrej Kesely · Accepted Answer · 2019-06-11 04:53:44Z

0

movies = '''[bla bla] 1990 The Bronx Warriors 1982
[ bl bla] 2012 2009 [ bla bla ]
[ bl bla] Normal movie title 2009 [ bla bla ]'''

import re

for movie, year in re.findall(r']\s+(.*)\s+(\d{4}).*?$', movies, flags=re.MULTILINE):
    print('Movie title: [{}] Movie year: [{}]'.format(movie, year))

Prints:

Movie title: [1990 The Bronx Warriors] Movie year: [1982]
Movie title: [2012] Movie year: [2009]
Movie title: [Normal movie title] Movie year: [2009]

answered Jun 11, 2019 at 4:53
