Regex between digits and strings

Question

I am trying to extract all digits referring to teaching experience, which should be 8, 17, 7. I have tried (years.*?teaching.*?:.*?[0-9]+|\d+.+teaching) but it grabs everything from the first digit because of the second condition.

Sample text:

10+ years small business ownership, 10+ years sme consulting, 10+ years corporate/vocational business training, 8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience
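A minimal reproduction of the over-match, assuming the sample text is stored in a variable named text (a name chosen here only for illustration). At position 0 the first alternative cannot match, so the second one takes over: \d+ matches "10" and the greedy .+ runs to the last "teaching" in the string.

```python
import re

# Sample text from the question; the variable name is an assumption.
text = (
    "10+ years small business ownership, 10+ years sme consulting, "
    "10+ years corporate/vocational business training, "
    "8 years teaching experience, years of teaching experience: 17, "
    "7+ years teaching/Corporate Training experience"
)

attempt = r"(years.*?teaching.*?:.*?[0-9]+|\d+.+teaching)"
matches = re.findall(attempt, text)

print(len(matches))      # 1: a single match instead of three
print(matches[0][:15])   # the match starts at the very first number
print(matches[0][-17:])  # and ends at the last "teaching"
```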

Something like \b\d+\+?(?:,\s?\d+\+?)*(?= years teaching\b)? Or perhaps \b\d+\+?(?=(?:,\s?\d+\+?)* years teaching\b) if you want "17" and "7+" to be captured separately.
– 41686d6564, Commented Sep 18, 2022 at 0:03
Using a single regex for this will just end up with a very complex, hard-to-read pattern. Why not split things up first? Divide the text at the commas, and if a piece mentions teaching experience, get the number from it.
– Maarten Bodewes, Commented Sep 18, 2022 at 0:06
@w.Palestine Just to mention, I see a little problem, e.g. here. Including what comes after the comma might cause unwanted matches.
– bobble bubble, Commented Sep 18, 2022 at 0:39

aghashamim · Accepted Answer · 2022-09-18 01:20:15Z

2

Rather than reworking the regex, I would approach the problem differently: split the text into smaller pieces first, then apply your regex, unchanged, to each piece.

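import re

text = '10+ years small business ownership, 10+ years sme consulting, 10+ years corporate/vocational business training, 8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience'

# Raw string, so that \d reaches the regex engine unchanged
regex = re.compile(r'(years.*?teaching.*?:.*?[0-9]+|\d+.+teaching)')

# Split into comma-separated entries and keep only those that mention teaching
entries = [entry.strip() for entry in text.split(',')]
teaching_entries = [entry for entry in entries if 'teaching' in entry]

# match() starts at the beginning of each entry, so a match cannot run across entries
experiences = [regex.match(entry).group() for entry in teaching_entries]

print(experiences)
# ['8 years teaching', 'years of teaching experience: 17', '7+ years teaching']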

You could further modify this to fit your needs.
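One possible modification, sketched under the assumption that each comma-separated entry holds at most one number: once the text is split, a plain \d+ search per entry is enough to get the digits the question asks for. The function name teaching_years is hypothetical.

```python
import re

text = (
    "10+ years small business ownership, 10+ years sme consulting, "
    "10+ years corporate/vocational business training, "
    "8 years teaching experience, years of teaching experience: 17, "
    "7+ years teaching/Corporate Training experience"
)

def teaching_years(text):
    """Return the first number of every comma-separated entry that mentions teaching."""
    years = []
    for entry in text.split(','):
        if 'teaching' not in entry.lower():
            continue  # entry is about some other kind of experience
        number = re.search(r'\d+', entry)
        if number:  # skip teaching entries that contain no number at all
            years.append(int(number.group()))
    return years

print(teaching_years(text))  # [8, 17, 7]
```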

edited Sep 18, 2022 at 1:20

answered Sep 18, 2022 at 0:14

2 Comments

Lajos Arpad Over a year ago

This is a Python question.

aghashamim Over a year ago

@LajosArpad My bad, I missed out on that earlier. Thank you for spotting that. I have updated my answer and provided the same solution in python now.

bobble bubble · Accepted Answer · 2022-09-18 09:39:18Z

2

With the following assumptions:

Each comma-separated substring contains at most one number
teaching always refers to the years of experience within its own substring

An idea with a lookahead (for use with re.findall and the re.I flag to ignore case):

re.findall(r"(?:,|^)(?=[^,]*?teaching)[^\d,]*(\d+\+?)", s, flags=re.I)

(?:,|^) The starting point is either ^ (start of string) or a comma
(?=[^,]*?teaching) Condition that checks whether teaching occurs before the next ,
On success, [^\d,]*(\d+\+?) skips any non-digit, non-comma characters and captures the number and an optional + into the first group

See this demo at regex101 (more info on right side) or a Python demo at tio.run
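For readers without access to the demos, a small runnable sketch of the same pattern on the sample text from the question. The conversion to int at the end is an addition for illustration, not part of the pattern itself.

```python
import re

s = (
    "10+ years small business ownership, 10+ years sme consulting, "
    "10+ years corporate/vocational business training, "
    "8 years teaching experience, years of teaching experience: 17, "
    "7+ years teaching/Corporate Training experience"
)

pattern = r"(?:,|^)(?=[^,]*?teaching)[^\d,]*(\d+\+?)"

# With a single capturing group, findall returns that group's text per match
raw = re.findall(pattern, s, flags=re.I)
print(raw)    # ['8', '17', '7+']

# Optional: drop the trailing + and convert to int
years = [int(m.rstrip('+')) for m in raw]
print(years)  # [8, 17, 7]
```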

edited Sep 18, 2022 at 9:39

answered Sep 18, 2022 at 1:00

3 Comments

GDN Over a year ago

Given the proximity of "teaching" word, I've tried something like this: (?:([0-9]+)(?:\+)? (?:.){1,20}teaching)|(?:teaching(?:.){1,20}?([0-9]+)(?: |,)) which works. But your answer is the way to go.

bobble bubble Over a year ago

@GDN Thank you for the comment! Maybe you want to add that as an answer as well. It looks like there is some unnecessary grouping in your pattern, and I'd probably use [^,] instead of the dot so that it does not skip over commas. Optimized a bit, your pattern is considerably more efficient than the lookahead, especially on larger strings, though it has two capturing groups.

GDN Over a year ago

Thank you for your comment and encouragement. I posted an answer which relies both on your suggestion on regex optimization and your code example. I used list comprehension to get rid of two capturing groups.

GDN · Accepted Answer · 2022-09-18 19:04:34Z

Motivated by @bobble bubble's encouragement, I propose this regex (polished up after bobble bubble's comment):

([0-9]+).{1,15}teaching|teaching.{1,15}?([0-9]+)

Because "teaching" always appears close to the number of years, this regex splits the match into two cases:

"teaching" comes after the number of years, but within close reach (1 to 15 characters of any kind in between).
"teaching" comes first. Note the ? at the end of .{1,15}?, which makes the quantifier non-greedy; a greedy quantifier would also consume the "1" in "experience: 17", leaving only "7" in the group.
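The effect of the non-greedy quantifier can be checked in isolation. This is a small sketch that uses only the second alternative of the pattern on the relevant piece of the sample text; the variable names are chosen for illustration.

```python
import re

piece = "years of teaching experience: 17"

# Non-greedy: .{1,15}? stops as soon as a digit can follow
lazy = re.findall(r"teaching.{1,15}?([0-9]+)", piece)

# Greedy: .{1,15} takes as many characters as it can, including the "1"
greedy = re.findall(r"teaching.{1,15}([0-9]+)", piece)

print(lazy)    # ['17']
print(greedy)  # ['7']
```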

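The drawback is that the pattern has two capturing groups, so every match comes back as a pair with one empty element. You can collapse each pair into a single value in Python as follows: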

import re

s = "10+ years small business ownership, 10+ years sme consulting, 10+ years corporate/vocational business training, 8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience"

# Each match is a (number before, number after) tuple; exactly one element is non-empty
matches = re.findall(r"([0-9]+).{1,15}teaching|teaching.{1,15}?([0-9]+)", s)

# Keep whichever group matched and convert it to int
years = [int(x or y) for (x, y) in matches]

print(years)  # A list of teaching years as int: [8, 17, 7]
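bobble bubble's comment suggests [^,] in place of the dot so that a match cannot reach across a comma. A sketch of that variant follows; the short text in the second check is a made-up example, not taken from the question.

```python
import re

s = (
    "10+ years small business ownership, 10+ years sme consulting, "
    "10+ years corporate/vocational business training, "
    "8 years teaching experience, years of teaching experience: 17, "
    "7+ years teaching/Corporate Training experience"
)

dot_pattern = r"([0-9]+).{1,15}teaching|teaching.{1,15}?([0-9]+)"
no_comma_pattern = r"([0-9]+)[^,]{1,15}teaching|teaching[^,]{1,15}?([0-9]+)"

def years(pattern, text):
    """Collapse the two capturing groups into a single list of ints."""
    return [int(x or y) for x, y in re.findall(pattern, text)]

# Both variants agree on the sample text
print(years(no_comma_pattern, s))  # [8, 17, 7]

# Hypothetical text where the number belongs to a different entry
tricky = "5 years sales, teaching assistant"
print(years(dot_pattern, tricky))       # [5]: the dot crosses the comma
print(years(no_comma_pattern, tricky))  # []
```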

LetzerWille · Accepted Answer · 2022-09-18 10:44:08Z

0

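Assuming txt holds the sample text from the question, this removes everything up to and including "training, ", leaving only the teaching-related entries (the numbers still have to be extracted from them):

print(re.sub(r'.*?training,\s', '', txt))

Output:

8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience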

answered Sep 18, 2022 at 10:44
