
I am trying to extract all digits referring to teaching experience, which should be 8, 17, 7. I have tried (years.*?teaching.*?:.*?[0-9]+|\d+.+teaching) but it grabs everything from the first digit because of the second condition.

Sample text:

10+ years small business ownership, 10+ years sme consulting, 10+ years corporate/vocational business training, 8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience

  • Something like \b\d+\+?(?:,\s?\d+\+?)*(?= years teaching\b)? Or perhaps \b\d+\+?(?=(?:,\s?\d+\+?)* years teaching\b) if you want "17" and "7+" to be captured separately. Commented Sep 18, 2022 at 0:03
  • Using a single regex for this will just end up very complex and hard to read. Why not split things up first? Split on the commas and, if a piece mentions teaching experience, get the number from it. Commented Sep 18, 2022 at 0:06
  • @w.Palestine Just to mention, I see a small problem: including what comes after the comma might cause unwanted matches. Commented Sep 18, 2022 at 0:39

4 Answers


You can keep your regex as it is and change the approach instead. I would split the text into smaller pieces first and then apply the regex to each piece, so the greedy second alternative cannot run across unrelated entries.

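import re

# Named `text` rather than `input` to avoid shadowing the built-in.
text = '10+ years small business ownership, 10+ years sme consulting, 10+ years corporate/vocational business training, 8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience'

# Raw string so the backslash in \d reaches the regex engine unchanged.
regex = re.compile(r'(years.*?teaching.*?:.*?[0-9]+|\d+.+teaching)')

# Split on commas and keep only the pieces that mention teaching.
lines = text.split(',')
filtered_lines = [line for line in lines if 'teaching' in line]
# Match each piece on its own; .group() returns the matched text.
experiences = [regex.match(line.strip()).group() for line in filtered_lines]

print(experiences)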

You could further modify this to fit your needs.
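One possible modification, as a minimal sketch: if only the numbers are wanted, the per-piece regex can be reduced to a plain digit search. This assumes each comma-separated piece holds at most one number and that the word "teaching" marks the relevant pieces; the variable names are illustrative only.

```python
import re

text = (
    "10+ years small business ownership, 10+ years sme consulting, "
    "10+ years corporate/vocational business training, "
    "8 years teaching experience, years of teaching experience: 17, "
    "7+ years teaching/Corporate Training experience"
)

years = []
for segment in text.split(","):
    # Only look at pieces that mention teaching (case-insensitive).
    if "teaching" in segment.lower():
        # Assumption: the first run of digits in the piece is the year count.
        number = re.search(r"\d+", segment)
        if number:
            years.append(int(number.group()))

print(years)  # [8, 17, 7]
```

Pieces without a number are skipped rather than raising an error, which the original `regex.match(...).group()` call would do on a non-matching piece.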


2 Comments

This is a Python question.
@LajosArpad My bad, I missed out on that earlier. Thank you for spotting that. I have updated my answer and provided the same solution in python now.

With the following assumptions:

  • Each comma separated substring contains not more than one number
  • teaching is always related to experience years in separated substrings

An idea with a lookahead (for use with re.findall, with the re.I flag to ignore case):

re.findall(r"(?:,|^)(?=[^,]*?teaching)[^\d,]*(\d+\+?)", s, flags=re.I)
  • (?:,|^) Starting point is either ^ start of string or a comma
  • (?=[^,]*?teaching) Condition to check if teaching occurs before next ,
  • On success, [^\d,]*(\d+\+?) skips any non-digit, non-comma characters and captures the number, with an optional +, in the first group

See this demo at regex101 (more info on right side) or a Python demo at tio.run
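For reference, a small self-contained run of the pattern on the sample text from the question. The conversion to int at the end is an optional extra step, not part of the pattern itself.

```python
import re

s = (
    "10+ years small business ownership, 10+ years sme consulting, "
    "10+ years corporate/vocational business training, "
    "8 years teaching experience, years of teaching experience: 17, "
    "7+ years teaching/Corporate Training experience"
)

pattern = re.compile(r"(?:,|^)(?=[^,]*?teaching)[^\d,]*(\d+\+?)", flags=re.I)

# findall returns the contents of the single capturing group for each match.
found = pattern.findall(s)
print(found)  # ['8', '17', '7+']

# Optional: drop the trailing "+" and convert to integers.
years = [int(m.rstrip("+")) for m in found]
print(years)  # [8, 17, 7]
```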

3 Comments

Given the proximity of "teaching" word, I've tried something like this: (?:([0-9]+)(?:\+)? (?:.){1,20}teaching)|(?:teaching(?:.){1,20}?([0-9]+)(?: |,)) which works. But your answer is the way to go.
@GDN Thank you for the comment! Maybe you want to add that as an answer as well. It looks like there is some unnecessary grouping in your pattern, and I'd probably use [^,] instead of the dot so that it does not skip over commas. Optimized a bit, your pattern is considerably more efficient than the lookahead, especially on larger strings, though it has two capturing groups.
Thank you for your comment and encouragement. I posted an answer which relies both on your suggestion on regex optimization and your code example. I used list comprehension to get rid of two capturing groups.

Motivated by @bobble bubble's encouragement, I propose this regex (polished up after bobble bubble's comment):

([0-9]+).{1,15}teaching|teaching.{1,15}?([0-9]+)

Given the proximity of "teaching" to the number of years this regex splits the match in two parts:

  1. "teaching" comes after the number of years but within close reach (1 to 15 characters of any kind in between).
  2. "teaching" comes first; note the .{1,15}? here: the trailing ? makes the quantifier lazy. A greedy quantifier would consume the "1" in "experience: 17" and capture only "7".
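The effect of the lazy quantifier in the second case can be checked in isolation. This is a small sketch on a fragment of the sample text; the fragment and variable names are chosen here for illustration.

```python
import re

fragment = "years of teaching experience: 17, 7+ years"

# Lazy: .{1,15}? expands one character at a time and stops at the first digit.
lazy = re.search(r"teaching.{1,15}?([0-9]+)", fragment).group(1)

# Greedy: .{1,15} grabs as much as it can, then backtracks just far enough
# to leave a single digit for the capturing group.
greedy = re.search(r"teaching.{1,15}([0-9]+)", fragment).group(1)

print(lazy)    # 17
print(greedy)  # 7
```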

The drawback is that it produces two capturing groups. You can collapse them in Python as follows:

import re

s = "10+ years small business ownership, 10+ years sme consulting, 10+ years corporate/vocational business training, 8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience"

matches = re.findall(r"([0-9]+).{1,15}teaching|teaching.{1,15}?([0-9]+)", s)

matches = [int(x) if x != '' else int(y) for (x, y) in matches]

print(matches)  # A list of teaching years as int

print(re.sub(r'.*?training,\s','', txt))

8 years teaching experience, years of teaching experience: 17, 7+ years teaching/Corporate Training experience
