Regex: replace all numbers and "number-like" strings except for years in range

Question

I have the following string:

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I want to replace with '' every part of this string which contains a number, except for those parts of the string that are in the year range 1950 to 2025. The resultant string would look like this (don't worry about the extraneous whitespace):

'2014          keep this text      2015 2025 '

So, effectively I want the brute-force removal of anything and everything remotely "numerical," except for something standalone (i.e. not part of another string, and of length 4 excluding whitespace) that resembles a year.

I know I can use this to remove everything containing digits:

re.sub(r'\w*[0-9]\w*', '', s)

But that doesn't return what I want:

'           keep this text        '

Here's my attempt at replacing anything that doesn't match the patterns listed below:

re.sub(r'^([A-Za-z]+|19[5-9]\d|20[0-1]\d|202[0-5])', '*', s)

Which returns:

'* 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I've been here and here, but wasn't able to find what I was looking for.

No, I'll eventually just strip those out. But that's an easy task. I'm more concerned about the number-like removal excluding years.
– boot-scootin, Commented Jun 5, 2017 at 14:58

Aran-Fey · Accepted Answer · 2017-06-05 15:09:02Z

2

Regex isn't good at working with numbers. I would ditch regex and use a generator expression:

predicate = lambda w: (w.isdigit() and 1950 <= int(w) <= 2025) or not any(char.isdigit() for char in w)
print(' '.join(w for w in s.split() if predicate(w)))
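For reference, here is a self-contained sketch of the same idea written as a named function. The name `keep_word`, its `first_year`/`last_year` parameters, and the explicit four-character check are assumptions added for illustration (the question asks for standalone tokens of length 4); they are not part of the answer's code.

```python
def keep_word(word, first_year=1950, last_year=2025):
    """Keep 4-digit years inside the range and words that contain no digit."""
    # Assumption: a "year" is exactly four decimal digits, as the question describes.
    if len(word) == 4 and word.isdecimal():
        return first_year <= int(word) <= last_year
    # Anything else survives only if it has no digit at all.
    return not any(ch.isdigit() for ch in word)


s = ('2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 '
     'keep this text n0t th1s th0ugh 1 0 2015 2025 2026')

cleaned = ' '.join(w for w in s.split() if keep_word(w))
print(cleaned)  # 2014 keep this text 2015 2025
```

Note that splitting and re-joining collapses the whitespace, which the question says is acceptable.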

answered Jun 5, 2017 at 15:09

Aran-Fey


2 Comments

Jon Clements Over a year ago

Maybe... (w.isdigit() and 1950<=int(w)<=2025) or w.isalpha() ?

Aran-Fey Over a year ago

@JonClements isalpha() isn't the same thing as not containing any digits. Any sort of punctuation or other special character would cause a word to be discarded.
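A small sketch of the difference described in this comment. The sample words and the helper name `has_no_digits` are made up for illustration.

```python
def has_no_digits(word):
    """True when the word contains no digit character at all."""
    return not any(ch.isdigit() for ch in word)


# Hypothetical sample tokens, including some with punctuation.
samples = ['text', "don't", 'well-known', 'text,', 'th1s', '2014']

only_alpha = [w for w in samples if w.isalpha()]
digit_free = [w for w in samples if has_no_digits(w)]

print(only_alpha)  # ['text']
print(digit_free)  # ['text', "don't", 'well-known', 'text,']
```

With `isalpha()`, words such as "don't" or "text," would be discarded even though they contain no digits.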

RomanPerekhrest · Accepted Answer · 2017-06-05 15:16:37Z

1

Short solution using re.findall() function:

import re

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'
result = ''.join(re.findall(r'\b(19[5-9][0-9]|20[01][0-9]|202[0-5]|[a-z]+|[^0-9a-z]+)\b', s, re.I))

print(result)

The output:

2014           keep this text      2015 2025
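Since the question asks about `re.sub()` specifically, a possible variation is to pass a replacement function to `re.sub()`, so that every whitespace-delimited token containing a digit is removed unless it is a year in range. This is only a sketch under that assumption; the names `YEAR` and `drop_unless_year` are hypothetical and do not come from this answer.

```python
import re

# Years 1950-1999, 2000-2019 and 2020-2025.
YEAR = re.compile(r'19[5-9]\d|20[01]\d|202[0-5]')


def drop_unless_year(match):
    """Return the token unchanged if it is a year in range, else an empty string."""
    token = match.group()
    return token if YEAR.fullmatch(token) else ''


s = ('2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 '
     'keep this text n0t th1s th0ugh 1 0 2015 2025 2026')

# \S*\d\S* matches a whole run of non-whitespace that contains at least one digit.
result = re.sub(r'\S*\d\S*', drop_unless_year, s)
print(result.split())  # ['2014', 'keep', 'this', 'text', '2015', '2025']
```

The original spacing is left in place, so `result` still contains the extraneous whitespace mentioned in the question.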

answered Jun 5, 2017 at 15:16

RomanPerekhrest


Fomalhaut · Accepted Answer · 2017-06-05 15:04:08Z

1

I would do it like this because it's readable and easy to fix or improve:

import re

' '.join(
    filter(
        # Keep in-range years and purely alphabetic words.
        lambda word: (word.isdigit() and 1950 <= int(word) <= 2025)
                     or re.match(r'^[a-zA-Z]+$', word),
        s.split()
    )
)
# '2014 keep this text 2015 2025'

edited Jun 5, 2017 at 15:04

answered Jun 5, 2017 at 15:00

Fomalhaut


1 Comment

boot-scootin Over a year ago

Nice, but what about years in the range 1950-1999?
