
I have the following string:

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I want to replace with '' every part of this string which contains a number, except for those parts of the string that are in the year range 1950 to 2025. The resultant string would look like this (don't worry about the extraneous whitespace):

'2014          keep this text      2015 2025 '

So, effectively I want the brute-force removal of anything and everything remotely "numerical," except for something standalone (i.e. not part of another string, and of length 4 excluding whitespace) that resembles a year.

I know I can use this to remove everything containing digits:

re.sub(r'\w*[0-9]\w*', '', s)

But that doesn't return what I want:

'           keep this text        '

Here's my attempt at replacing anything that doesn't match the patterns listed below:

re.sub(r'^([A-Za-z]+|19[5-9]\d|20[0-1]\d|202[0-5])', '*', s)

Which returns:

'* 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

I've been here and here, but wasn't able to find what I was looking for.

  • Do you need to keep all these whitespaces in the result? Commented Jun 5, 2017 at 14:56
  • No, I'll eventually just strip those out. But that's an easy task. I'm more concerned about the number-like removal excluding years. Commented Jun 5, 2017 at 14:58

3 Answers


Regex isn't well suited to numeric range checks. I would ditch regex and filter the words with a generator expression:

# keep a word if it is a year in 1950-2025, or if it contains no digits at all
predicate = lambda w: (w.isdigit() and 1950 <= int(w) <= 2025) or not any(char.isdigit() for char in w)
print(' '.join(w for w in s.split() if predicate(w)))

2 Comments

Maybe... (w.isdigit() and 1950<=int(w)<=2025) or w.isalpha() ?
@JonClements isalpha() isn't the same thing as not containing any digits. Any sort of punctuation or other special character would cause a word to be discarded.
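
To make the difference discussed in these comments concrete, here is a small sketch comparing the two tests. The sample words are made up for illustration and do not come from the question.

```python
# Hypothetical sample words, chosen to include punctuation and a digit.
words = ["don't", "well-known", "text", "n0t"]

# Test used in the answer: keep a word if it contains no digits.
no_digits = [w for w in words if not any(ch.isdigit() for ch in w)]

# Test proposed in the comment: keep a word only if every character is a letter.
alpha_only = [w for w in words if w.isalpha()]

print(no_digits)   # ["don't", 'well-known', 'text']
print(alpha_only)  # ['text']
```

With isalpha(), the punctuated words are dropped even though they contain no digits, which is the behaviour the reply warns about.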

Short solution using re.findall() function:

import re

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'
result = ''.join(re.findall(r'\b(19[5-9][0-9]|20[01][0-9]|202[0-5]|[a-z]+|[^0-9a-z]+)\b', s, re.I))

print(result)

The output:

2014           keep this text      2015 2025 
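
If you would rather stay close to the re.sub() call from the question, one possible variant is to remove digit-containing words and protect the valid years with a negative lookahead. This is only a sketch, not part of the answer above. The names YEAR and pattern are my own, and the year alternatives are copied from the findall pattern.

```python
import re

s = '2014 2026 202 20 1000 1949 194 195092 20111a a2011a a2011 keep this text n0t th1s th0ugh 1 0 2015 2025 2026'

# Year alternatives (1950-2025), same ranges as in the findall pattern.
YEAR = r'(?:19[5-9][0-9]|20[01][0-9]|202[0-5])'

# At a word boundary, refuse to match if the whole word is a valid year;
# otherwise consume any word that contains at least one digit.
pattern = re.compile(r'\b(?!' + YEAR + r'\b)\w*[0-9]\w*')

result = pattern.sub('', s)
print(result.split())
# ['2014', 'keep', 'this', 'text', '2015', '2025']
```

The removed words leave their surrounding spaces behind, so result still needs the whitespace clean-up mentioned in the question.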


I would do it like this because it's readable and easy to fix or improve:

import re

' '.join(
    filter(
        # keep standalone years in 1950-2025 and purely alphabetic words
        lambda word: (word.isdigit() and 1950 <= int(word) <= 2025)
                     or re.match(r'^[a-zA-Z]+$', word),
        s.split()
    )
)
# '2014 keep this text 2015 2025'

1 Comment

Nice, but what about years in the range 1950-1999?
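
As far as I can tell, the filter as posted already keeps years in 1950-1999, since the lower bound is 1950. Here is a quick check, a sketch only: keep_years_and_words is a hypothetical helper that applies the same condition as the answer, and the test string is made up.

```python
import re

def keep_years_and_words(text):
    """Apply the same condition as the answer's filter (hypothetical helper)."""
    return ' '.join(
        word for word in text.split()
        if (word.isdigit() and 1950 <= int(word) <= 2025)
        or re.match(r'^[a-zA-Z]+$', word)
    )

print(keep_years_and_words('1949 1950 1975 1999 2000 2026 abc'))
# 1950 1975 1999 2000 abc
```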
