I am working on a large batch of text strings, trying to match date times and convert them to MM-DD-YYYY format using the strptime() function.

However, some 5-digit serial numbers appear in the texts (e.g., 90481), and they have misled my .findall() call into treating them as date times. How can I exclude them with a ^()-type condition?

What they have in common is that they are all 5 digits long, so I tried ^(?!\d{5}), but it didn't work. What's the best way to handle this set of numbers?

Thank you.

Note1: I have read this post, but can't seem to get it.

Note2: about the date formats that someone asked about in the comment section

There are many date formats in the data frame I am working on, for example:

05/10/2001; 05/10/01; 5/10/09; 6/2/01
May-10-2001; May 10, 2010; March 25, 2001; Mar. 25, 2001; Mar 25 2001;
25 Mar 2001; 25 March 2001; 25 Mar. 2001; 25 March, 2001
Mar 25th, 2001; Mar 25th, 2001; Mar 12nd, 2001
Feb 2001; Sep 2001; Oct 2001
5/2001; 11/2001
2001; 2015

So I have a rather long .findall(r' ') call, but the main point is to keep those 5-digit serial numbers from being selected.

Sincerely,

  • How does your findall work in the first place? Please post the full regex. Commented Aug 5, 2017 at 16:10
  • I'll add that to the original question thread. Commented Aug 5, 2017 at 16:11
  • If you could explain in plain English exactly what you need to match, I would be better able to help you. Date times can be written in many formats, so not knowing exactly what you are working with makes it hard. Commented Aug 5, 2017 at 16:12
  • The regex doesn't seem to match 90481 Commented Aug 5, 2017 at 16:15
  • I have added my (rather simple) code to the original thread. I am just trying to avoid those 5-digit serial numbers, so that Python won't treat them as date times. Commented Aug 5, 2017 at 16:15

1 Answer

You could use \b in your regex to prevent a match from being found partway through a number with more digits. Place one at the start and one at the end, and make sure they are outside the scope of the | (OR) operation by wrapping the rest in a non-capturing group.

I removed some months to keep it short:

\b(?:\d{1,2}\/\d{1,2}\/\d{2,4}|(?:Jan|Feb|Mar|Apr|   |Nov|Dec)[a-z]*-\d{2}-\d{2,4})\b
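For reference, here is a minimal sketch of how a pattern of this shape might be used with re.findall. The full month list, the names MONTHS and DATE_RE, and the sample string are assumptions made up for illustration; they are not taken from the question's actual regex or data.

```python
import re

# Assumption: month abbreviations written out in full for this sketch;
# the pattern in the answer leaves some of them out for brevity.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# \b sits outside the non-capturing group, so it applies to both alternatives.
DATE_RE = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}"
    r"|(?:" + MONTHS + r")[a-z]*-\d{2}-\d{2,4})\b"
)

# Hypothetical sample text containing a 5-digit serial number.
sample = "Serial 90481 was logged on 05/10/2001 and again on May-10-2001."

matches = DATE_RE.findall(sample)
print(matches)  # ['05/10/2001', 'May-10-2001']
```

Because the outer group is non-capturing, findall returns the whole matched text for each date rather than a sub-group.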

2 Comments

This works perfectly! Thank you so much. Do you mind me asking how (and why) \b( )\b works in this context?
\b matches at a break between a sequence of alphanumeric characters and non-alphanumeric characters (it does not match a character, only the position where the sequence breaks). So when the first character of your match is supposed to be a digit, the first \b requires that no digit (or letter or underscore) precede that matched character. A similar thing happens at the end.
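To make the effect concrete, here is a small sketch comparing a bare 4-digit year pattern with and without \b. The sample text and the simplified \d{4} pattern are assumptions for illustration only, standing in for the year-only formats (2001; 2015) listed in the question.

```python
import re

# Hypothetical text: a 5-digit serial number next to a genuine year.
text = "Serial 90481, filed in 2001."

# Without boundaries, the first four digits of the serial are picked up.
unbounded = re.findall(r"\d{4}", text)

# With \b on both sides, the match must start and end at a word boundary,
# so no 4-digit run inside the 5-digit serial can qualify.
bounded = re.findall(r"\b\d{4}\b", text)

print(unbounded)  # ['9048', '2001']
print(bounded)    # ['2001']
```

The serial is rejected because there is no boundary between its fourth and fifth digits, which is where the match would have to end.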
