I have to process text that comes from student essays (texts can be VERY large).

I need a PHP preg_match pattern for dates inside those strings, which may appear in these forms:

...blah blah blah (1994) blah blah blah ... 
...blah blah blah (nov-1994) blah blah blah ... 
...blah blah blah (november-1994) blah blah blah ...
...blah blah blah (1994-nov) blah blah blah ...
...blah blah blah (1994-november) blah blah blah ...

The dates in the strings may be wrapped in either '( )' or '[ ]'.

I have done it this way:

if (preg_match('/\w{0,8}-?(19|20)\d{2}-?\w{0,8}/', $string, $s)) {
 # code
}

This works, but it also captures some unrelated strings, such as:

... blah blah blah (SKU_1956) blah blah blah ...
... blah blah blah [INFERNO2000] blah blah blah ...
... blah blah blah [like-2000-me] blah blah blah ...

I don't seem to be able to get it right, so I need help fine-tuning this regexp so that it only captures text that:

  • starts with either ( or [
  • may contain a single word which, if present, MUST end with -
  • MUST contain a year in the range 19xx-20xx
  • may contain a single word which, if present, MUST start with -
  • ends with either ) or ]

The word is limited to 8 chars because of long month names (like december).

A huge number of unrelated strings are being captured, which is why I want to fine-tune it.

  • Try something like [([]((\w{1,8})-((?:20|19)\d{2})|(?3)-(?2)|(?3))[])] Commented Mar 30, 2018 at 10:10
  • You will need to list the month names explicitly to achieve this, because \w or [a-z] cannot distinguish a word such as "like" from "nov". You also need to escape the "-" character. Commented Mar 30, 2018 at 10:12
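The second comment's suggestion can be sketched as follows. This is only an illustration, not code from the thread: the function name buildMonthYearPattern is hypothetical, and the month list assumes English names in full or three-letter form. One note: in PCRE a hyphen is only special inside a character class, so the sketch leaves it unescaped.

```php
<?php
// Hypothetical helper (not from the thread): builds a pattern that accepts
// only English month names, full or three-letter, next to a 19xx/20xx year.
function buildMonthYearPattern(): string
{
    $months = 'jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|'
            . 'aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?';
    $year = '(?:19|20)\d{2}';

    // Three alternatives between the brackets: month-year, year-month, bare year.
    return '/[(\[](?:(?:' . $months . ')-' . $year
         . '|' . $year . '-(?:' . $months . ')'
         . '|' . $year . ')[)\]]/i';
}

$pattern = buildMonthYearPattern();
var_dump(preg_match($pattern, '... blah (nov-1994) blah ...'));  // int(1)
var_dump(preg_match($pattern, '... blah (like-1994) blah ...')); // int(0)
```

Because the month names are spelled out, "september" (nine letters) is accepted even though it would not fit a {1,8} quantifier.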

2 Answers

You can use the RegEx [(\[](([a-zA-Z]{1,8}-)?(19|20)\d{2}|(19|20)\d{2}-[a-zA-Z]{1,8})[)\]]

  • [(\[] ... [)\]] matches an opening ( or [ and a closing ) or ]

  • ([a-zA-Z]{1,8}-)?(19|20)\d{2} matches month-YEAR, with the month being optional

    • ([a-zA-Z]{1,8}-)? optionally matches 1 to 8 letters followed by a -

    • (19|20)\d{2} matches a four-digit year starting with 19 or 20

  • (19|20)\d{2}-[a-zA-Z]{1,8} matches YEAR-month
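As a rough illustration of how the pattern above behaves in PHP, here is a minimal sketch. The / delimiters and the sample text are assumptions added for the example; the pattern itself is unchanged.

```php
<?php
// The answer's pattern wrapped in "/" delimiters. Group 1 is the outer group,
// i.e. everything between the opening and closing bracket.
$pattern = '/[(\[](([a-zA-Z]{1,8}-)?(19|20)\d{2}|(19|20)\d{2}-[a-zA-Z]{1,8})[)\]]/';

// Sample text (made up for this example) mixing wanted and unwanted strings.
$text = 'Intro (1994), then (nov-1994) and [1994-november]; '
      . 'but not (SKU_1956), [INFERNO2000] or [like-2000-me].';

preg_match_all($pattern, $text, $matches);

// $matches[1] lists the text between the brackets for every hit.
print_r($matches[1]); // 1994, nov-1994, 1994-november
```

Two limits are worth keeping in mind: any 1 to 8 letter word is accepted, so (like-1994) still matches, and (september-1994) does not match because "september" has nine letters.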

2 Comments

What about "like-1994"?
Yes, (like-1994) is also captured. By the way, there was a small typo: the group should capture all 4 year digits, so I think Zenoo's answer should be [([](([a-zA-Z]{1,8}-)?(19\d{2}|20\d{2})|(19\d{2}|20\d{2})\-[a-zA-Z]{1,8})[)\]] regex101.com/r/rGPPzh/1 Thumbs up for regex101.com, I didn't know that site! I think that closes the question; I will deal with the text (checking whether it is a month) in PHP.
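The follow-up check mentioned in that last comment (deciding in PHP whether the word is a month) might look roughly like this. It is a sketch under assumptions, not the asker's code: extractDates is a hypothetical name, the month list is English only, and the quantifier is widened to {1,9} so that "september" fits.

```php
<?php
// Hypothetical helper (not from the thread): match loosely with a regex, then
// keep only hits whose optional word is an English month name or abbreviation.
function extractDates(string $text): array
{
    $months = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
               'august', 'september', 'october', 'november', 'december'];
    // Accept the three-letter abbreviations as well.
    $abbreviations = array_map(function ($m) { return substr($m, 0, 3); }, $months);
    $valid = array_merge($months, $abbreviations);

    // Optional word before the year, optional word after it.
    $pattern = '/[(\[](?:([a-z]{1,9})-)?((?:19|20)\d{2})(?:-([a-z]{1,9}))?[)\]]/i';
    preg_match_all($pattern, $text, $hits, PREG_SET_ORDER);

    $dates = [];
    foreach ($hits as $hit) {
        $before = strtolower($hit[1] ?? '');
        $after  = strtolower($hit[3] ?? '');
        if ($before !== '' && $after !== '') {
            continue; // a word on both sides, e.g. [like-2000-me]
        }
        $word = $before !== '' ? $before : $after;
        if ($word === '' || in_array($word, $valid, true)) {
            $dates[] = substr($hit[0], 1, -1); // drop the surrounding brackets
        }
    }
    return $dates;
}

print_r(extractDates('(1994) (nov-1994) [1994-september] (like-1994) [like-2000-me]'));
```

Keeping the month check in PHP leaves the regex short, at the cost of a second pass over the matches.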

You could list all the valid date formats in an array:

$formats = ["M-Y", "Y", "F-Y", "Y-F", "Y-M"];

As the regex pattern, you could capture what is between the parentheses in group 1:

/\(([^)]+)\)/

and then loop over the formats to test whether a valid DateTime can be created from each match:

$strings = [
    "...blah blah blah (1994) blah blah blah ... ",
    "...blah blah blah (nov-1994) blah blah blah ... ",
    "...blah blah blah (november-1994) blah blah blah ...",
    "...blah blah blah (1994-nov) blah blah blah ...",
    "...blah blah blah (1994-november), (1994), (nov-1994) blah blah blah ...",
    "...blah blah blah (1994-november) blah blah blah ..."
];
$formats = ["M-Y", "Y", "F-Y", "Y-F", "Y-M"];
$pattern = '/\(([^)]+)\)/';
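foreach ($strings as $string) {
    // Collect every parenthesised chunk; group 1 holds the text between ( and ).
    preg_match_all($pattern, $string, $matches);
    foreach ($matches[1] as $match) {
        // Try each format in turn and stop at the first one that parses.
        foreach ($formats as $format) {
            if (DateTime::createFromFormat($format, $match) !== false) {
                echo "$string contains valid date: <b>$match</b>" . PHP_EOL;
                break;
            }
        }
    }
}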
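The question also asks for [ ] delimiters and for years limited to 19xx-20xx, which the loop above does not cover. A possible variant is sketched below. The function name findBracketedDates is hypothetical, and the explicit year check and the "!" prefix (which resets unspecified date fields) are additions rather than part of the answer.

```php
<?php
// Hypothetical helper: finds (...) or [...] chunks and keeps those that parse
// with one of the listed formats and whose year lies in 1900-2099.
function findBracketedDates(string $text): array
{
    $formats = ['M-Y', 'Y', 'F-Y', 'Y-F', 'Y-M'];
    // Group 1 is the text between an opening ( or [ and a closing ) or ].
    preg_match_all('/[(\[]([^)\]]+)[)\]]/', $text, $matches);

    $found = [];
    foreach ($matches[1] as $candidate) {
        foreach ($formats as $format) {
            // "!" resets unspecified fields instead of using the current date.
            $date = DateTime::createFromFormat('!' . $format, $candidate);
            if ($date === false) {
                continue; // this format does not fit, try the next one
            }
            $year = (int) $date->format('Y');
            if ($year >= 1900 && $year <= 2099) {
                $found[] = $candidate;
            }
            break; // the candidate parsed, no need to try other formats
        }
    }
    return $found;
}

print_r(findBracketedDates('(1994) [nov-1994] (1994-november) (like-1994) (1500)'));
```

Since DateTime does the month validation, no month list has to be maintained by hand, though every bracketed chunk in a large essay gets tried against each format.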
