Take the following PHP regular expression:

/^(what is|tell me) your name$/

I want to determine the total number of available words within the pattern. The correct answer would be 4 seeing as the following combinations are compatible:

what is your name => 4 words
tell me your name => 4 words

A simple count(explode(' ', '/^(what is|tell me) your name$/')) is not going to cut it, seeing as the explode function would return the following:

['/^(what', 'is|tell', 'me)', 'your', 'name$/']

...which defines 5 "words", when really, only 4 are available within the pattern.
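To make the mismatch concrete, here is a minimal sketch of that naive count. The variable names are only illustrative.

```php
<?php
// The pattern from the question, treated as a plain string.
$pattern = '/^(what is|tell me) your name$/';

// Splitting on spaces ignores the alternation, so both branches get counted.
$parts = explode(' ', $pattern);
$naiveCount = count($parts);

echo $naiveCount, PHP_EOL; // prints 5
```

The extra "word" comes from the second branch of the alternation: 'is|tell' and 'me)' are both counted even though only one branch can match at a time.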

Here's another example:

/^(my|the) name is (\w+)$/ => 4 words

Is there a function already available that I can utilise, or would I have to create a fairly technical one from scratch?

Kudos if anyone's willing to give it a shot.

  • Use sizeof(explode(" ", $str)) Commented Nov 8, 2016 at 13:05
  • Incorrect. That would split the first pattern into: ['/^(what', 'is|tell', 'me)', 'your', 'name$/'] - which is 5 "words". Commented Nov 8, 2016 at 13:11
  • I think there's no magic function that tells you how many words your written pattern could match. Probably you'll need to write your own. What about nested parentheses? Commented Nov 8, 2016 at 13:20
  • What should ^.+$ return? Commented Nov 8, 2016 at 13:21
  • Finally a question I haven't seen on SO yet. Wonderful. I think I'm going to give some of my points as a bounty for an answer. Commented Nov 8, 2016 at 14:14

2 Answers

This is extremely ugly, but maybe you can use some of the logic? It seems to work.

I basically split the string into 2 different strings. $first_string is the part between the parentheses (). I explode this string on | and count the whitespaces in the first alternative, +1.

For the second part of the string, $second_string, I simply strip out all non-alphanumeric chars and double whitespaces and count the words.

Finally I add $first_string + $second_string to get the final result.

One weakness of this is a string with (something | something else): I don't think my method of counting whitespaces can handle different numbers of words on each side of the |.
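A possible way around that weakness is to count every alternative separately and keep the smallest and largest count. This is only a sketch: it assumes the group holds plain words separated by spaces and |, with no nesting or quantifiers, and PHP 7.4+ for the arrow function. The helper name group_word_range is hypothetical.

```php
<?php
/**
 * Hypothetical helper: word-count range for one group such as "(tell me|hey what is)".
 * Assumes plain words only; nested groups and quantifiers are not handled.
 */
function group_word_range(string $group): array
{
    // Drop the surrounding parentheses, then split into alternatives.
    $alternatives = explode('|', trim($group, '()'));

    // Count the whitespace-separated words in each alternative.
    $counts = array_map(
        fn (string $alt): int => count(preg_split('/\s+/', trim($alt), -1, PREG_SPLIT_NO_EMPTY)),
        $alternatives
    );

    return ['min' => min($counts), 'max' => max($counts)];
}

print_r(group_word_range('(tell me|hey what is)')); // min is 2, max is 3
```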

<?php

    $string   = '/^(my|the) name is (\w+)$/';
    $pattern  = '/\(([^\)]+)\)/';   // text between ()
    $pattern2 = '/[^a-zA-Z0-9 $]/'; // everything except letters, digits, spaces and $

    // Take the first () group and split it into its alternatives.
    preg_match($pattern, $string, $first_string);
    $first_string = explode('|', $first_string[1]);

    // Remove every () group, then strip the remaining regex syntax.
    $new_string  = preg_replace($pattern, '', $string);
    $new_string2 = preg_replace($pattern2, '', $new_string);
    $new_string2 = removeWhiteSpace($new_string2);

    // Count words: spaces + 1 in the first alternative, plus the tokens left over.
    // A "$" left standing on its own (as here, once (\w+) is removed) counts as a token.
    $first_string  = substr_count($first_string[0], ' ') + 1;
    $second_string = count(explode(' ', $new_string2));

    // Drops tabs, newlines and NUL bytes, collapses repeated whitespace, and trims.
    function removeWhiteSpace($text)
    {
        $text = preg_replace('/[\t\n\r\0\x0B]/', '', $text);
        $text = preg_replace('/([\s])\1+/', ' ', $text);
        $text = trim($text);
        return $text;
    }

    echo $first_string + $second_string; // final result: 4

?>

Decided to give it a go myself and there are a ton of problems with this concept. Here's a couple:

/^(tell me|hey what is) your name$/

A correct answer would be both 4 and 5 words, so there is no single word count to return.

/^hey what (.+) up to$/

What happens in this instance? The parentheses could contain any number of potential words.
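Both problems are easy to confirm with preg_match. The sketch below uses test sentences I made up for illustration and records the word count of each one that matches.

```php
<?php
// Hypothetical test sentences of different lengths for the two patterns above.
$cases = [
    '/^(tell me|hey what is) your name$/' => ['tell me your name', 'hey what is your name'],
    '/^hey what (.+) up to$/'             => ['hey what are you up to', 'hey what on earth are you up to'],
];

$wordCounts = [];
foreach ($cases as $pattern => $sentences) {
    foreach ($sentences as $sentence) {
        // Only record sentences the pattern really matches.
        if (preg_match($pattern, $sentence) === 1) {
            $wordCounts[$pattern][] = str_word_count($sentence);
        }
    }
}

print_r($wordCounts);
```

The first pattern yields counts of 4 and 5, and the second yields 6 and 8 for these sentences. With (.+) the upper bound keeps growing as more words are added.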

So, all in all, the idea of a function to detect a definitive answer was, perhaps, pretty silly ^o^

Nevertheless, I gave it a shot and here's what I came up with. It's incompatible with (.+) and fairly untested. Unleash the horror...

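/**
 * Try to detect min/max amount of words in the given pattern.
 *
 * @param string $pattern
 * @param string $or_words_pattern matches a group of plain words and | alternations
 * @param string $unwanted_pattern matches the regex syntax to strip beforehand
 * @return array ['min' => int, 'max' => int]
 */
function regex_word_count(
    $pattern,
    $or_words_pattern = '/\((\w|\s|\|)+\)/',
    $unwanted_pattern = '/[^a-zA-Z0-9\|\(\)\s]/')
{
    $result = ['min' => 0, 'max' => 0];

    // Treat \s as a literal space, then strip the remaining regex syntax.
    // An empty string is used as the replacement; passing null is deprecated since PHP 8.1.
    $pattern = str_replace('\s', ' ', $pattern);
    $pattern = preg_replace($unwanted_pattern, '', $pattern);

    if (preg_match_all($or_words_pattern, $pattern, $ors)) {
        // $ors[0] holds the full text of every parenthesised group.
        foreach ($ors[0] as $match) {
            // Count the words in each alternative of this group. Counting per
            // occurrence keeps two identical groups from being merged into one.
            $counts = [];

            foreach (explode('|', $match) as $string) {
                $counts[] = count(explode(' ', $string));
            }

            $result['min'] += min($counts);
            $result['max'] += max($counts);
        }

        // Remove the groups and collapse the leftover whitespace.
        $pattern = trim(preg_replace($or_words_pattern, '', $pattern));
        $pattern = preg_replace('/\s+/', ' ', $pattern);
    }

    // Whatever remains outside the groups counts towards both min and max.
    if (!empty($pattern)) {
        $count = count(explode(' ', $pattern));
        $result['min'] += $count;
        $result['max'] += $count;
    }

    return $result;
}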

Example:

$x = regex_word_count('/^(a{3}) ([abc]) (what is the|tell me) your (name|alias dude)$/');

die(var_dump($x));

// array(2) {
//   'min' =>
//   int(6)
//   'max' =>
//   int(8)
// }

It was a fun exercise of trying to do something, well, impossible.
