I have two strings. One of them contains <em> tags, is entirely lowercase, and doesn't contain delimiters or common words like 'the', 'in', etc.; the other has none of these restrictions. An example:

$str1 = 'world <em>round</em>';
$str2 = 'World - is Round';

I want to turn $str2 into 'World - is <em>Round</em>' by checking which lowercase word in $str1 carries the <em> tag. So far I've done the following, but it fails if the number of words differs between the two strings.

public static function applyHighlighingOnDisplayName($str1, $str2) {
    $str1_w = explode(' ', $str1);
    $str2_w = explode(' ', $str2);
    for ($i=0; $i<count($str1_w); $i++) {
       if (strpos($str1_w[$i], '<em>') !== false) {
            $str2_w[$i] = '<em>' . $str2_w[$i] . '</em>';
       }
    }
    return implode(' ', $str2_w);
}

$str1 = '<em>cup</em> <em>cakes</em>' & $str2 = 'Cup Cakes':

applyHighlighingOnDisplayName($str1, $str2) : '<em>Cup</em> <em>Cakes</em>': Correct

$str1 = 'cup <em>cakes</em>' & $str2 = 'The Cup Cakes':

applyHighlighingOnDisplayName($str1, $str2) : 'The <em>Cup</em> Cakes': Incorrect

How should I change my approach?

  • Use a regular expression and preg_replace. Commented Oct 29, 2014 at 14:24
  • Can you fix the formatting on your question so it's clearer what is code and what isn't? Also, will both words ALWAYS be in $str2 -- i.e. do you need to check that the non-<em> word is present? Commented Oct 29, 2014 at 14:25
  • You could try using a regular expression to find the word that is wrapped by <em></em>. Commented Oct 29, 2014 at 14:26

3 Answers

Like others said, regex is the solution. Here is a working example with detailed comments:

$string1 = 'world <em>round</em>';
$string2 = 'World is - Round';

// extract what's in between <em> and </em> - it will be stored in $matches[1]
preg_match('/<em>(.+)<\/em>/i', $string1, $matches);

if (!$matches) {
    echo 'The first string does not contain <em>';
    exit();
}

// replace what we found in the previous operation
$newString = preg_replace('/\b' . preg_quote($matches[1], '/') . '\b/i', '<em>$0</em>', $string2);
echo $newString;

Later edit - cover multiple cases:

$string1 = 'world <em>round</em> not <em>flat</em>';
$string2 = 'World is - Round not Flat! Round, ok?';

// extract what's in between <em> and </em> - it will be stored in $matches[1]
preg_match_all('/<em>(.+?)<\/em>/i', $string1, $matches);

if (empty($matches[1])) {
    echo 'The first string does not contain <em>';
    exit();
}

foreach ($matches[1] as $match) {
    // replace what we found in the previous operation
    $string2 = preg_replace('/\b' . preg_quote($match, '/') . '\b/i', '<em>$0</em>', $string2);
}

echo $string2;
3 Comments

@motanelu: What if the string contains multiple instances of <em> tags? E.g. $str1 = 'The Cup Cakes', $str2 = '<em>cup</em> <em>cakes</em>'?
It will work, since the replacement is done in a case-insensitive manner (see the requirement in the question).
@motanelu: Can I also apply <em> in cases where there are special characters like '? E.g. <em>users</em> and User's?

Your current method is dependent on the number of words in the strings; a better solution would be to use regular expressions to do the matching for you. The following version will work safely even if you have emphasized words that are substrings of other emphasized words (e.g. "cat" and "cat's cradle" or "cat-litter").

function applyHighlighingOnDisplayName($str1, $str2) {

    # if we have strings surrounded by <em> tags...
    if (preg_match_all("#<em>(.+?)</em>#", $str1, $match)) {

        ## sort the match strings by length, descending
        usort($match[1], function($a,$b){ return strlen($b) - strlen($a); } );

        # all the match words are in $match[1]
        foreach ($match[1] as $m) {
            # replace every match with a string that is very unlikely to occur
            # this prevents \b matching the start or end of <em> and </em>
            $str2 = preg_replace("#\b(" . preg_quote($m, "#") . ")\b#i",
                "ZZZZ$1ZZZZ",
                $str2);
        }
        # replace ZZZZ with the <em> tags
        return preg_replace("#ZZZZ(.*?)ZZZZ#", "<em>$1</em>", $str2);
    }
    return $str2;
}

$str1 = 'cup <em>cakes</em>';
$str2 = 'Cup Cakes';
$str3 = 'The Cup Cakes';

print applyHighlighingOnDisplayName($str1, $str2) . PHP_EOL;
print applyHighlighingOnDisplayName($str1, $str3) . PHP_EOL;

Output:

Cup <em>Cakes</em>
The Cup <em>Cakes</em>

Two strings with no <em>'d words:

$str1 = 'cup cakes';
$str2 = 'Cup Cakes';

print applyHighlighingOnDisplayName($str1, $str2) . PHP_EOL;

Output:

Cup Cakes

Now something rather trickier: lots of short words where one word is a substring of all the other words:

$str1 = '<em>i</em> <em>if</em> <em>in</em> <em>i\'ve</em> <em>is</em> <em>it</em>';

$str2 = 'I want to make the str2 as "World - is Round", by comparing which lowercase word in the str1 contains the em tag. So far, I\'ve done the following, but it fails if number of words aren\'t equal in both strings.';

Output:

<em>I</em> want to make the str2 as "World - <em>is</em> Round", by comparing which lowercase word <em>in</em> the str1 contains the em tag. So far, <em>I've</em> done the following, but <em>it</em> fails <em>if</em> number of words aren't equal <em>in</em> both strings.

3 Comments

What if the string contains multiple instances of <em> tags? E.g. $str1 = 'The Cup Cakes', $str2 = '<em>cup</em> <em>cakes</em>'?
@i alarmed alien: Can I also apply <em> in cases where there are special characters like '? E.g. <em>users</em> and User's?
@user188995 As you can see in the example I posted, you can run it on strings with ' in them, but it assumes that you're not removing characters like ' that are grammatically important. users and user's have very different meanings--I'm not sure it's wise to strip these words of their semantics.

It's because your highlighting code is expecting a 1:1 correspondence between word positions in the two strings:

cup <em>cakes</em>
 1        2
Cup     Cakes

but on your incorrect sample:

cup <em>cakes</em>
 1        2            3
The      Cup         Cakes

i.e. you find <em> at word #2, so you highlight word #2 in the other string, but in that string word #2 is Cup.

A better algorithm would be to strip the HTML from your original string, so you end up with just cup cakes. Then you look for cup cakes in the other string and highlight the second word at that location. That compensates for any "motion" within the string caused by extra (or fewer) words.
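
The strip-and-locate idea described above could be sketched roughly as follows. This is only an illustration, not tested beyond the cases shown: the function name highlightByPhrase is made up for the example, and it assumes each emphasised word in $str1 has its own <em>...</em> pair and that the stripped phrase appears in $str2 as consecutive words. If the phrase is not found that way (as with 'world <em>round</em>' against 'World - is Round'), $str2 is returned unchanged.

```php
<?php
// Hypothetical helper: strip the tags from $str1, find the resulting phrase
// in $str2 (case-insensitively), and wrap the words that were flagged.
function highlightByPhrase(string $str1, string $str2): string
{
    $plain = []; // words of $str1 without tags
    $flags = []; // true where the word carried an <em> tag
    foreach (preg_split('/\s+/', trim($str1), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $flags[] = strpos($word, '<em>') !== false;
        $plain[] = preg_quote(strip_tags($word), '/');
    }
    if (!$plain) {
        return $str2; // nothing to look for
    }

    // Match the plain words in order, separated by any whitespace.
    $pattern = '/\b' . implode('\s+', $plain) . '\b/i';

    return preg_replace_callback($pattern, function ($m) use ($flags) {
        // Split the matched text but keep the separators:
        // even indexes are words, odd indexes are whitespace.
        $parts = preg_split('/(\s+)/', $m[0], -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $i => $part) {
            if ($i % 2 === 0 && $flags[intdiv($i, 2)]) {
                $parts[$i] = '<em>' . $part . '</em>';
            }
        }
        return implode('', $parts);
    }, $str2, 1); // only the first occurrence of the phrase
}

print highlightByPhrase('cup <em>cakes</em>', 'The Cup Cakes') . PHP_EOL;
```

Because the position is taken from where the phrase actually occurs in $str2, extra leading words such as 'The' no longer shift the highlight. The trade-off is that any filler between the phrase words in $str2 (delimiters, 'is', and so on) defeats the match, in which case the per-word regex approach from the other answers is the better fit.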
