I'm facing a very strange error with a regex in PHP. My pattern is /\[B\]\[SIZE=3\](Trama|Recensione:|Curiosità|Trama:)\[\/SIZE\]\[\/B\](.*?)\[B\]\[SIZE=3\]/is

It works with "Trama", "Recensione:", and "Trama:", but not with "Curiosità" in my script. The strange thing is that if I type this pattern into an online regex tester, it matches everything correctly. What am I doing wrong?

My script:

$query = $db->query("SELECT `t`.`threadid`, `t`.`title`, `t`.`firstpostid`, `t`.`dateline`, `f`.`parentid` FROM {$db->tabelle['topic']} AS t, {$db->tabelle['forum']} AS f WHERE `f`.`forumid` = `t`.`forumid` AND `f`.`parentid` = ". (SEZIONE_RECENSIONI) ." AND `visible` = 1 ORDER BY `dateline` DESC LIMIT 10");
while($thread = $db->fetch_array($query))
{
    $post = $db->fetch_array($db->query("SELECT `pagetext`, `userid` FROM {$db->tabelle['post']} WHERE `postid` = {$thread['firstpostid']}"));

    $pattern = "/\[cover\](.*?)\[\/cover\]/is";
    preg_match($pattern, $post['pagetext'], $cover);

    $pattern = '/\[B\]\[SIZE=3\](Trama|Recensione:|Curiosità|Trama:)\[\/SIZE\]\[\/B\](.*?)\[B\]\[SIZE=3\]/isU';
    preg_match($pattern, $post['pagetext'], $trama);
    $content = remove_bbcode($parser->parse(truncate(utf8_encode($trama[2]), 350, '...', false, true)));
    $page .= "<li>
    <div class=\"recensione\" style=\"background: url(".$cover[1].") no-repeat; background-size: cover; background-position: 20% center; \">
        <p class=\"recensione_titolo\"><a href=\"?rec={$thread['threadid']}\">{$thread['title']}</a></p>
        <p class=\"recensione_content\">{$content} <a href=\"?rec={$thread['threadid']}\"><em>Continua a leggere</em></a></p>
    </div>
</li>";
}
  • Try adding the U flag to make it /isU. Commented Sep 6, 2014 at 11:43

1 Answer

It may be a UTF-8 problem. You can tell the regex engine that the target string must be read as a UTF-8 string. To do that, add (*UTF8) at the beginning of the pattern or use the u modifier:

$pattern = '~(*UTF8)\[B]\[SIZE=3](Trama:?|Recensione:|Curiosità)\[/SIZE]\[/B](.*?)\[B]\[SIZE=3]~s';

or

$pattern = '~\[B]\[SIZE=3](Trama:?|Recensione:|Curiosità)\[/SIZE]\[/B](.*?)\[B]\[SIZE=3]~su';
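A minimal sketch of what an encoding mismatch looks like in practice. It assumes the script file is saved as UTF-8 while the text being searched is ISO-8859-1; the sample strings and variable names are hypothetical, and the bytes of "à" are spelled out with escapes so the result does not depend on how this snippet itself is saved.

```php
<?php
// "à" is two bytes in UTF-8 (C3 A0) but a single byte in ISO-8859-1 (E0).
$utf8Word   = "Curiosit\xC3\xA0";
$latin1Word = "Curiosit\xE0";

// Same pattern as above, built with the UTF-8 spelling of the word.
$core     = '\[B]\[SIZE=3](Trama:?|Recensione:|' . $utf8Word . ')\[/SIZE]\[/B](.*?)\[B]\[SIZE=3]';
$withU    = '~' . $core . '~su';
$withoutU = '~' . $core . '~s';

// Hypothetical sample posts, one per encoding.
$utf8Text   = "[B][SIZE=3]{$utf8Word}[/SIZE][/B] Some text [B][SIZE=3]Next[/SIZE][/B]";
$latin1Text = "[B][SIZE=3]{$latin1Word}[/SIZE][/B] Some text [B][SIZE=3]Next[/SIZE][/B]";

// UTF-8 subject: the pattern matches.
$matchedUtf8 = preg_match($withU, $utf8Text, $m);

// ISO-8859-1 subject, no u modifier: bytes C3 A0 never occur, so there is no match.
$matchedLatin1 = preg_match($withoutU, $latin1Text);

// ISO-8859-1 subject with the u modifier: the subject is not valid UTF-8,
// so preg_match() fails and preg_last_error() reports PREG_BAD_UTF8_ERROR.
$failedLatin1 = preg_match($withU, $latin1Text);
$lastError    = preg_last_error();
```

Checking preg_last_error() after a failed match is a quick way to tell "no match" apart from "subject is not valid UTF-8".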

Note: to avoid a lot of backslashes in your expression and make it more readable:

  • you can change the pattern delimiter (then slashes no longer need escaping)
  • a literal closing bracket doesn't need to be escaped
  • you can use \Q and \E to quote a literal substring
  • you can use the free-spacing mode x

Example:

$pattern = '~
    \Q[B][SIZE=3]\E
    (Trama:?|Recensione:|Curiosità)
    \Q[/SIZE][/B]\E   (.*?)  \Q[B][SIZE=3]\E ~xus';
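To see the free-spacing form in use, here is a sketch applied to a made-up post. It assumes the file is saved as UTF-8 and that a section ends where the next [B][SIZE=3] header begins, as in the question's pattern; the sample text and the variable names are hypothetical.

```php
<?php
// This file is assumed to be saved as UTF-8, so the accented letter in the
// pattern and in the sample text use the same bytes.
$pattern = '~
    \Q[B][SIZE=3]\E
    (Trama:?|Recensione:|Curiosità)   # section title
    \Q[/SIZE][/B]\E
    (.*?)                             # section body, matched lazily
    \Q[B][SIZE=3]\E                   # start of the next section
~xus';

// Hypothetical post text with two sections.
$sample = "[B][SIZE=3]Curiosità[/SIZE][/B]\nFilmed in Rome.\n[B][SIZE=3]Trama[/SIZE][/B]";

$title = null;
$body  = null;
if (preg_match($pattern, $sample, $match) === 1) {
    $title = $match[1];
    $body  = trim($match[2]);
}
```

In x mode, whitespace in the pattern is ignored and # starts a comment that runs to the end of the line, while text between \Q and \E is still taken literally.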

3 Comments

Ok, it's a UTF8 problem, but if I use the u modifier it completely stops working.
@DavideR: try to determine the encoding of the original text, and convert it to UTF-8. (In particular, take a look at the default encoding in your code editor.)
Ok, I think I'm going to replace à, è, ì and any other special characters with a, e, i and so on. Thank you for your suggestions.
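As an alternative to stripping accents, the suggestion to detect the encoding and convert to UTF-8 could be sketched as follows. This assumes the mbstring extension is available and that any text that is not valid UTF-8 is ISO-8859-1 (the question's utf8_encode call hints at this, but it is only a guess); to_utf8 is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: returns $text unchanged if it is already valid UTF-8,
// otherwise converts it on the ASSUMPTION that it is ISO-8859-1.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// Made-up stored post in ISO-8859-1: "à" is the single byte E0.
$stored = "[B][SIZE=3]Curiosit\xE0[/SIZE][/B]Did you know?[B][SIZE=3]Trama[/SIZE][/B]";

// Pattern with the UTF-8 bytes of "à" written explicitly, plus the u modifier.
$pattern = '~\[B]\[SIZE=3](Trama:?|Recensione:|' . "Curiosit\xC3\xA0" . ')\[/SIZE]\[/B](.*?)\[B]\[SIZE=3]~su';

// Convert first, then match: subject and pattern now share one encoding.
$text    = to_utf8($stored);
$matched = preg_match($pattern, $text, $trama);
$section = $matched === 1 ? $trama[2] : null;
```

Because the subject is converted before matching, the captured text is already UTF-8 and does not need to be re-encoded afterwards.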
