I have a really strange problem that I have spent many hours on without any success. I have a contenteditable area on my website where users can select emoticons, which they then see instantly in the text they are writing. For messages from user to user I do not care about the length of the text, but for comments I do! I need to count all the characters of the string.

Now I have the problem that emoticons are transmitted like this:

<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon emoticon-class-name-for-example-happy">

Of course I want to count only 1 character for each emoticon, so I wrote a regex and tried to replace every emoticon with a '1'. After that, I thought, it would be easy to get the number of used characters with just strlen. But this works only in theory, and I do not understand why.

So my regex is:

<img[ ]src=["'].+?["'][ ]class=["']emoticon[ ].+?["'][>]

Next I tested my regex with the help of phpliveregex.com. You can see the result here; just click on the preg_replace tab.

Now I was pretty sure that this had to work for me, so I tried it. I wrote a function in PHP:

private function countCharactersOfSpecialUserInput($userInput) {
    $wholeCharacters = 0;
    $input_lines = 'This is a test
                    for<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">my
                    <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">regex 
                    which<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">should
                    be alright <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Not-Talking">and<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Not-Talking">
                    match all this emoticons except things like <img dsopjfdojp
                    <img oew> because this ones are not real emoticons! The following is a real one: <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">
                    ';      
    return preg_replace("/<img[ ]src=[\"'].+?[\"'][ ]class=[\"']emoticon[ ].+?[\"'][>]/", "1", $input_lines);
}

My function does not count the characters yet because there is a bug which I do not understand. It will sound impossible, but it is real :-(.

If I use the string which is saved in the variable $input_lines, it works well. But if I use the text which a user transmits, it does not work!

I used var_dump as well as print_r to inspect the data transmitted by the user. Afterwards I copied exactly this string into the $input_lines variable. And the unbelievable fact is that with the $input_lines variable it works again. No matter what I do, my code does not replace a single emoticon when the text is transmitted dynamically by the user.

Can you imagine anything that could cause this problem? I am clueless and I cannot believe that this is real. It has to work; I tried so many other things, but nothing worked for me.
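
The counting step described above (replace each emoticon tag with one placeholder character, then measure the length) can be sketched roughly as follows. This is only a minimal sketch: the function name countCommentCharacters is hypothetical, the pattern is the one from the question with a different delimiter, and mb_strlen is assumed to be available so that multibyte characters count once.

```php
<?php
// Hypothetical helper: counts every emoticon <img> tag as one character.
function countCommentCharacters(string $userInput): int
{
    // The pattern from the question, with "~" as delimiter.
    $pattern = '~<img[ ]src=["\'].+?["\'][ ]class=["\']emoticon[ ].+?["\'][>]~';

    // Collapse each emoticon tag into a single placeholder character.
    $collapsed = preg_replace($pattern, '1', $userInput);
    if ($collapsed === null) {
        throw new RuntimeException('preg_replace failed with error code ' . preg_last_error());
    }

    // Count characters, not bytes.
    return mb_strlen($collapsed, 'UTF-8');
}

$sample = 'Hi<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">!';
echo countCommentCharacters($sample), PHP_EOL;
```

Note that the pattern only matches a tag written exactly as in the question (src before class, single spaces, unescaped angle brackets and quotes); input that differs in any of these respects passes through unchanged and is counted at full length.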

  • Aren't you better off strlening the original source data (containing the emoticon's code), instead of the rendered data (containing the img elements)? Commented May 30, 2015 at 12:42
  • I do not know whether you understand my problem... If I just use strlen, then I get about 80 or 90 characters for a single emoticon, but the user used only 1 emoticon, which should be counted as 1 used character! Commented May 30, 2015 at 12:44
  • @hek2mgl If no one can help me I will have to rethink, and then I will have a look at the DOM feature of PHP, but I would really prefer to solve this with just a regex... It has to work but it doesn't. I would be really grateful for any advice. Maybe you could describe a solution using DOM, @hek2mgl? Commented May 30, 2015 at 12:51
  • @user3714751 I have added an answer using DOMDocument. Commented May 30, 2015 at 14:43

3 Answers

The text with the images is actually an HTML snippet, so I would use DOM to parse it:

$input_lines = 'This is a test for<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">my <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">regex which<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">should be alright <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Not-Talking">and<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Not-Talking"> match all this emoticons except things like <img dsopjfdojp <img oew> because this ones are not real emoticons! The following is a real one: <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">';

$doc = new DOMDocument();

// Suppress warnings
@$doc->loadHTML($input_lines);

$imgs = $doc->getElementsByTagName("img");
$number_of_imgs = $imgs->length;
echo "Found $number_of_imgs images" . PHP_EOL;

// The plain text is actually the nodeValue of
// the whole snippet.
$text = $imgs->item(0)->parentNode->nodeValue;
$len = mb_strlen($text);

echo "Text length: $len + $number_of_imgs(images)" . PHP_EOL;

See it working: http://3v4l.org/MH5T6
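
A variation on the same idea that counts only images carrying the emoticon class could look roughly like this. It is a sketch under assumptions, not a drop-in solution: the function name countTextAndEmoticons is hypothetical, the input is assumed to be UTF-8, and any img without the emoticon class is simply not counted.

```php
<?php
// Hypothetical helper: text length plus one character per emoticon image.
function countTextAndEmoticons(string $html): int
{
    $doc = new DOMDocument();

    // Turn non-ASCII characters into numeric entities so the parser
    // does not have to guess the input encoding (assumes UTF-8 input).
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return 0;
    }

    // Select only <img> elements whose class list contains "emoticon".
    $xpath = new DOMXPath($doc);
    $emoticons = $xpath->query(
        './/img[contains(concat(" ", normalize-space(@class), " "), " emoticon ")]',
        $body
    );

    return mb_strlen($body->textContent, 'UTF-8') + $emoticons->length;
}

$snippet = 'Hi <img src="a" class="emoticon Girl"> there<img src="b" class="emoticon Not-Talking">!<img src="c" class="avatar">';
echo countTextAndEmoticons($snippet), PHP_EOL;
```

Using libxml_use_internal_errors instead of the @ operator keeps other errors visible while still silencing the parser's complaints about fragment input.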

9 Comments

This won't work perfectly because, for example, <img oew> is recognized as an image but it isn't a real emoticon. I do not see a solution with DOMDocument that works as well as a good regex. Or am I wrong?
What does <img dsopjfdojp <img oew> mean? I'm confused. Basically I would suggest sending the plain text and replacing emoticons on the client side (or right before you output the content with PHP). The database should not contain the image tags.
Okay, I think you raise the point about the database because of SQL injection, but you should know that the user input runs through a special function which replaces every emoticon with a unique phrase that is not harmed while the message is secured against SQL injection; once everything is secure, the unique phrases are parsed back into the image tags. I mentioned <img dsopjfdojp <img oew> because a user could use this to write more text than allowed: it is only text and will not be parsed as an emoticon, but with your approach it would be counted as one character. Do you understand my problem?
Yes, got that. Note that I didn't talk about SQL injection, I am talking about functionality. The replacement of emoticons should happen at the view level, not in the database. Do it like this and all your problems are gone. Is this advantage enough? Take Stack Overflow for example: they save the markup in the database, not the rendered HTML.
I thought about that, as I did it that way for all other areas of my site, but my solution has the advantage that I do not need keywords like 8) or :-) for each smiley. Also, with contenteditable the user already sees the smiley exactly as it will look when the text is posted. I did a lot of work to secure this method. If I changed this now, all that time would be wasted. The funny part is that the only problem I have encountered is that I am not able to count the characters right now. But thank you for your opinion, I will think about it.

It would be prudent for you to store emoticons in the database as text. For example a happy face can be stored as :) or =) and only use up 2 characters in your database.

Then on output do the OPPOSITE of what you are doing here and use preg_replace to replace all instances of :) or =) etc.. with the relevant <img src=...

This is almost the standard used in all web applications. It will allow you to dynamically change what emoticons you are using later. For example, if you change your template and want the emoticons to change as well, you change your emoticon function and all previous occurrences in the database will also change.

This would not only assist you with the counting of characters but also with the future management and cleanliness of your database.

<?php
    $input = 'Hello There! :) How are you today?';
    $happy = '<img src="img/smile.gif" border="0" />';

    $output = preg_replace("(\:\))", $happy, $input);

    echo $output;
?>

Obviously you could go so far as to adapt this into using a database to manage your smilies and using an array to run preg_replace. The sky becomes the limit.
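
As a rough illustration of the array-driven variant, the sketch below keeps the codes in a lookup table, renders them on output, and counts each code as one character. The table contents, file paths, and both function names are made up for the example, and strtr is used here in place of preg_replace because the codes are fixed strings.

```php
<?php
// Hypothetical emoticon table: stored text code => markup used on output.
$emoticons = [
    ':)' => '<img src="img/smile.gif" alt=":)">',
    ':(' => '<img src="img/sad.gif" alt=":(">',
];

// Render stored text for display. strtr replaces all codes in one pass
// and never re-scans text it has already inserted (such as the alt values).
// Escaping of the user text itself is left out of this sketch.
function renderEmoticons(string $text, array $emoticons): string
{
    return strtr($text, $emoticons);
}

// Count characters, treating every emoticon code as a single character.
function countWithEmoticons(string $text, array $emoticons): int
{
    $placeholders = array_fill_keys(array_keys($emoticons), '1');
    return mb_strlen(strtr($text, $placeholders), 'UTF-8');
}

echo renderEmoticons('Hello There! :) How are you today?', $emoticons), PHP_EOL;
echo countWithEmoticons('Hello There! :) How are you today?', $emoticons), PHP_EOL;
```

In a real application the table would presumably come from the database mentioned above, so changing an image only means changing one row.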

4 Comments

You know that ':)' isn't valid regex, right? You don't even have delimiters
Not even sure why I posted it with that, I forgot to escape it correctly. Apologies, and thank you for pointing it out. Big bad oops.
Whether I replace :-) or an image tag, which is also only text that I can match with a regex, does not matter in my opinion. So this does not help me with my problem.
One way works and is simple and scalable; the other is currently not working and is complex. You can also subtract from the length for each occurrence if you use a while or foreach loop to make the match. Simply take the length at the start, run the match algorithm, and for each occurrence of :) subtract 1 from the length. Your data becomes a lot more manageable and this should be easy for you to implement.

Why are you using var_dump and print_r to get data from the user? Those functions echo inputs to standard out, they don't actually return strings. Take a look:

php > $num_finds = preg_replace("/<img[ ]src=[\"'].+?[\"'][ ]class=[\"']emoticon[ ].+?[\"'][>]/", "1", $lines);
php > echo($num_finds);
1my1regex which1should be alright 1and1 match all this emoticons except things like <img dsopjfdojp <img oew> because this ones are not real emoticons! The following is a real one: 1

That works fine. If, however, you try to use var_dump, you get this:

php > $dump_num_finds = preg_replace("/<img[ ]src=[\"'].+?[\"'][ ]class=[\"']emoticon[ ].+?[\"'][>]/", "1", var_dump($lines));
string(718) "<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">my<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">regex which<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">should be alright <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Not-Talking">and<img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Not-Talking"> match all this emoticons except things like <img dsopjfdojp <img oew> because this ones are not real emoticons! The following is a real one: <img src="data:image/gif;base64,R0lGODlhAQABAAAAACwAAAAAAQABAAA=" class="emoticon Girl">"
php > echo $dump_num_finds;

Again, the reason is that var_dump doesn't return anything. Unless you capture the text echoed to standard out with something like ob_start() and ob_get_clean() (which in my opinion is a poor solution), your approach will not work. You can also pass true as the second parameter to print_r to make it return its output, but I'm having trouble seeing why you'd be using either of these functions in the first place.
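
A small sketch of the difference, with a made-up $lines value: var_dump prints and returns null, while print_r and var_export return a string when their second argument is true. The output buffer is used here only to keep the dump from being printed.

```php
<?php
$lines = 'some <b>user</b> input';

// var_dump prints its argument and returns null, so $fromVarDump
// carries nothing that preg_replace could work on.
ob_start();
$fromVarDump = var_dump($lines);
$printed = ob_get_clean();   // the dump text ended up in the output buffer

// With true as the second argument these functions return a string instead.
$fromPrintR = print_r($lines, true);       // the string itself
$fromVarExport = var_export($lines, true); // the string as a quoted PHP literal

echo $fromPrintR, PHP_EOL;
echo $fromVarExport, PHP_EOL;
```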

P.S. As a side note, in my opinion your regex is a bit sloppy. You could use \s to signify a whitespace character instead of [ ] (note that \s also matches tabs and newlines, not only spaces). You could also just use a literal space without the brackets and it would do the same thing as [ ]. Also, you don't need the brackets around the last >:

<img\ssrc=["'].+?["']\sclass=["']emoticon\s.+?["']>
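
To make the behavioral difference between the two spellings concrete, here is a short sketch using shortened, made-up patterns and inputs: [ ] accepts only a space character, while \s also accepts a tab or a newline.

```php
<?php
// Shortened patterns for illustration only.
$spaceOnly  = '~<img[ ]src=~';   // exactly one space character
$whitespace = '~<img\ssrc=~';    // any single whitespace character

$withSpace = '<img src="a.gif">';
$withTab   = "<img\tsrc=\"a.gif\">";

// preg_match returns 1 on a match and 0 otherwise.
var_dump(preg_match($spaceOnly, $withSpace));
var_dump(preg_match($spaceOnly, $withTab));
var_dump(preg_match($whitespace, $withSpace));
var_dump(preg_match($whitespace, $withTab));
```

Which one is preferable depends on whether tags separated by tabs or line breaks should count as emoticons at all.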

1 Comment

Sorry, but you misunderstood something. I do not use var_dump or print_r to get data from the user; I use these two functions to debug and test my function's return value during development. echo does not show me the data type and so on. Your hint for improving my regex is good, thanks for that, but is there really a noticeable difference (performance or something like that) between \s and [ ] for whitespace?
