2

I'm new to PHP.

I have an array like this:

$suspiciousList = array(
    array("word" => "badword1", "score" => 400, "type" => 1),
    array("word" => "badword2", "score" => 250, "type" => 1),
    array("word" => "badword3", "score" => 400, "type" => 1),
    array("word" => "badword4", "score" => 400, "type" => 1)
);

I have problems when users type words with spaces in them, like (badw ord1, b adword2, etc.), or even (b a d w o r d 1).

How can I detect or search for combinations from the array (dictionary)?

My idea is to split the input on spaces so that every word becomes an array element.

$this->suspiciousPart[] = $word;

I wrote the following function:

public function deepDetect2() {
    for($i=0;$i<sizeof($this->suspiciousPart);$i++) {
        $word = "";
        for($j=$i;$j<sizeof($this->suspiciousPart);$j++) {
            $word .= $this->suspiciousPart[$j];
            //var_dump($word);
            if(strpos(in_array($word, $this->suspiciousList), $word) !== false) {
                if($this->detect($word) == true) {
                    $i++;
                } else {
                    $j++;   
                }
            } else {
                $i++;
            }
        }
    }
}

Does anybody have other ideas on how to do this?

Thanks

6
  • just an idea - change your keywords! this is also a kids' site (if they are coders), so you can't use these words. Commented Jun 19, 2011 at 9:16
  • gotta love the array values :D Commented Jun 19, 2011 at 9:19
  • agreed (to the first poster), are you 12 or something and find these words cool? Change them. Commented Jun 19, 2011 at 9:20
  • Hi, thanks for the idea. Actually, I have more profanity keywords. Commented Jun 19, 2011 at 9:24
  • A clbuttic mistake in the making ... Commented Jun 19, 2011 at 9:24

5 Answers

2
  1. Strip spaces
  2. Search with ONE regular expression containing all your keywords, like this: (word1|word2|word3)
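The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `findBadWords` is hypothetical, the dictionary is assumed to be non-empty and shaped like `$suspiciousList` from the question, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
function findBadWords(string $text, array $suspiciousList): array
{
    if (count($suspiciousList) === 0) {
        return array(); // an empty alternation would match everywhere
    }

    // Step 1: lowercase and strip all whitespace,
    // so "b a d w o r d 1" becomes "badword1".
    $stripped = preg_replace('/\s+/u', '', mb_strtolower($text, 'UTF-8'));

    // Step 2: build ONE alternation, escaping each keyword for regex use.
    $alternatives = array_map(
        function ($entry) {
            return preg_quote($entry['word'], '/');
        },
        $suspiciousList
    );
    $pattern = '/(' . implode('|', $alternatives) . ')/u';

    // Collect every dictionary word found in the stripped text.
    preg_match_all($pattern, $stripped, $matches);
    return $matches[1];
}

$suspiciousList = array(
    array("word" => "badword1", "score" => 400, "type" => 1),
    array("word" => "badword2", "score" => 250, "type" => 1),
);

print_r(findBadWords("hello b a d w o r d 1 and badw ord2", $suspiciousList));
```

Because the spaces are removed before matching, this also reports words that only appear once neighbouring words are glued together, so the hits are probably better treated as candidates than as verdicts.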

2 Comments

Re 2: That is certainly one way to do it, but the Aho-Corasick algorithm is better suited to the task if the number of forbidden words is high.
Re Aho-Corasick: you are right, this is the best algorithm for that. But (1) default regular expression matching should be just fine for most cases, and (2) in theory a regular expression matcher can use Aho-Corasick internally (the default one doesn't, as far as I know, but "fgrep", for example, uses Aho-Corasick).
2

This question is a good start: How do you implement a good profanity filter? I agree with its conclusion, i.e. that detection will always have poor results.

I would try these approaches:

1) Simply detect words that are vulgar according to your dictionary.

2) Come up with a few heuristics like "continuous sequence of 'words' composed of one letter" (b a d w o r d) and use them to evaluate users' posts. Then you can compute the expected number of vulgar words: \sum_{i=1}^{H} P_i N_i, where H is the number of your heuristics, P_i is the probability that a word found with heuristic i is really a vulgar one, and N_i is the number of words found by heuristic i. I think the probabilistic approach is better than simply stating "this post does (not) contain a vulgar word".

3) Let a moderator decide whether a post is really vulgar or not. Otherwise the imperfection of your automatic replacement method will most probably make your users mad.

4) I think it's useless to look up words in an English (or Turkish?) dictionary in order to find words that are not really English words because people misspell words too much these days.
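The expected-count idea from approach 2 could be sketched like this. Everything here is an assumption made for illustration: the function names are hypothetical, the regular expression is just one possible way to spot runs of single letters, and the probabilities are made-up values that you would have to estimate from real posts.

```php
<?php
// Expected number of vulgar words: the sum over all heuristics of P_i * N_i.
// Each entry pairs an assumed probability 'p' with a hit count 'n'.
function expectedVulgarCount(array $heuristics): float
{
    $expected = 0.0;
    foreach ($heuristics as $h) {
        $expected += $h['p'] * $h['n'];
    }
    return $expected;
}

// Hypothetical heuristic: count runs of three or more single letters
// separated by single spaces, such as "b a d w o r d".
function countSingleLetterRuns(string $text): int
{
    return (int) preg_match_all('/(?<!\S)(?:\pL ){2,}\pL(?!\S)/u', $text);
}

$post = "you are b a d w o r d here";

$heuristics = array(
    // Plain dictionary hits; the count is hard-coded for the example.
    array('p' => 0.75, 'n' => 2),
    // Single-letter runs; 0.5 is an invented probability.
    array('p' => 0.5, 'n' => countSingleLetterRuns($post)),
);

echo expectedVulgarCount($heuristics), "\n";
```

A post could then be sent to a moderator when the expected count exceeds some threshold, instead of being rewritten automatically.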


2

Anyway, you can strip whitespace characters and use (mb_)substr_count(), but this leads to false positives.
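A small sketch of how such a false positive can arise, assuming the mbstring extension is available. The sample text is invented and uses the placeholder word from the question:

```php
<?php
$text = "the lab adword1 report"; // contains no dictionary word as written
$word = "badword1";

// Word-level check: split on whitespace and compare whole tokens.
$tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
$wordLevelHits = count(array_keys($tokens, $word, true));

// Substring check on the whitespace-stripped text.
$stripped = preg_replace('/\s+/u', '', $text);
$strippedHits = mb_substr_count($stripped, $word, 'UTF-8');

var_dump($wordLevelHits); // int(0)
var_dump($strippedHits);  // int(1): "lab" + "adword1" merge into "badword1"
```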


2

As Jirka Helmich suggested, you could remove whitespace (and maybe other special chars) and then search the string for words from your array.

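public function searchForBadWords($strippedText) {
     $counts = array();
     // $suspiciousList is a class property, so it must be accessed through $this.
     foreach($this->suspiciousList as $suspiciousPart) {
          // Count how often this dictionary word occurs in the space-stripped text.
          $counts[$suspiciousPart['word']] = substr_count($strippedText, $suspiciousPart['word']);
          //you can use str_replace here or something, it depends on what you want to achieve
     }
     return $counts;
}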

The problem is that if the text contains something like "blablabad wordblabla" and you remove the spaces, normal words can turn into bad words: "blablabadwordblabla" (know what I mean?) :D

Cheers

Edit: So Ahmad, I see you split the text into words by the spaces at the beginning/end (in short). Maybe you should try to implement both methods: yours with single words and the one above with substring searching. It also depends on how much you care about performance. Maybe you should do some research or something to see how effective it is? :D
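One possible way to combine both methods is sketched below, assuming the dictionary has the same shape as `$suspiciousList` in the question. The function name `detectBadWords` and the "exact"/"possible" split are hypothetical, chosen only to show that the two kinds of hit can be treated differently:

```php
<?php
// Hypothetical helper combining the word-level and the substring check.
function detectBadWords(string $text, array $suspiciousList): array
{
    $text = mb_strtolower($text, 'UTF-8');
    // Whole words, split on any run of whitespace.
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // The same text with all whitespace removed.
    $stripped = implode('', $tokens);

    $result = array('exact' => array(), 'possible' => array());
    foreach ($suspiciousList as $entry) {
        $word = $entry['word'];
        if (in_array($word, $tokens, true)) {
            // The word appears as a whole token: a reliable hit.
            $result['exact'][] = $word;
        } elseif (strpos($stripped, $word) !== false) {
            // Only found after removing spaces: may be a false positive.
            $result['possible'][] = $word;
        }
    }
    return $result;
}

$suspiciousList = array(
    array("word" => "badword1", "score" => 400, "type" => 1),
    array("word" => "badword2", "score" => 250, "type" => 1),
    array("word" => "badword3", "score" => 400, "type" => 1),
);

print_r(detectBadWords("badword1 and b a d w o r d 2", $suspiciousList));
```

The "possible" hits could be given a lower score or passed to a moderator, since they are the ones prone to the glued-words problem described above.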

1

@f1ames: I'm using the following code to turn it into an array.

    $words = mb_strtolower($words, 'UTF-8');
    $words = $this->removeUniCharCategories($words);
    $words = explode(" ", $words);
    // Remove empty strings left by consecutive spaces, then reindex from 0.
    // Note: array_filter() without a callback also drops the string "0".
    $words = array_values(array_filter($words));

But I'm still looking for the best solution.
