76

I'm getting strange characters when pulling data from a website:

Â

How can I remove everything that isn't a standard (non-extended) ASCII character?


A more appropriate question can be found here: PHP - replace all non-alphanumeric chars for all languages supported

9
  • 1
    What do you mean when you say non-ascii, Â is an ascii character (#194) Commented Jan 8, 2012 at 22:30
  • 1
    oh. well, I mean things like letters and characters such as $(#*@. I don't know how to explain it other than I only want characters you'd be able to type on your keyboard. Commented Jan 8, 2012 at 22:32
  • 2
    Could you define what are normal characters? Commented Jan 8, 2012 at 22:34
  • 9
    I can type "あいうえお" on my keyboard... Maybe you just have an encoding problem and should interpret the text in the right encoding instead of removing things? Commented Jan 8, 2012 at 22:52
  • 2
    @DrewGalbraith #194 is not ASCII, ASCII only goes to #127 Commented Sep 8, 2019 at 22:34

9 Answers

128

A regex replace would be the best option. Using $str as an example string and matching it using :print:, which is a POSIX Character Class:

$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

What :print: does is look for all printable characters. The reverse, :^print:, looks for all non-printable characters. Any characters that are not part of the current character set will be removed.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX Character Classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
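If changing the character set is not practical, PCRE also offers an `[:ascii:]` class (code points 0 to 127), which does not depend on what the current character set treats as printable. A minimal sketch, where the function name `strip_non_ascii` is made up for illustration; note that, unlike `:print:`, it keeps ASCII control characters such as tab and newline:

```php
<?php
// Hypothetical helper: remove every byte outside the 7-bit ASCII range.
// No /u modifier is used, so the subject is processed byte by byte and each
// byte of a multibyte UTF-8 character is removed individually.
function strip_non_ascii(string $str): string
{
    return (string) preg_replace('/[[:^ascii:]]/', '', $str);
}

// "\xC3\x82" is the UTF-8 encoding of the character from the question.
echo strip_non_ascii("aA\xC3\x82"), "\n"; // aA
```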

14 Comments

This solution is not working for me. :( I am getting aAÂ. php 5.3.0. (windows)
How do you make ASCII the selected character set via code?
Yes, this answer only works on misconfigured systems. 'Â' is clearly a printing character (it is both inked and consumes space). Use '/[[:^ascii:]]/' instead of '/[[:^print:]]/' to strip non-ASCII characters.
Jasen, your correction was the right solution for me at least.
@KasparL.Palgi that is exactly what the original question asked to accomplish: remove the characters completely. To replace with a non-accented character, you would need to create a custom mapping of the characters you'd like to replace first.
54

Do you want only ASCII printable characters?

Use this:

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/', '', $str);
echo "($str)($res)";

Or even better, convert your input to UTF-8 and use phputf8 lib to translate 'not normal' characters into their ASCII representation:

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str = utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '');
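The byte-range pattern `/[^\x20-\x7E]/` above also removes tabs and line breaks, because they sit below 0x20. If those should survive, one option is to list them explicitly. A small sketch, with `strip_to_printable_ascii` as a made-up name:

```php
<?php
// Hypothetical helper: keep printable ASCII (0x20-0x7E) plus tab, line feed,
// and carriage return; drop every other byte.
function strip_to_printable_ascii(string $str): string
{
    return (string) preg_replace('/[^\x09\x0A\x0D\x20-\x7E]/', '', $str);
}

// NUL, BEL, and the two bytes of the UTF-8 "e with acute" are removed.
echo strip_to_printable_ascii("col1\tcol2\x00\x07 caf\xC3\xA9\n"); // "col1\tcol2 caf\n"
```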

2 Comments

I also wanted to keep the tab character, so I used this regular expression: [^\x00-\x7E]
Thank you! So much better than the accepted answer! over 10 years later, this saved me a lot of grief!
38

Use:

$clearstring = filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);

Note that FILTER_SANITIZE_STRING is deprecated since PHP 8.1.
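On PHP 8.1 and later, one way to keep the same idea without the deprecated filter is to pair `FILTER_FLAG_STRIP_HIGH` with `FILTER_UNSAFE_RAW`. This is a sketch rather than a drop-in replacement: unlike `FILTER_SANITIZE_STRING`, it does not strip tags or encode quotes. The sample string is made up.

```php
<?php
// FILTER_UNSAFE_RAW leaves the string alone apart from what the flags request.
// FILTER_FLAG_STRIP_HIGH removes bytes with a value above 127.
$rawstring = "caf\xC3\xA9 <b>bold</b>"; // illustrative input, UTF-8 encoded
$clearstring = filter_var($rawstring, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

echo $clearstring, "\n"; // caf <b>bold</b>
```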

5 Comments

Seems perfect for PHP >= 5.2
This seems to also strip tags. For me it was removing <%AnyTextHere%> See PHP Sanitize filters
Heads up: if you go to functions-online.com to test this, it will put single quotes around FILTER_FLAG_STRIP_HIGH which stops it from working
This was helpful. Though I used FILTER_FLAG_ENCODE_HIGH instead of FILTER_FLAG_STRIP_HIGH
FILTER_SANITIZE_STRING is deprecated since PHP 8.1
26

Kind of related: We had a web application that had to send data to a legacy system that could only deal with the first 128 characters of the ASCII character set.

The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.

Normally I would do something like this:

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>

... but that replaces everything that can't be translated with a question mark (?).

So we ended up doing the following. At the end of this function there is a (commented-out) PHP regex that just strips out non-ASCII characters.

<?php
public function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
    $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
    $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[çς©с]/u",            "c", $text);
    $text = preg_replace("/[ÇС]/u",              "C", $text);
    $text = preg_replace("/[δ]/u",             "d", $text);
    $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
    $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
    $text = preg_replace("/[₣]/u",               "F", $text);
    $text = preg_replace("/[НнЊњ]/u",           "H", $text);
    $text = preg_replace("/[ђћЋ]/u",            "h", $text);
    $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
    $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
    $text = preg_replace("/[Јј]/u",             "j", $text);
    $text = preg_replace("/[ΚЌК]/u",            'K', $text);
    $text = preg_replace("/[ќк]/u",             'k', $text);
    $text = preg_replace("/[ℓ∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
    $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
    $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[®яЯ]/u",              "R", $text);
    $text = preg_replace("/[ГЃгѓ]/u",              "r", $text);
    $text = preg_replace("/[Ѕ]/u",              "S", $text);
    $text = preg_replace("/[ѕ]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ†‡]/u",              "t", $text);
    $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
    $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
    $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
    $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[‚‚]/u", ",", $text);
    $text = preg_replace("/[`‛′’‘]/u", "'", $text);
    $text = preg_replace("/[″“”«»„]/u", '"', $text);
    $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[‗≈≡]/u", "=", $text);


    // Exciting combinations
    $text = preg_replace("/[ыЫ]/u", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("₧", "Pts", $text);
    $text = str_replace("™", "tm", $text);
    $text = str_replace("№", "No", $text);
    $text = str_replace("Ч", "4", $text);
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[∙•]/u", "*", $text);
    $text = str_replace("‹", "<", $text);
    $text = str_replace("›", ">", $text);
    $text = str_replace("‼", "!!", $text);
    $text = str_replace("⁄", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("⅞", "7/8", $text);
    $text = str_replace("⅝", "5/8", $text);
    $text = str_replace("⅜", "3/8", $text);
    $text = str_replace("⅛", "1/8", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[Љљ]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = str_replace("ﬁ", "fi", $text);
    $text = str_replace("ﬂ", "fl", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text);
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("₤", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);


    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) &amp; => & &quot; => "
    $text = html_entity_decode($text);


    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

    return $text;
}

?>
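A shorter alternative to a hand-written mapping like the one above, assuming the intl extension may be available (the answer itself does not use it): ICU's transliterator can fold many scripts to ASCII. The sketch below uses a made-up function name, `to_ascii`, and falls back to plain stripping when intl is not loaded. The exact output depends on the installed ICU version, so none is shown.

```php
<?php
// Hypothetical helper. Transliterates with ICU when the intl extension is
// loaded, then removes whatever non-ASCII bytes remain.
function to_ascii(string $text): string
{
    if (function_exists('transliterator_transliterate')) {
        // 'Any-Latin; Latin-ASCII' is an ICU transform ID: convert any script
        // to Latin, then fold Latin letters to plain ASCII where ICU can.
        $converted = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }
    // Anything still outside 0x00-0x7F is dropped rather than replaced.
    return (string) preg_replace('/[^\x00-\x7F]/', '', $text);
}

echo to_ascii("Cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e"), "\n";
```

Unlike the mapping function, this does not touch HTML tags or entities; those steps would still need `strip_tags()` and `html_entity_decode()`.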

4 Comments

According to php.net/manual/en/function.iconv.php#74101 , that should only be an issue if you do not select a proper locale (other than C or POSIX)
there are only 128 characters in the ascii character set.
Re "the first 128 characters of the ASCII character set": ASCII only has 128: "ASCII has just 128 code points". The last bit is used for extensions, like code page Windows-1252 or ISO 8859-1.
iconv('utf-8', 'us-ascii//TRANSLIT', ...) makes the whole string blank if any of the characters is non-ASCII. It removes even good ASCII characters.
5

I also think that the best solution might be to use a regular expression.

Here's my suggestion:

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

Then you can use it like this:

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;

Displays:

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

1 Comment

FYI, Typo on line 4: '$normal_caracters' => '$normal_characters'
1

I just had to add the header:

header('Content-Type: text/html; charset=UTF-8');
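Why a charset fix can make the stray Â disappear: one common cause (an assumption here, since the question does not show the raw bytes) is UTF-8 text being read as a single-byte encoding. The sketch below reproduces that mistake by hand with a made-up helper, `latin1_bytes_to_utf8`, so it needs no extensions.

```php
<?php
// Hypothetical helper: re-encode a string as if every input byte were an
// ISO-8859-1 character. Applying it to text that is already UTF-8 reproduces
// the double-encoding mistake.
function latin1_bytes_to_utf8(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $code = ord($bytes[$i]);
        $out .= $code < 0x80
            ? chr($code)
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

$nbsp = "\xC2\xA0";                     // U+00A0 NO-BREAK SPACE in UTF-8
$garbled = latin1_bytes_to_utf8($nbsp); // misread as two Latin-1 characters
echo bin2hex($garbled), "\n";           // c382c2a0, i.e. U+00C2 then U+00A0
```

In this scenario a single non-breaking space turns into two characters, the first of which is Â. Declaring or converting the encoding correctly avoids the problem without deleting anything.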

2 Comments

That will fix the case where UTF-8 is being interpreted as Windows-1252, which is the default encoding for HTML; however, it will not remove any characters from a string.
They probably don't have control over the website: "I'm getting strange characters when pulling data from a website:"
0

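This solved my problem. The pattern removes ASCII control characters (0x00-0x1F and 0x7F) and leaves everything else, including accented letters, untouched: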

$text = 'Châu Thái  Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái  Nhân 12/09/2022

1 Comment

What is the result? What does it do? Completely wipes out the characters? Removes non-printable characters? Please explain your solution. From the Help Center: "...always explain why the solution you're presenting is appropriate and how it works". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).
0

This should be pretty straightforward and there isn't any need for an iconv function:

// Remove all characters that are not the separator, a-z, 0-9, underscore, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));

// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);
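The two replacements fit naturally into one helper. A sketch, with `slugify` and its `$separator` parameter as made-up names; the character class is written as `a-z0-9_` here, and leading or trailing separators are trimmed, which the snippet above does not do:

```php
<?php
// Hypothetical helper built on the two replacements above.
function slugify(string $string, string $separator = '-'): string
{
    // Escape the separator for use inside a pattern delimited by "!".
    $sep = preg_quote($separator, '!');
    // Drop everything that is not the separator, a-z, 0-9, underscore, or whitespace.
    $string = preg_replace('![^' . $sep . 'a-z0-9_\s]+!', '', strtolower($string));
    // Collapse runs of separators and whitespace into one separator.
    $string = preg_replace('![' . $sep . '\s]+!', $separator, $string);
    return trim($string, $separator);
}

echo slugify('Hello, World!  PHP'), "\n"; // hello-world-php
```

Non-ASCII letters are removed rather than transliterated, so this suits identifiers and URLs more than text meant to be read.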

Comments

-1

I think the best way to do something like this is to use the ord() function. This way you can keep characters written in any language, as long as the text uses a single-byte encoding. Just remember to test your text's ord() results first. This will not work on Unicode text.

$name = "βγδεζηΘKgfgebhjrf!@#$%^&";
// This function will clear all non greek and english characters on greek-iso charset
function replace_characters($string)
{
    $new_string = ''; // initialize so the function always returns a string
    $str_length = strlen($string);
    for ($x=0; $x < $str_length; $x++)
    {
        $character = $string[$x];
        if ((ord($character)  >  64 && ord($character) <   91) ||
            (ord($character)  >  96 && ord($character) <  123) ||
            (ord($character)  > 192 && ord($character) <  210) ||
            (ord($character)  > 210 && ord($character) <  218) ||
            (ord($character)  > 219 && ord($character) <  250) ||
             ord($character) == 252 || ord($character) == 254)
        {
            $new_string = $new_string.$character;
        }
    }
    return $new_string;
}
// End function

$name = replace_characters($name);

echo $name;
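The same loop can be written so that ord() runs once per byte and the allowed ranges live in a list that is easy to tweak. A sketch with made-up names (`keep_listed_bytes`, `$ascii_letters`); only the ASCII letter ranges are shown, and ranges for a single-byte code page would be added the same way:

```php
<?php
// Hypothetical helper: keep only bytes whose value falls inside one of the
// given inclusive [low, high] ranges. Works on single-byte encodings only.
function keep_listed_bytes(string $input, array $ranges): string
{
    $out = '';
    $len = strlen($input);
    for ($i = 0; $i < $len; $i++) {
        $code = ord($input[$i]); // computed once, reused for every range
        foreach ($ranges as [$low, $high]) {
            if ($code >= $low && $code <= $high) {
                $out .= $input[$i];
                break;
            }
        }
    }
    return $out;
}

$ascii_letters = [[65, 90], [97, 122]]; // A-Z and a-z
echo keep_listed_bytes('Kgf!@#123abc', $ascii_letters), "\n"; // Kgfabc
```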

2 Comments

Heavy-handed but tweakable... I like it.
You're calling ord() on the same character over and over again just for different comparisons (line 9). That's extremely inefficient. You should save the result of ord() in a variable and then reuse it in the conditional. Also, consider using === instead of ==, as use of == is discouraged. Although I don't blame you for this; ironically, the PHP manual for ord() uses == in its examples.
