Remove non-ASCII characters from string

Question

I'm getting strange characters when pulling data from a website:

Â

How can I remove anything that isn't a non-extended ASCII character?

A more appropriate question can be found here: PHP - replace all non-alphanumeric chars for all languages supported

What do you mean when you say non-ascii, Â is an ascii character (#194)
– Drew Galbraith, Commented Jan 8, 2012 at 22:30
oh. well, I mean things like letters and characters such as $(#*@. I don't know how to explain it other than I only want characters you'd be able to type on your keyboard.
– LordZardeck, Commented Jan 8, 2012 at 22:32
I can type "あいうえお" on my keyboard... Maybe you just have an encoding problem and should interpret the text in the right encoding instead of removing things?
– deceze ♦, Commented Jan 8, 2012 at 22:52

Chris Bornhoft · Accepted Answer · 2018-03-05 15:06:35Z

128

A regex replace would be the best option. Using $str as an example string and matching it using :print:, which is a POSIX Character Class:

$str = 'aAÂ';
$str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

[:print:] matches all printable characters. Its negation, [:^print:], matches all non-printable characters. Any characters that are not printable in the current character set will be removed.

Note: Before using this method, you must ensure that your current character set is ASCII. POSIX Character Classes support both ASCII and Unicode and will match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
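To avoid depending on the current character set, one option is to match on the ASCII range itself rather than on printability. The following is a minimal sketch, not the answer's original method; it assumes PHP 7.0+ for the "\u{...}" escape and uses the PCRE [:ascii:] class, which covers character codes 0 to 127:

```php
<?php
// Sketch: strip every byte outside the 7-bit ASCII range.
// "\u{00C2}" is 'Â' encoded as UTF-8 (bytes C3 82); requires PHP 7.0+.
$str = "aA\u{00C2}";

// Without the /u modifier the pattern is applied byte by byte, so both
// bytes of the UTF-8 sequence fall outside [:ascii:] and are removed.
$clean = preg_replace('/[^[:ascii:]]/', '', $str);

echo $clean, PHP_EOL; // aA
```

Note that this keeps ASCII control characters (codes 0 to 31 and 127); it only removes bytes above 127.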

edited Mar 5, 2018 at 15:06

answered Jan 8, 2012 at 22:34

Chris Bornhoft


14 Comments

DamirR Over a year ago

This solution is not working for me. :( I am getting aAÂ. php 5.3.0. (windows)

vcardillo Over a year ago

How do you make ASCII the selected character set via code?

Jasen Over a year ago

Yes, this answer only works on misconfigured systems. 'Â' is clearly a printing character (it is both inked and consumes space). Use '/[[:^ascii:]]/' instead of '/[[:^print:]]/' to strip non-ASCII.

Hobbes Over a year ago

Jasen, your correction was the right solution for me at least.

Chris Bornhoft Over a year ago

@KasparL.Palgi that is exactly what the original question asked to accomplish: remove the characters completely. To replace them with a non-accented character, you would first need to create a custom mapping of the characters you'd like to replace.


Peter Mortensen · Accepted Answer · 2023-05-03 14:46:26Z

54

Do you want only ASCII printable characters?

Use this:

<?php
header('Content-Type: text/html; charset=UTF-8');
$str = "abqwrešđčžsff";
$res = preg_replace('/[^\x20-\x7E]/', '', $str);
echo "($str)($res)";
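The same pattern can be wrapped in a small helper that optionally keeps tab, line feed, and carriage return. This is only a sketch; the function name strip_to_printable_ascii and its $keepWhitespace parameter are made up for illustration:

```php
<?php
// Hypothetical helper: keep printable ASCII (0x20-0x7E), optionally also
// tab (0x09), line feed (0x0A) and carriage return (0x0D).
function strip_to_printable_ascii(string $s, bool $keepWhitespace = false): string
{
    $pattern = $keepWhitespace
        ? '/[^\x09\x0A\x0D\x20-\x7E]/'
        : '/[^\x20-\x7E]/';

    return (string) preg_replace($pattern, '', $s);
}

// Same sample string as above, written with UTF-8 escapes (PHP 7.0+).
$sample = "abqwre\u{0161}\u{0111}\u{010D}\u{017E}sff";
echo strip_to_printable_ascii($sample), PHP_EOL; // abqwresff
```

Listing the whitespace bytes explicitly keeps the other control characters out of the result.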

Or even better, convert your input to UTF-8 and use phputf8 lib to translate 'not normal' characters into their ASCII representation:

require_once('libs/utf8/utf8.php');
require_once('libs/utf8/utils/bad.php');
require_once('libs/utf8/utils/validation.php');
require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

if(!utf8_is_valid($str))
{
  $str = utf8_bad_strip($str);
}

$str = utf8_to_ascii($str, '');
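If the phputf8 library is not available, a similar effect can possibly be had with the Transliterator class from PHP's intl extension. This swaps in a different technique from the one above and is a sketch under assumptions: it assumes intl is installed (falling back to plain stripping if not), and the helper name to_ascii_intl is hypothetical. The exact transliterations depend on the ICU version, so no output is shown:

```php
<?php
// Sketch assuming the intl extension; without it, only stripping is done.
function to_ascii_intl(string $s): string
{
    if (class_exists('Transliterator')) {
        // 'Any-Latin; Latin-ASCII' is an ICU compound transliterator ID.
        $t = Transliterator::create('Any-Latin; Latin-ASCII');
        if ($t !== null) {
            $converted = $t->transliterate($s);
            if ($converted !== false) {
                $s = $converted;
            }
        }
    }

    // Drop whatever is still outside printable ASCII.
    return (string) preg_replace('/[^\x20-\x7E]/', '', $s);
}

echo to_ascii_intl("abqwre\u{0161}\u{0111}\u{010D}\u{017E}sff"), PHP_EOL;
```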

edited May 3, 2023 at 14:46

Peter Mortensen


answered Jan 8, 2012 at 22:51

DamirR


2 Comments

John Langford Over a year ago

I also wanted to keep the tab character, so I used this regular expression: [^\x00-\x7E]

user6096790 Over a year ago

Thank you! So much better than the accepted answer! over 10 years later, this saved me a lot of grief!

Peter Mortensen · Accepted Answer · 2023-05-03 15:05:13Z

38

Use:

$clearstring = filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);

Note that FILTER_SANITIZE_STRING is deprecated since PHP 8.1.
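On PHP 8.1 and later, one possible replacement is FILTER_UNSAFE_RAW combined with the same flag. Unlike FILTER_SANITIZE_STRING, it does not strip tags, so this is a sketch of a close substitute rather than an exact equivalent:

```php
<?php
// "\u{00C2}" is 'Â' in UTF-8; requires PHP 7.0+ for the escape syntax.
$rawstring = "aA\u{00C2}<b>bold</b>";

// FILTER_UNSAFE_RAW leaves the string alone apart from the given flags;
// FILTER_FLAG_STRIP_HIGH removes bytes with a value above 127.
$clearstring = filter_var($rawstring, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

echo $clearstring, PHP_EOL; // aA<b>bold</b>
```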

edited May 3, 2023 at 15:05

Peter Mortensen


answered Aug 24, 2015 at 8:46

Utopia


5 Comments

user414873 Over a year ago

Seems perfect for PHP >= 5.2

ds00424 Over a year ago

This seems to also strip tags. For me it was removing <%AnyTextHere%> See PHP Sanitize filters

ᴍᴇʜᴏᴠ Over a year ago

Heads up: if you go to functions-online.com to test this, it will put single quotes around FILTER_FLAG_STRIP_HIGH which stops it from working

bhar1red Over a year ago

This was helpful. Though I used FILTER_FLAG_ENCODE_HIGH instead of FILTER_FLAG_STRIP_HIGH

Oleg Over a year ago

FILTER_SANITIZE_STRING is deprecated since PHP 8.1

Peter Mortensen · Accepted Answer · 2023-05-03 14:48:59Z

Kind of related: We had a web application that had to send data to a legacy system that could only deal with the first 128 characters of the ASCII character set.

The solution we had to use was something that would "translate" as many characters as possible into close-matching ASCII equivalents, but leave anything that could not be translated alone.

Normally I would do something like this:

<?php
// transliterate
if (function_exists('iconv')) {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
}
?>

... but that replaces everything that can't be translated with a question mark (?).

So we ended up doing the following. Check the end of this function for a (commented-out) PHP regex that just strips out non-ASCII characters.

<?php
public function cleanNonAsciiCharactersInString($orig_text) {

    $text = $orig_text;

    // Single letters
    $text = preg_replace("/[∂άαáàâãªä]/u",      "a", $text);
    $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u",     "A", $text);
    $text = preg_replace("/[ЂЪЬБъь]/u",           "b", $text);
    $text = preg_replace("/[βвВ]/u",            "B", $text);
    $text = preg_replace("/[çς©с]/u",            "c", $text);
    $text = preg_replace("/[ÇС]/u",              "C", $text);
    $text = preg_replace("/[δ]/u",             "d", $text);
    $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
    $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u",     "E", $text);
    $text = preg_replace("/[₣]/u",               "F", $text);
    $text = preg_replace("/[НнЊњ]/u",           "H", $text);
    $text = preg_replace("/[ђћЋ]/u",            "h", $text);
    $text = preg_replace("/[ÍÌÎÏ]/u",           "I", $text);
    $text = preg_replace("/[íìîïιίϊі]/u",       "i", $text);
    $text = preg_replace("/[Јј]/u",             "j", $text);
    $text = preg_replace("/[ΚЌК]/u",            'K', $text);
    $text = preg_replace("/[ќк]/u",             'k', $text);
    $text = preg_replace("/[ℓ∟]/u",             'l', $text);
    $text = preg_replace("/[Мм]/u",             "M", $text);
    $text = preg_replace("/[ñηήηπⁿ]/u",            "n", $text);
    $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u",       "N", $text);
    $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
    $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u",     "O", $text);
    $text = preg_replace("/[ρφрРф]/u",          "p", $text);
    $text = preg_replace("/[®яЯ]/u",              "R", $text);
    $text = preg_replace("/[ГЃгѓ]/u",              "r", $text);
    $text = preg_replace("/[Ѕ]/u",              "S", $text);
    $text = preg_replace("/[ѕ]/u",              "s", $text);
    $text = preg_replace("/[Тт]/u",              "T", $text);
    $text = preg_replace("/[τ†‡]/u",              "t", $text);
    $text = preg_replace("/[úùûüџμΰµυϋύ]/u",     "u", $text);
    $text = preg_replace("/[√]/u",               "v", $text);
    $text = preg_replace("/[ÚÙÛÜЏЦц]/u",         "U", $text);
    $text = preg_replace("/[Ψψωώẅẃẁщш]/u",      "w", $text);
    $text = preg_replace("/[ẀẄẂШЩ]/u",          "W", $text);
    $text = preg_replace("/[ΧχЖХж]/u",          "x", $text);
    $text = preg_replace("/[ỲΫ¥]/u",           "Y", $text);
    $text = preg_replace("/[ỳγўЎУуч]/u",       "y", $text);
    $text = preg_replace("/[ζ]/u",              "Z", $text);

    // Punctuation
    $text = preg_replace("/[‚‚]/u", ",", $text);
    $text = preg_replace("/[`‛′’‘]/u", "'", $text);
    $text = preg_replace("/[″“”«»„]/u", '"', $text);
    $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
    $text = preg_replace("/[  ]/u", ' ', $text);

    $text = str_replace("…", "...", $text);
    $text = str_replace("≠", "!=", $text);
    $text = str_replace("≤", "<=", $text);
    $text = str_replace("≥", ">=", $text);
    $text = preg_replace("/[‗≈≡]/u", "=", $text);


    // Exciting combinations
    $text = preg_replace("/[ыЫ]/u", "bl", $text);
    $text = str_replace("℅", "c/o", $text);
    $text = str_replace("₧", "Pts", $text);
    $text = str_replace("™", "tm", $text);
    $text = str_replace("№", "No", $text);
    $text = str_replace("Ч", "4", $text);
    $text = str_replace("‰", "%", $text);
    $text = preg_replace("/[∙•]/u", "*", $text);
    $text = str_replace("‹", "<", $text);
    $text = str_replace("›", ">", $text);
    $text = str_replace("‼", "!!", $text);
    $text = str_replace("⁄", "/", $text);
    $text = str_replace("∕", "/", $text);
    $text = str_replace("⅞", "7/8", $text);
    $text = str_replace("⅝", "5/8", $text);
    $text = str_replace("⅜", "3/8", $text);
    $text = str_replace("⅛", "1/8", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[Љљ]/u", "Ab", $text);
    $text = preg_replace("/[Юю]/u", "IO", $text);
    $text = preg_replace("/[ﬁﬂ]/u", "fi", $text);
    $text = preg_replace("/[зЗ]/u", "3", $text);
    $text = str_replace("£", "(pounds)", $text);
    $text = str_replace("₤", "(lira)", $text);
    $text = preg_replace("/[‰]/u", "%", $text);
    $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
    $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);


    //2) Translation CP1252.
    $trans = get_html_translation_table(HTML_ENTITIES);
    $trans['f'] = '&fnof;';    // Latin Small Letter F With Hook
    $trans['-'] = array(
        '&hellip;',     // Horizontal Ellipsis
        '&tilde;',      // Small Tilde
        '&ndash;'       // Dash
        );
    $trans["+"] = '&dagger;';    // Dagger
    $trans['#'] = '&Dagger;';    // Double Dagger
    $trans['M'] = '&permil;';    // Per Mille Sign
    $trans['S'] = '&Scaron;';    // Latin Capital Letter S With Caron
    $trans['OE'] = '&OElig;';    // Latin Capital Ligature OE
    $trans["'"] = array(
        '&lsquo;',  // Left Single Quotation Mark
        '&rsquo;',  // Right Single Quotation Mark
        '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
        '&sbquo;',  // Single Low-9 Quotation Mark
        '&circ;',   // Modifier Letter Circumflex Accent
        '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );

    $trans['"'] = array(
        '&ldquo;',  // Left Double Quotation Mark
        '&rdquo;',  // Right Double Quotation Mark
        '&bdquo;',  // Double Low-9 Quotation Mark
        );

    $trans['*'] = '&bull;';    // Bullet
    $trans['n'] = '&ndash;';    // En Dash
    $trans['m'] = '&mdash;';    // Em Dash
    $trans['tm'] = '&trade;';    // Trade Mark Sign
    $trans['s'] = '&scaron;';    // Latin Small Letter S With Caron
    $trans['oe'] = '&oelig;';    // Latin Small Ligature OE
    $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
    $trans['euro'] = '&euro;';    // euro currency symbol
    ksort($trans);

    foreach ($trans as $k => $v) {
        $text = str_replace($v, $k, $text);
    }

    // 3) remove <p>, <br/> ...
    $text = strip_tags($text);

    // 4) &amp; => & &quot; => '
    $text = html_entity_decode($text);


    // transliterate
    // if (function_exists('iconv')) {
    // $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    // }

    // remove non ascii characters
    // $text =  preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

    return $text;
}

?>
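The same "translate what you can, leave the rest alone" idea can also be sketched as a single lookup table passed to strtr(). This is an illustrative alternative, not the code we actually ran; the constant ASCII_MAP, the function map_to_ascii, and the tiny sample table are all made up for the example:

```php
<?php
// Hypothetical, deliberately tiny mapping table; extend as needed.
// Keys are UTF-8 strings written with PHP 7.0+ escapes.
const ASCII_MAP = [
    "\u{00E9}" => 'e',   // é
    "\u{00E0}" => 'a',   // à
    "\u{2026}" => '...', // …
    "\u{2260}" => '!=',  // ≠
    "\u{201C}" => '"',   // “
    "\u{201D}" => '"',   // ”
];

function map_to_ascii(string $text): string
{
    // strtr() with an array replaces every key in one pass and leaves
    // anything that has no entry in the table untouched.
    return strtr($text, ASCII_MAP);
}

echo map_to_ascii("caf\u{00E9} \u{2026} a \u{2260} b"), PHP_EOL; // cafe ... a != b
```

One table keeps all the mappings in one place and avoids running dozens of regular expressions over the same string.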

According to php.net/manual/en/function.iconv.php#74101, that should only be an issue if you do not select a proper locale (one other than C or POSIX).
Re "the first 128 characters of the ASCII character set": ASCII only has 128: "ASCII has just 128 code points". The eighth bit is used for extensions, like code page Windows-1252 or ISO 8859-1.
iconv('utf-8', 'us-ascii//TRANSLIT', $text) makes the whole string blank if any of the characters is non-ASCII. It removes even the good ASCII characters.

simhumileco · Accepted Answer · 2018-10-17 12:57:12Z

5

I also think that the best solution might be to use a regular expression.

Here's my suggestion:

function convert_to_normal_text($text) {

    $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
    $normal_text = preg_replace("/[^$normal_characters]/", '', $text);

    return $normal_text;
}

Then you can use it like this:

$before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
$after = convert_to_normal_text($before);
echo $after;

Displays:

Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

edited Oct 17, 2018 at 12:57

answered Aug 17, 2016 at 12:20

simhumileco


1 Comment

ds00424 Over a year ago

FYI, Typo on line 4: '$normal_caracters' => '$normal_characters'

nhahtdh · Accepted Answer · 2013-09-10 16:40:12Z

1

I just had to add the header

header('Content-Type: text/html; charset=UTF-8');
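When the text comes from somewhere else and the real problem is a mismatched encoding, the fetched string can be normalised to UTF-8 before it is output with that header. The following is a rough sketch only: it assumes the mbstring extension, it assumes that anything that is not valid UTF-8 is Windows-1252 (which may not hold for your data), and the helper name ensure_utf8 is hypothetical:

```php
<?php
// Sketch assuming mbstring, and assuming non-UTF-8 input is Windows-1252.
function ensure_utf8(string $s, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave it untouched
    }

    return (string) mb_convert_encoding($s, 'UTF-8', $fallback);
}

echo ensure_utf8("caf\xE9"), PHP_EOL;     // café (re-encoded from Windows-1252)
echo ensure_utf8("caf\u{00E9}"), PHP_EOL; // café (already UTF-8, unchanged)
```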

edited Sep 10, 2013 at 16:40

nhahtdh


answered Sep 10, 2013 at 16:24

ALHaines


2 Comments

Jasen Over a year ago

that will fix the case where UTF8 is being interpreted as WIN-1252 which is the default encoding for HTML, however it will not remove any characters from a string.

Peter Mortensen Over a year ago

They probably don't have control over the website: "I'm getting strange characters when pulling data from a website:"

Nhan Chau KP · Accepted Answer · 2022-09-12 08:42:33Z

0

My problem is solved

$text = 'Châu Thái  Nhân 12/09/2022';
echo preg_replace('/[\x00-\x1F\x7F]/', '', $text);
//Châu Thái  Nhân 12/09/2022

answered Sep 12, 2022 at 8:42

Nhan Chau KP


1 Comment

Peter Mortensen Over a year ago

What is the result? What does it do? Completely wipes out the characters? Removes non-printable characters? Please explain your solution. From the Help Center: "...always explain why the solution you're presenting is appropriate and how it works". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).

Peter Mortensen · Accepted Answer · 2023-05-03 14:58:18Z

0

This should be pretty straightforward and there isn't any need for an iconv function:

// Remove all characters that are not the separator, a-z, 0-9, underscore, or whitespace
$string = preg_replace('![^'.preg_quote('-').'a-z0-9_\s]+!', '', strtolower($string));

// Replace all separator characters and whitespace by a single separator
$string = preg_replace('!['.preg_quote('-').'\s]+!u', '-', $string);
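The two steps can be combined into one function. This is a minimal sketch; the name slugify and the $separator parameter are hypothetical, and it assumes the intended character class is a-z, 0-9, and underscore:

```php
<?php
// Hypothetical wrapper around the two steps; '-' is the assumed separator.
function slugify(string $string, string $separator = '-'): string
{
    $sep = preg_quote($separator, '!');

    // Drop everything that is not the separator, a-z, 0-9, underscore, or whitespace.
    $string = preg_replace('![^' . $sep . 'a-z0-9_\s]+!', '', strtolower($string));

    // Collapse runs of separators and whitespace into a single separator.
    $string = preg_replace('![' . $sep . '\s]+!', $separator, $string);

    // Remove a leading or trailing separator left over from the edges.
    return trim($string, $separator);
}

echo slugify('Hello  World, PHP 8!'), PHP_EOL; // hello-world-php-8
```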

edited May 3, 2023 at 14:58

Peter Mortensen


answered Mar 13, 2015 at 7:30

Goran Jakovljevic


Comments

Peter Mortensen · Accepted Answer · 2023-05-03 15:01:52Z

-1

I think the best way to do something like this is to use the ord() function. This way you will be able to keep characters written in any language. Just remember to test your text's ord() results first. This works on single-byte encodings only; it will not work on Unicode (multi-byte) text.

$name = "βγδεζηΘKgfgebhjrf!@#$%^&";
// This function will clear all non greek and english characters on greek-iso charset
function replace_characters($string)
{
    $str_length = strlen($string);
    $new_string = ''; // initialise the result before appending to it
    for ($x=0; $x < $str_length; $x++)
    {
        $character = $string[$x];
        if ((ord($character)  >  64 && ord($character) <   91) ||
            (ord($character)  >  96 && ord($character) <  123) ||
            (ord($character)  > 192 && ord($character) <  210) ||
            (ord($character)  > 210 && ord($character) <  218) ||
            (ord($character)  > 219 && ord($character) <  250) ||
             ord($character) == 252 || ord($character) == 254)
        {
            $new_string = $new_string.$character;
        }
    }
    return $new_string;
}
// End function

$name = replace_characters($name);

echo $name;
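A slightly tidier variant of the same loop calls ord() once per byte and initialises the result up front. This is only a sketch: the byte ranges are copied unchanged from the function above, and the name keep_greek_and_english is made up for illustration:

```php
<?php
// Variant of the loop above: ord() is evaluated once per byte and the
// result string starts out empty. The byte ranges are unchanged.
function keep_greek_and_english(string $string): string
{
    $result = '';
    $length = strlen($string);

    for ($i = 0; $i < $length; $i++) {
        $code = ord($string[$i]);
        if (($code > 64 && $code < 91) ||
            ($code > 96 && $code < 123) ||
            ($code > 192 && $code < 210) ||
            ($code > 210 && $code < 218) ||
            ($code > 219 && $code < 250) ||
            $code === 252 || $code === 254) {
            $result .= $string[$i];
        }
    }

    return $result;
}

// "\xE2" is the byte for β in ISO-8859-7; the digit and "!" are dropped.
echo bin2hex(keep_greek_and_english("\xE2K9!g")), PHP_EOL; // e24b67
```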

edited May 3, 2023 at 15:01

Peter Mortensen


answered Apr 25, 2015 at 12:56

websolutions.gr


2 Comments

Kristen Waite Over a year ago

Heavy-handed but tweakable... I like it.

xZero Over a year ago

You're doing ord() on the same character over and over again just for different comparisons (line 9). That's extremely inefficient. You should save result of ord() in variable and then reuse it in conditional. Also, consider using === instead of == as use of == is discouraged. Although I don't blame you for this, ironically PHP manual for ord() shows using == in examples.
