
I need a function that removes every character not listed in a pattern from a string but keeps foreign-language letters. I know preg_replace supports \p{...} Unicode properties, but I can't get it working for some reason.

I use this call to strip unwanted characters from the string:

$main_content=preg_replace("/[^a-zA-Z0-9`~!@#\$%\^&\*\(\)-_=\+\\|\,<\.>\/\?;:'\"\[\]\s]/", "", $main_content); //remove all symbols that do NOT match these

Put simply, it should keep all the standard letters, digits and standard symbols such as +-!@#$, and remove characters like © and ™. If there is a better way to write this preg_replace, please let me know.

Now, I want the function to keep letters in foreign languages, so I modified it to

$main_content=preg_replace("/[^\p{L}a-zA-Z0-9`~!@#\$%\^&\*\(\)-_=\+\\|\,<\.>\/\?;:'\"\[\]\s]/", "", $main_content); //remove all symbols that do NOT match these

(Note the added \p{L}.) Unfortunately, it didn't work as expected. When I echo the text, the foreign-language letters are not removed (that's good), but they are converted into � (that's bad).

How do I fix it?

  • What are 'foreign language letters'? Commented Sep 12, 2013 at 11:16
  • Is your PHP script UTF-8 encoded? Commented Sep 12, 2013 at 11:17
  • Can you show an example of input and expected output? Commented Sep 12, 2013 at 11:18
  • Yes, it's UTF-8 encoded (like I said, if I echo the text before using preg_replace, everything is perfect). Here are some foreign letters: ą č ę ė į š ų ų (Lithuanian) Commented Sep 12, 2013 at 11:18
  • Sure thing, here's the example: lipskas.com/test.php Commented Sep 12, 2013 at 11:22

2 Answers


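\p{L} only works correctly on UTF-8 text with the u modifier. Without it, PCRE treats the string as individual bytes, so multi-byte UTF-8 characters get split apart, which is where the � comes from: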

$main_content=preg_replace("/[^\p{L}]/u", "", $main_content);

Notice the u added after the closing /.
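
As a rough sketch of how the full pattern from the question might look with the modifier added: the helper name strip_symbols and the exact list of allowed symbols are my assumptions, not something taken from the question. The hyphen is moved to the end of the character class because a hyphen between two characters, as in `)-_`, is read as a range.

```php
<?php
// Hypothetical helper (name and symbol list are assumptions): keep letters
// from any script, ASCII digits, whitespace and a set of ASCII symbols;
// remove everything else.
function strip_symbols(string $text): string
{
    // Single-quoted so PHP does not interpret "$" or most backslashes.
    // The hyphen is last in the class so it is a literal, not a range.
    $pattern = '/[^\p{L}0-9\s`~!@#$%^&*()_=+\\\\|,<.>\/?;:\'"\[\]-]/u';

    // With the u modifier, preg_replace returns null when the input is not
    // valid UTF-8; fall back to an empty string in that case.
    return preg_replace($pattern, '', $text) ?? '';
}

// The file itself must be saved as UTF-8 for these literals to work.
echo strip_symbols("ąčę © test™ 100%!"), "\n";
```

Here the Lithuanian letters, the digits and the `%` and `!` signs are kept, while © and ™ are removed.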


4 Comments

I shortened the example. Because the class is negated, it removes everything except letters.
This one seems to be working, except it also removes spaces and other symbols like !@#. I guess I'll need to add them to the list of allowed symbols.
Yes, as I said, it's shortened ;) And \p{L} matches all letters, so there's no need to list a-z.
I was going to add some allowed symbols to the list, but then I combined a pattern from different examples found on the Internet that seems to work too. In case someone needs it: $main_content=preg_replace('/[^\p{L}[:print:]]/u', "", $main_content); This allows all printable characters in any language and removes characters like © ™...

I thought I would expand on the top answer, since there is a bit more to it if you are dealing with certain Unicode languages, and a Google search leads you to this page first.


preg_replace lets you use \p{L} to match any letter from any language, but that alone is not always enough.

There is also \p{N}, which covers any kind of numeric character.

Finally, be aware that some languages, such as Thai and Arabic, have "tone" or "emotion" characters that \p{L} alone does not cover.

For Thai you can add \p{Thai}, which allows the special tone characters, basically the small marks drawn above the base characters.

The example below removes everything except alphanumeric characters from any language, and also keeps the Thai tone characters.

Why are the tone characters not included in the any-letter list? I have no idea, in all honesty.

$str = preg_replace("/[^\p{L}\p{N}\p{Thai}]/u","",$str);

/*
List of the other script names:

\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}
*/
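
A possible alternative to listing scripts one by one, offered as a sketch rather than a definitive answer: Unicode classifies the Thai tone marks, like most combining accents in other scripts, as marks rather than letters, and PCRE exposes that category as \p{M}. Assuming the goal is simply "letters, numbers and whatever marks are attached to them", something like the following might do. The function name is hypothetical.

```php
<?php
// Hypothetical helper: keep letters (\p{L}), combining marks (\p{M}) and
// numeric characters (\p{N}) from any script; remove everything else.
function keep_alnum_with_marks(string $str): string
{
    // With the u modifier, preg_replace returns null when the input is not
    // valid UTF-8; fall back to an empty string in that case.
    return preg_replace('/[^\p{L}\p{M}\p{N}]/u', '', $str) ?? '';
}

// Thai word made of three code points:
// U+0E19 (letter), U+0E49 (tone mark MAI THO), U+0E33 (letter SARA AM).
$word = "\u{0E19}\u{0E49}\u{0E33}";

echo keep_alnum_with_marks($word . " 123!"), "\n";

// For comparison, \p{L}\p{N} on its own drops the tone mark.
echo preg_replace('/[^\p{L}\p{N}]/u', '', $word), "\n";
```

The first call keeps the tone mark; the second shows it being stripped when only \p{L} and \p{N} are allowed.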

This great article covers all of it in brilliant detail and gives you loads of examples to try: https://www.regular-expressions.info/unicode.html#category
