1

How can I remove all non-language characters?

I want to remove characters like the ones below, and all other non-language characters:



I am using this:

preg_replace("/[^a-z0-9A-Z\-\'\|\!\.\?\:\)\(\;\*\"]/u", " ", $text );

This works well for English, but I need to allow the characters of all languages, such as Russian, Arabic, Hebrew, Japanese...

Are there any string functions I can use to keep all language characters?

Thanks.

3
  • What you have there are code points in the private use area. By "non-language characters", do you mean characters that are not typically used, like private use area code points? Or any symbols, like "☃"? What about "→"? That's useful in written text. Commented Jan 25, 2012 at 11:32
  • Yes, I want to remove all symbols and other characters that are not typically found on a regular keyboard, as I already do by keeping A-Z, but for all languages. Commented Jan 25, 2012 at 11:35
  • How far do you want to go for "text"? There are giant sections for lots of typography related things, which is arguably language related. What's the primary goal/reason for this? Commented Jan 25, 2012 at 11:36

2 Answers

11

No regex will be perfect for what you want - language and writing are just too complex for this. But an approximation could be

preg_replace('/[^\p{L}\p{M}\p{Z}\p{N}\p{P}]/u', ' ', $text);

This replaces with a space every character that does not have one of the Unicode properties “letter”, “mark”, “separator”, “number” or “punctuation”.
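To see how this pattern treats different scripts, here is a minimal sketch, assuming PHP 7 or later and UTF-8 input. The helper name `keep_language_chars` and the sample strings are made up for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper; the pattern is the one given in the answer.
function keep_language_chars(string $text): string
{
    // Replace every code point that is NOT a letter (L), mark (M),
    // separator (Z), number (N) or punctuation (P) with a single space.
    $result = preg_replace('/[^\p{L}\p{M}\p{Z}\p{N}\p{P}]/u', ' ', $text);

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input;
    // in that case this sketch simply hands back the original text.
    return $result ?? $text;
}

$samples = [
    "Hello, world!",
    "Привет, мир!",
    "שלום עולם",
    "日本語のテキスト。",
    "snow \u{2603} man",     // U+2603 SNOWMAN is a symbol
    "private \u{E000} use",  // U+E000 is a private use code point
];

foreach ($samples as $sample) {
    echo keep_language_chars($sample), PHP_EOL;
}
```

The first four samples should come back unchanged, while the snowman and the private use code point each become a space. Two side effects are worth checking against your own data: tabs and newlines are control characters rather than separators, so they are replaced as well, and ASCII symbols such as `|`, `+`, `=` and `$` are not punctuation in Unicode terms, so the `|` that the pattern in the question kept would be removed here.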


1

Tim Pietzcker's answer did not work in my case.

This works.

$after = preg_replace('/[^\w\s]+/u','' , $before);
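A small sketch of what this pattern keeps and drops, assuming PHP 7 or later with a PCRE build that has Unicode property support, so that `\w` and `\s` become Unicode-aware under the `/u` modifier. The sample string is invented for illustration.

```php
<?php
// Sample input mixing Cyrillic, Japanese, punctuation and a symbol.
$before = "Привет, мир! 日本語 \u{2603} ok_1";

// Delete every run of characters that are neither word characters
// (letters, digits, underscore) nor whitespace.
$after = preg_replace('/[^\w\s]+/u', '', $before);

echo $after, PHP_EOL;
```

Under those assumptions the letters, digits, underscore and spaces survive, while the comma, the exclamation mark and the snowman are deleted. Unlike the pattern in the first answer, this one removes punctuation and deletes the characters instead of replacing them with a space.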
