I have already tried using Normalizer:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
// Use the character class directly; Pattern.quote(...) would turn it into a literal string that never matches.
String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

System.out.println(s2);
System.out.println(s.length() == s2.length());

I want it to work on Unix/Linux.

2 Answers

There is an ASCII character class for matching code points in the ASCII set:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");

System.out.println(s2);
System.out.println(s.length() == s2.length());

As Joop Eggan notes, Java string and char types are always UTF-16. You can only have ASCII-encoded data in byte form:

byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);

1 Comment

Short and simple. The length comparison in general does not make sense. But it seems this is the answer. Any other problems are located elsewhere. However a conversion still might make sense, as some decoders may substitute special quotes (“ ”) and such by the ASCII quotes.

Explanation

First, in Java, text (String/Reader/Writer) is already Unicode. For Java source code (string literals), the editor and the javac compiler should use the same encoding, ideally UTF-8.

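The normalizer splits each character into a base letter plus combining diacritical mark(s), and the regular expression removes those marks. This converts accented letters and compatibility ligatures such as ä é ﬁ ﬂ ĉ to the ASCII a e fi fl c. Characters that have no Unicode decomposition, such as œ or ß, are left unchanged.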
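A minimal sketch of that decompose-then-strip step, for illustration only. The class name AccentStripper and the method stripAccents are hypothetical; the only library call assumed is the standard java.text.Normalizer. Unicode escapes are used so that the result does not depend on the source file encoding. The sketch also shows that a letter with no Unicode decomposition, such as ß, passes through unchanged.

```java
import java.text.Normalizer;

// Hypothetical helper; the names are illustrative and not from the original post.
class AccentStripper {

    // NFKD splits "Ä" into "A" plus a combining diaeresis and expands
    // compatibility ligatures; the regex then deletes the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u00C4 \u00E9 \u0109")); // Ä é ĉ
        System.out.println(stripAccents("\uFB01"));               // fi ligature
        System.out.println(stripAccents("\u00DF"));               // ß, no decomposition
    }
}
```

Anything that survives this step and is still outside ASCII (ß, œ, CJK characters) needs separate handling.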

Hence you would get - I think - "??? hello A".

Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);

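To avoid receiving question marks (which cannot be told apart from a ? in the original string), you can use Charset.newEncoder() for the String-to-bytes direction (or Charset.newDecoder() when reading bytes) and have it report unmappable characters instead of replacing them.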
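One way this could look, as a sketch rather than a definitive implementation: a strict US-ASCII encoder that fails on the first unmappable character instead of silently writing '?'. The class StrictAscii and its method names are hypothetical; CharsetEncoder, CodingErrorAction and canEncode are standard java.nio.charset API. The checked CharacterCodingException is wrapped in an IllegalArgumentException purely for convenience here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative and not from the original post.
class StrictAscii {

    // Encodes to US-ASCII and throws if any character cannot be represented,
    // rather than substituting '?' the way String.getBytes does.
    static byte[] encodeStrict(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Text is not pure ASCII", e);
        }
    }

    // Cheap pre-check that does not allocate the encoded bytes.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }
}
```

With this, a genuine "?" in the input encodes normally, while a character such as 口 raises an error that the caller can handle explicitly.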

For non-Latin text such as 口水雞, getting to ASCII would still require some transliteration to Latin script.

Answer

Since most recent Linux distributions already use UTF-8 as the default encoding, you can probably simply do:

System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);

Here s is converted to the operating system encoding.
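The platform default can also apply when text is read from a file without naming a charset (before Java 18 the default followed file.encoding; since Java 18 it is UTF-8). A hedged sketch that decodes explicitly and then folds to ASCII follows. It assumes the file is stored as UTF-8, which may not hold for your data; the class FileToAscii and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.Normalizer;

// Hypothetical helper; the names are illustrative and not from the original post.
class FileToAscii {

    // Decodes raw bytes as UTF-8 (an assumption about the data), decomposes
    // with NFKD, then drops everything that is still outside ASCII.
    static String foldUtf8Bytes(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]+", "");
    }

    // Reads the whole file and folds it; the charset never depends on the
    // platform default because the bytes are decoded explicitly above.
    static String readAsAscii(Path file) {
        try {
            return foldUtf8Bytes(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the file is actually in another encoding such as ISO-8859-1, the charset passed to the String constructor would need to match it.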

4 Comments

Yes, exactly. I want to prevent the ? character: I want PROJEçãO to print as PROJECAO. I am able to convert it when I type PROJEçãO into a String object, but when I read it from a file it prints as PROJE???
"The regular expression converts text like ä é ß ĉ œ to a e ss c oe, with accents, to ASCII." What do you mean by that? Normalizer does not split ligatures (ß or œ) nor many letters with diacritics (ø or ł).
@KarolS Yes, bad formulation (partly corrected); it was intended to clarify, should it not be clear. I did not find the ff or fi ligatures in the character map; as using string length is not so good an idea.
ﬀ is \uFB00, and this ligature is normalized to ff. But still, the normalizer doesn't do anything to ß. Normalizer.normalize("ß", Normalizer.Form.WHICHEVER) returns "ß", not "ss", and the regex does nothing afterwards.
