Convert a Unicode string to ASCII in Java in a way that works on Unix/Linux

Question

I have already tried using Normalizer

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = Pattern.quote("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

System.out.println(s2);
System.out.println(s.length() == s2.length());

I want it to work on Unix/Linux.

You mean to say the regex is for UTF-8?

– anshulkatta, Commented Jun 26, 2014 at 6:37
I got this from stackoverflow.com/questions/15356716/…

– anshulkatta, Commented Jun 26, 2014 at 6:38

Community · Accepted Answer · 2017-05-23 12:20:44Z

1

There is an ASCII character class for matching code points in the ASCII set:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");

System.out.println(s2);
System.out.println(s.length() == s2.length());

As Joop Eggen notes, Java string and char types are always UTF-16. You can only have ASCII-encoded data in byte form:

byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);

edited May 23, 2017 at 12:20

CommunityBot


answered Jun 26, 2014 at 8:26

McDowell


1 Comment

Joop Eggen Over a year ago

Short and simple. The length comparison does not make sense in general. But this seems to be the answer; any other problems lie elsewhere. However, a conversion might still make sense, as some decoders may replace special quotes (“ ”) and the like with ASCII quotes.

Joop Eggen · Accepted Answer · 2014-06-30 06:25:55Z

0

Explanation

First, in Java, text (String/Reader/Writer) is already Unicode. For Java source code (string literals), the editor and the javac compiler must use the same encoding, ideally UTF-8.

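The normalizer splits each character into a base letter and combining diacritical mark(s), and the regular expression then removes those marks. Text with accents or compatibility ligatures such as ä é ﬁ ﬂ ĉ becomes a e fi fl c, which is plain ASCII. Letters that have no decomposition, such as œ, ß, ø or ł, pass through unchanged and need separate handling.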
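The normalize-then-strip step described above can be sketched as follows. This is a minimal illustration, not the asker's exact code: the class and method names are made up for the example, and non-ASCII characters are written as \u escapes so the result does not depend on the source file encoding. It passes the pattern to replaceAll directly; wrapping it in Pattern.quote, as the question does, would make replaceAll search for the literal text of the pattern instead.

```java
import java.text.Normalizer;

// Hypothetical helper class, named for this example only.
final class AccentFoldSketch {

    // Decompose with NFKD, then delete the combining marks that were split off.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // a-umlaut, e-acute, fi ligature, fl ligature, c-circumflex
        System.out.println(stripMarks("\u00E4 \u00E9 \uFB01 \uFB02 \u0109")); // a e fi fl c

        // sharp s, oe, o-slash, l-stroke: no decomposition, so they come back unchanged
        System.out.println(stripMarks("\u00DF \u0153 \u00F8 \u0142"));
    }
}
```

Characters in the second group are the ones raised in the comments below; they would need an explicit replacement table or a transliteration library.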

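Hence you would get, I think, "??? hello A": getBytes replaces each character that ASCII cannot represent (here the three Chinese characters) with a question mark.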

Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);

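To avoid getting the question marks (which cannot be distinguished from a ? in the original string), you can use a CharsetEncoder from Charset.newEncoder(), or a CharsetDecoder from Charset.newDecoder() when reading bytes, and choose what happens to unmappable characters.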
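One way to do that for the String-to-bytes direction is sketched below, assuming you would rather get an error than a silent substitution. The class and method names are hypothetical; the calls used (newEncoder, onUnmappableCharacter, encode, canEncode) are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
final class StrictAsciiSketch {

    // Encode to ASCII bytes; throw instead of substituting '?' for unmappable characters.
    static byte[] toAsciiStrict(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Text is not representable in ASCII", e);
        }
    }

    // Cheap check that does not produce any bytes.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("hello?"));        // true: a real '?' is fine
        System.out.println(isPureAscii("hello \u00C4"));  // false: A-umlaut is not ASCII
        System.out.println(toAsciiStrict("hello?").length); // 6
    }
}
```

Using CodingErrorAction.IGNORE instead of REPORT would drop the unmappable characters rather than fail, which is closer to what the regex-based approach does.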

For ASCII you would still need some transliteration to Latin script.

Answer

Since most recent Linux distributions already use UTF-8 as the operating system default encoding, you can probably simply do:

System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);

Here s is converted to the operating system encoding.
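If the platform default cannot be relied on, the output encoding can also be stated explicitly. The sketch below is one possible way to do that, assuming a UTF-8 capable terminal; the class and method names are hypothetical, and the PrintStream constructor taking a charset name is standard java.io API.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, named for this example only.
final class ExplicitOutputEncodingSketch {

    // Wrap any output stream so that text written to it is encoded as UTF-8,
    // regardless of the file.encoding system property.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // The sample string from the question, written with \u escapes.
        out.println("\u53E3\u6C34\u96DE hello \u00C4");
    }
}
```

The same idea applies when reading: naming the charset on the reader, rather than relying on the default, avoids text being decoded with the wrong encoding.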

edited Jun 30, 2014 at 6:25

answered Jun 26, 2014 at 7:24

Joop Eggen


4 Comments

anshulkatta Over a year ago

Yes, exactly. I want to prevent the ? character. I want PROJEçãO to print as PROJECAO. I am able to convert it when I type PROJEçãO into a String object, but when I read it from a file it prints as PROJE???

Karol S Over a year ago

"The regular expression converts text like ä é ß ĉ œ to a e ss c oe, with accents, to ASCII." What do you mean by that? Normalizer does not split ligatures (ß or œ) nor many letters with diacritics (ø or ł).

Joop Eggen Over a year ago

@KarolS yes, bad formulation (partly corrected); it was intended to clarify in case it was not clear. I did not find the ff or fi ligatures in the character map; they are why using string length is not such a good idea.

Karol S Over a year ago

ﬀ is \uFB00, and this ligature is normalized to ff. But the normalizer still doesn't do anything to ß: Normalizer.normalize("ß", form) returns "ß", not "ss", whichever normalization form is used, and the regex does nothing afterwards.
