How do I detect unicode characters in a Java string?

Question

Suppose I have a string that contains Ü. How would I find all those unicode characters? Should I test for their code? How would I do that?

For example, given the string "AÜXÜ", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.

How do you know what Ü will map to without your own map? There is no simple mapping, and I suspect the mapping might differ between languages.
– mmmmmm, Commented Nov 4, 2009 at 12:44
Actually, you can do it by looking at the chars one by one. It depends on the "range" of the char, but that's quite low-level, and I assume something already exists to do this. See en.wikipedia.org/wiki/Unicode
– Aif, Commented Nov 4, 2009 at 12:45

Dave Jarvis · Accepted Answer · 2017-10-04 20:57:00Z

18

The definition of "unicode characters" is vague, but it is taken here to mean characters that fall outside the standard ISO 8859 charset. If that is true in your case, loop through all the characters in the String and test each code point to determine whether it falls within the given character set.
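
As a rough sketch of that code point test, the snippet below asks a CharsetEncoder whether each code point can be encoded in ISO-8859-1. The names CharsetCheck and outsideLatin1 are made up for illustration, and ISO-8859-1 is only an assumed choice of charset. Note that Ü itself is part of ISO-8859-1, so under this definition it would not be flagged.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class CharsetCheck {
    /** Returns the code points of text that ISO-8859-1 cannot encode (hypothetical helper). */
    static List<Integer> outsideLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        List<Integer> result = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            // canEncode accepts a CharSequence, so turn the code point back into a String.
            if (!encoder.canEncode(new String(Character.toChars(cp)))) {
                result.add(cp);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // U+00DC (U with diaeresis) is in ISO-8859-1; U+20AC (euro sign) is not.
        System.out.println(outsideLatin1("A\u00DC\u20AC")); // [8364]
    }
}
```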

Alternatively, use a Map<Character, Character> and replace the characters in the string that match the map's keys. For example:

import java.util.HashMap;
import java.util.Map;

// Maps each character to replace onto its replacement.
Map<Character, Character> charReplacementMap = new HashMap<>();
charReplacementMap.put('Ü', 'Y');
// Put more here.

String originalString = "AÜAÜ";
StringBuilder builder = new StringBuilder();

for (char currentChar : originalString.toCharArray()) {
    // Use the mapped character if there is one; otherwise keep the original.
    Character replacementChar = charReplacementMap.get(currentChar);
    builder.append(replacementChar != null ? replacementChar : currentChar);
}

String newString = builder.toString(); // "AYAY"

Or, do you mean "all characters with diacritics"? If so, then use java.text.Normalizer to remove diacritical marks:

import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 * Remove any diacritical marks (accents like ç, ñ, é, etc) from
 * the given string (so that it returns plain c, n, e, etc).
 * @param string The string to remove diacritical marks from.
 * @return The string with removed diacritical marks, if any.
 */
public static String removeDiacriticalMarks(String string) {
    // NFD splits each accented character into a base letter plus combining marks;
    // the regex then strips the combining marks.
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

One pitfall: Ü would become U, not Y. I'm not sure whether that's what you're after. If you want to replace each character by the way it is pronounced, you really will need to create a mapping. Sure, it's tedious work, but it takes less time than you needed to follow this topic.
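
If both behaviours are wanted, one possible sketch is to apply a small map of explicit overrides first and let Normalizer handle whatever is left. The class name Transliterator and the OVERRIDES map are hypothetical, and the single Ü to Y entry is only the example from the question, not a real transliteration table.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class Transliterator {
    // Hypothetical overrides for characters whose replacement is not the base letter.
    private static final Map<Character, String> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put('\u00DC', "Y"); // U+00DC (U with diaeresis) -> Y, as in the question
    }

    static String transliterate(String input) {
        // Compose first so that "U" followed by a combining diaeresis is seen as one character.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder mapped = new StringBuilder(composed.length());
        for (char c : composed.toCharArray()) {
            String replacement = OVERRIDES.get(c);
            mapped.append(replacement != null ? replacement : String.valueOf(c));
        }
        // Decompose what is left and strip the combining marks.
        return Normalizer.normalize(mapped, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) has no override, so it falls through to the Normalizer step.
        System.out.println(transliterate("A\u00DCX\u00E9")); // AYXe
    }
}
```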

edited Oct 4, 2017 at 20:57

Dave Jarvis


answered Nov 4, 2009 at 12:48

BalusC


9 Comments

Geo Over a year ago

It's how I usually did it. But this would require you to add each character to the map.

BalusC Over a year ago

I don't see any other efficient way to replace specific characters with specific other characters when more than one character is involved.

C. Ross Over a year ago

If you don't add each character to the map, how do you define the replacement? Or do you want all non-ascii characters replaced by a single ascii character?

Stephen C Over a year ago

@BalusC - actually, the real definition of what is a Unicode character (codepoint) is very precise. The problem is that the OP does not understand that the ASCII characters are a proper subset of the Unicode codepoints.

BalusC Over a year ago

Or do you just want to remove diacritical marks? I've edited my post with it.


jitter · Accepted Answer · 2009-11-04 12:48:53Z

17

You could loop through your string and, for every character c, check:

if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
    // replace with Y
}
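
Spelled out as a complete loop, that check might look like the sketch below. The class and method names are hypothetical. It walks the string by code point rather than by char so that a surrogate pair is replaced once instead of twice, which is an addition to what the answer describes.

```java
final class BasicLatinFilter {
    /** Replaces every code point outside the BASIC_LATIN block with the given char. */
    static String replaceNonBasicLatin(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            // UnicodeBlock.of returns null for code points in no block; those are replaced too.
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceNonBasicLatin("A\u00DCX\u00DC", 'Y')); // AYXY
    }
}
```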

answered Nov 4, 2009 at 12:48

jitter


2 Comments

BalusC Over a year ago

Good one to test codepoints, but I don't have the impression that he wants to replace every character by Y.

jitter Over a year ago

Well, he says "unicode characters", and by that I understand he probably means replacing all non-ASCII characters with Y. Whatever.

msp · Accepted Answer · 2019-03-04 08:08:12Z

12

You could go the other way round and ask whether the character is an ASCII character.

public static boolean isAscii(char ch) {
    return ch < 128;
}

You'd have to analyse the string char by char then, of course.

(The method is from commons-lang's CharUtils, which contains loads of useful Character methods.)
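
A minimal sketch of that char-by-char analysis, without the commons-lang dependency: the AsciiScan class and firstNonAscii method are illustrative names, and isAscii is simply restated from above. Scanning by char is enough for detection, because both halves of a surrogate pair are also 128 or above.

```java
final class AsciiScan {
    static boolean isAscii(char ch) {
        return ch < 128;
    }

    /** Returns the index of the first non-ASCII char, or -1 if the string is pure ASCII. */
    static int firstNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isAscii(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstNonAscii("A\u00DCX\u00DC")); // 1
        System.out.println(firstNonAscii("AYXY"));           // -1
    }
}
```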

edited Mar 4, 2019 at 8:08

answered Nov 4, 2009 at 12:44

msp


Community · Accepted Answer · 2017-05-23 12:17:40Z

2

It isn't clear to me exactly what is gained by transforming "AÜXÜ" to "AYXY". Is this because Ü is pronounced like Y in a particular language? What language? And what other rules might apply?

In terms of terminology...

"a"

The above is a Unicode string. It contains a single UTF-16 encoded character.
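
To make the terminology concrete, here is a small sketch that compares UTF-16 code units with Unicode code points for a few strings. The CodeUnits class is only for illustration. Every Java String is a sequence of UTF-16 code units, so "a" and "Ü" are both one unit and one code point, while a character outside the Basic Multilingual Plane takes two units.

```java
final class CodeUnits {
    /** Number of UTF-16 code units (what String.length() reports). */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a", U+00DC (U with diaeresis), and U+1F600 written as a surrogate pair.
        String[] samples = { "a", "\u00DC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(codeUnits(s) + " code unit(s), " + codePoints(s) + " code point(s)");
        }
    }
}
```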

If you wish to limit the range of characters to the English alphabet, have a look at the Normalization performed in this answer.

edited May 23, 2017 at 12:17

CommunityBot


answered Nov 4, 2009 at 12:50

McDowell


1 Comment

Geo Over a year ago

It was just a replacement example. I'll actually replace the character by _XX_ :)

Aliaxander · Accepted Answer · 2017-06-06 09:43:02Z

2

The class Character also offers some interesting methods. Take a look at it.

Character.UnicodeBlock.of('a') == Character.UnicodeBlock.BASIC_LATIN; //true

Character.UnicodeBlock.of('Ü') == Character.UnicodeBlock.BASIC_LATIN; //false

edited Jun 6, 2017 at 9:43

Aliaxander


answered Jun 6, 2017 at 9:28

Bhanu PS Kushwah


Dominic Rodger · Accepted Answer · 2009-11-04 12:45:46Z

1

I'm not sure from your example what you're trying to do. If you're just trying to replace all non-ASCII values with Y, then you could loop through the string looking for code points outside the range 0 to 127, and replace those code points with Y.
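
One way to sketch that range check is with a code point stream. NonAsciiReplacer and replaceNonAscii are hypothetical names. The replacement is taken as a String rather than a single char, on the assumption that a placeholder such as the _XX_ mentioned in the comments may be wanted instead of Y.

```java
import java.util.stream.Collectors;

final class NonAsciiReplacer {
    /** Replaces every code point above 127 with the given replacement text. */
    static String replaceNonAscii(String input, String replacement) {
        return input.codePoints()
                // Keep ASCII code points as they are; swap everything else for the replacement.
                .mapToObj(cp -> cp > 127 ? replacement : new String(Character.toChars(cp)))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(replaceNonAscii("A\u00DCX\u00DC", "Y"));    // AYXY
        System.out.println(replaceNonAscii("A\u00DCX\u00DC", "_XX_")); // A_XX_X_XX_
    }
}
```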

answered Nov 4, 2009 at 12:45

Dominic Rodger
