Java find the index of string based on the utf-8 encoding index

Question

Consider the following string:

String text="un’accogliente villa del.";

I have the begin index of the word "accogliente", which is 5, but it was precalculated based on the UTF-8 encoding.

I want the exact index of the word in the Java string, which is 3. That is, I want to get 3 as output from 5. What is the best way of calculating it?

If I understood you correctly, why did you not use indexOf, which correctly gives 3?
– Glains, Commented Aug 6, 2018 at 13:32
I have edited the question. I don't have the word accogliente. I only have the sentence and the UTF-8 index, i.e. 5. From those values I need to find 3. @Eugene
– din_oops, Commented Aug 6, 2018 at 13:34
So you have the sentence and a startIndex = 5. You want to get the index where the word containing that startIndex (5) is positioned?
– Eugene, Commented Aug 6, 2018 at 13:38
"I have the begin index of word "accogliente" which is 5": what is this supposed to mean? Voting to close as unclear...
– Eugene, Commented Aug 6, 2018 at 13:43

Joop Eggen · Accepted Answer · 2018-08-06 15:39:49Z

3

import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

String text = "un’accogliente villa del."; // Unicode text
text = Normalizer.normalize(text, Form.NFC); // Normalize text

byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // UTF-8: index 5; 1-byte units
char[] chars = text.toCharArray();                    // UTF-16: index 3; 2-byte units (what indexOf uses)
int[] codePoints = text.codePoints().toArray();       // UTF-32: index 3; 4-byte units

int charIndex = text.indexOf("accogliente");
int codePointIndex = (int) text.substring(0, charIndex).codePoints().count();
int byteIndex = text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;

UTF-32 corresponds directly to the Unicode code points, the numbering of all symbols written as U+XXXX, where there may be more than 4 hexadecimal digits (up to 6).
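As a rough illustration of how the three index units can diverge, here is a minimal sketch using a string that contains a character outside the Basic Multilingual Plane. The class name `IndexUnits` and its helper methods are made up for this example; only standard `String` and `Character` calls are used.

```java
import java.nio.charset.StandardCharsets;

final class IndexUnits {
    /** Number of code points before the given UTF-16 char index. */
    static int codePointIndex(String text, int charIndex) {
        return text.codePointCount(0, charIndex);
    }

    /** Number of UTF-8 bytes before the given UTF-16 char index. */
    static int utf8ByteIndex(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Conventional U+ notation: at least four hex digits, more when needed. */
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 (stored as a surrogate pair in UTF-16), then 'b'
        String text = "a\uD83D\uDE00b";
        int charIndex = text.indexOf('b');
        System.out.println(charIndex);                       // 3
        System.out.println(codePointIndex(text, charIndex)); // 2
        System.out.println(utf8ByteIndex(text, charIndex));  // 5
        System.out.println(uPlus(text.codePointAt(1)));      // U+1F600
        System.out.println(uPlus('\u2019'));                 // U+2019
    }
}
```

For the sentence in the question the char index and the code point index happen to coincide (both 3), because the quote U+2019 fits in a single UTF-16 char; only the UTF-8 byte index differs.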

Text normalisation is needed because é can be either one code point (U+00E9) or two code points: an e followed by a zero-width combining ´ (U+0301).
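A small sketch of why this matters for index calculations: the two spellings of é have different lengths in both chars and UTF-8 bytes until they are normalised to the same form. The class name `NormalizationDemo` and its helpers are hypothetical names for this example; `java.text.Normalizer` is the standard API.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // single code point
        String decomposed = "e\u0301"; // base letter plus combining acute accent

        System.out.println(composed.equals(decomposed));                        // false
        System.out.println(composed.length() + " " + utf8Length(composed));     // 1 2
        System.out.println(decomposed.length() + " " + utf8Length(decomposed)); // 2 3
        System.out.println(nfc(decomposed).equals(composed));                   // true
        System.out.println(nfd(composed).equals(decomposed));                   // true
    }
}
```

If the byte offsets were computed on text in one form, the same form has to be used when mapping them back, otherwise every index after an accented letter may shift.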

To convert the UTF-8 byte index to the UTF-16 char index, as the question asks:

int charIndex = new String(text.getBytes(StandardCharsets.UTF_8),
                           0, byteIndex, StandardCharsets.UTF_8).length();
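The same conversion can also be sketched without encoding and decoding the text, by walking the string and summing the UTF-8 length of each code point until the byte offset is reached. This is only one possible variant, not part of the original answer; the class `Utf8Offsets` and the method `byteToCharIndex` are names invented here, and the sketch assumes well-formed text (no unpaired surrogates) and an offset that falls on a character boundary.

```java
final class Utf8Offsets {
    /**
     * Converts a UTF-8 byte offset into the corresponding UTF-16 char index.
     * Assumes the text contains no unpaired surrogates.
     */
    static int byteToCharIndex(String text, int byteIndex) {
        int bytes = 0; // UTF-8 bytes consumed so far
        int i = 0;     // current UTF-16 char index
        while (i < text.length() && bytes < byteIndex) {
            int cp = text.codePointAt(i);
            // UTF-8 length of a code point: 1, 2, 3 or 4 bytes
            bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            i += Character.charCount(cp);
        }
        if (bytes != byteIndex) {
            throw new IllegalArgumentException(
                    "Byte offset " + byteIndex + " is not on a character boundary");
        }
        return i;
    }

    public static void main(String[] args) {
        String text = "un\u2019accogliente villa del.";
        System.out.println(byteToCharIndex(text, 5)); // 3
    }
}
```

Unlike the one-liner above, this variant rejects offsets that land in the middle of a multi-byte character instead of silently decoding a replacement character.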

edited Aug 6, 2018 at 15:39

answered Aug 6, 2018 at 13:45

Joop Eggen


3 Comments

Eugene Over a year ago

@JoopEggen it seems the OP's requirement is a bit different: they have a startIndex = 5, have to find the word (the first one, I assume) containing this letter, and then, after stripping non-ASCII letters, find the index that word is at. I think this is what they need.

Joop Eggen Over a year ago

@TweetMan Sorry, typo: java.text.Normalizer and java.text.Normalizer.Form.NFKC. For this question, normalisation of the text is not really needed.

Joop Eggen Over a year ago

@Eugene his mention of UTF-8 seems to indicate that 5 is the byte index of acco. Especially as the special quote U+2019 indeed is 3 bytes long in UTF-8.

rav3n6 · Accepted Answer · 2018-08-06 13:37:05Z

1

The code below returns 3. Am I missing something in your question?

String text="un’accogliente villa del.";
text.indexOf("accogliente");

answered Aug 6, 2018 at 13:37

rav3n6


2 Comments

Glains Over a year ago

OP explained that this is not what he is looking for.

rav3n6 Over a year ago

Yeah..got it ! @Glains

Eugene · Accepted Answer · 2018-08-06 14:51:41Z

1

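Well, assuming that this startIndex can only point at a letter (an ASCII one), you could do:

import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "un’accogliente villa del.";
char c = text.charAt(5);
String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
normalized = normalized.replaceAll("[^\\p{ASCII}]", " ");

// Inside a character class, $ and | are literals, so use a group for "whitespace or end of input"
Pattern p = Pattern.compile("\\p{L}*?" + Pattern.quote(String.valueOf(c)) + "\\p{L}*?(?:$|\\s)");
Matcher m = p.matcher(normalized);

if (m.find()) {
    System.out.println(m.start(0));
}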

answered Aug 6, 2018 at 14:51

Eugene
