Consider the following string:

String text="un’accogliente villa del.";

I have the start index of the word "accogliente", which is 5, but that value was precomputed as a byte offset in the UTF-8 encoding of the text.

I want the index of the word within the Java String, which is 3. That is, given 5 as input, I want to get 3 as output. What is the best way to calculate it?

  • If I understood you correctly, why did you not use indexOf, which correctly gives 3? Commented Aug 6, 2018 at 13:32
  • "un'accogliente villa del.".indexOf("accogliente") == 3 Commented Aug 6, 2018 at 13:33
  • I have edited the question. I don't have the word accogliente. I only have the sentence and the UTF-8 index, i.e. 5. From those values I need to find 3. @Eugene Commented Aug 6, 2018 at 13:34
  • So you have the sentence and a startIndex = 5, and you want the index at which the word containing that startIndex (5) is positioned? Commented Aug 6, 2018 at 13:38
  • "I have the begin index of word "accogliente" which is 5": what is this supposed to mean? Voting to close as unclear... Commented Aug 6, 2018 at 13:43

3 Answers

import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

String text = "un’accogliente villa del."; // Unicode text
text = Normalizer.normalize(text, Form.NFC); // Normalize text

byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // UTF-8: 1-byte units; the word starts at index 5
char[] chars = text.toCharArray();                    // UTF-16: 2-byte units; the word starts at index 3 (indexOf)
int[] codePoints = text.codePoints().toArray();       // UTF-32: 4-byte units; the word starts at index 3

int charIndex = text.indexOf("accogliente");
int codePointIndex = (int) text.substring(0, charIndex).codePoints().count();
int byteIndex = text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;

UTF-32 stores the Unicode code points directly, that is, the numbering of all symbols written as U+XXXX, where there may be more than four hexadecimal digits (up to six).
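To make the three units concrete, here is a small sketch that measures the same position in bytes, chars and code points. The class name IndexUnits and its helper methods are made up for illustration, and the emoji string is an extra example that is not part of the question.

```java
import java.nio.charset.StandardCharsets;

public class IndexUnits {
    // Number of UTF-8 bytes that precede the given char index (hypothetical helper).
    static int byteOffset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of code points that precede the given char index (hypothetical helper).
    static int codePointOffset(String s, int charIndex) {
        return s.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        String quote = "un\u2019accogliente"; // U+2019 is 1 char, but 3 bytes in UTF-8
        String emoji = "\uD83D\uDE00abc";     // U+1F600 is 2 chars (surrogate pair), 4 bytes in UTF-8

        System.out.println(byteOffset(quote, 3));      // 5
        System.out.println(codePointOffset(quote, 3)); // 3
        System.out.println(byteOffset(emoji, 2));      // 4
        System.out.println(codePointOffset(emoji, 2)); // 1
    }
}
```

For text in the Basic Multilingual Plane, such as the sentence in the question, the char index and the code point index coincide; they only diverge once surrogate pairs appear.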

Text normalisation is needed because é can be encoded either as one code point (U+00E9) or as two code points: an e followed by a zero-width combining ´ (U+0301).
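A minimal sketch of that difference, assuming a made-up class NormalizeDemo and the sample word "café" (neither appears in the question): the two spellings are different strings until they are brought to the same normalisation form.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Compose base letters and combining marks where possible.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decompose precomposed letters into base letter plus combining marks.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";    // é as a single code point
        String decomposed = "cafe\u0301"; // e followed by a combining acute accent

        System.out.println(composed.equals(decomposed));      // false
        System.out.println(nfc(decomposed).equals(composed)); // true
        System.out.println(nfd(composed).length());           // 5
    }
}
```

Because normalising can change the length of the string, any index should be computed against the same form of the text that produced the original offset.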

To answer the actual question, converting a UTF-8 byte index to a UTF-16 char index:

int charIndex = new String(text.getBytes(StandardCharsets.UTF_8),
                           0, byteIndex, StandardCharsets.UTF_8).length();
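Wrapped into a method, the conversion could look like the sketch below. The class name Utf8Offsets and the method byteToCharIndex are hypothetical, and the boundary check is an addition of mine: it assumes you would rather fail than silently decode half of a multi-byte sequence.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Offsets {
    // Converts a byte offset into the UTF-8 encoding of text
    // to the corresponding char (UTF-16) index in text.
    static int byteToCharIndex(String text, int byteIndex) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (byteIndex < 0 || byteIndex > bytes.length) {
            throw new IndexOutOfBoundsException("byteIndex: " + byteIndex);
        }
        // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        if (byteIndex < bytes.length && (bytes[byteIndex] & 0xC0) == 0x80) {
            throw new IllegalArgumentException("byteIndex is inside a multi-byte sequence");
        }
        // Decode only the prefix; its length in chars is the wanted index.
        return new String(bytes, 0, byteIndex, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        String text = "un\u2019accogliente villa del.";
        System.out.println(byteToCharIndex(text, 5)); // 3
    }
}
```

This encodes the whole string on every call, which is fine for a single sentence; for many lookups in a long text it may be worth encoding once and reusing the byte array.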

3 Comments

@JoopEggen It seems the OP's requirement is a bit different: they have startIndex = 5, need to find the word (the first one, I assume) containing that letter, and then, after stripping non-ASCII characters, find the index at which that word starts. I think this is what they need.
@TweetMan Sorry, typo: it is java.text.Normalizer and java.text.Normalizer.Form.NFKC. For this question, normalisation of the text is not really needed.
@Eugene The mention of UTF-8 seems to indicate that 5 is the byte index of "acco", especially as the special quote U+2019 is indeed 3 bytes long in UTF-8.

The code below returns 3. Am I missing something in your question?

String text="un’accogliente villa del.";
text.indexOf("accogliente");

2 Comments

OP explained that this is not what he is looking for.
Yeah..got it ! @Glains

Well, assuming that this startIndex can only point at a letter (an ASCII one), you could do:

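import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "un’accogliente villa del.";
char c = text.charAt(5);
// Decompose accented letters, then blank out everything that is not ASCII
String normalized = Normalizer.normalize(text, Normalizer.Form.NFD);
normalized = normalized.replaceAll("[^\\p{ASCII}]", " ");

// First word containing c, ending at whitespace or end of input;
// c is quoted in case it is a regex metacharacter
Pattern p = Pattern.compile("\\p{L}*?" + Pattern.quote(String.valueOf(c)) + "\\p{L}*?(?:\\s|$)");
Matcher m = p.matcher(normalized);

if (m.find()) {
    System.out.println(m.start(0)); // start index of the matched word
}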
