5

We are building a Python 3 program which calls a Java program. The Java program (which is a 3rd party program we cannot modify) is used to tokenize strings (find the words) and provide other annotations. Those annotations are in the form of character offsets.

As an example, we might provide the program with string data such as "lovely weather today". It provides something like the following output:

0,6
7,14
15,20

Here 0,6 are the offsets corresponding to the word "lovely", 7,14 are the offsets for the word "weather", and 15,20 are the offsets for the word "today" within the source string. We read these offsets in Python to extract the text at those points and perform further processing.
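
For illustration, extracting the annotated text on the Python side is just a matter of slicing with those offsets (a minimal sketch of the idea, not our actual processing code):

text = "lovely weather today"
offsets = [(0, 6), (7, 14), (15, 20)]
for start, end in offsets:
    print(text[start:end])  # lovely, weather, today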

All is well and good as long as the characters are within the Basic Multilingual Plane (BMP). However, when they are not, the offsets reported by this Java program show up all wrong on the Python side.

For example, given the string "I feel 🙂 today", the Java program will output:

0,1
2,6
7,9
10,15

On the Python side, these translate to:

0,1    "I"
2,6    "feel"
7,9    "🙂 "
10,15  "oday"

Here the last offset is technically invalid: 15 runs past the end of the 14-character Python string. Java sees "🙂" as length 2, which causes every annotation after that point to be off by one from the Python program's perspective.

Presumably this occurs because Java represents strings internally in a UTF-16-like way, and all of its string operations act on those UTF-16 code units. Python 3 strings, on the other hand, operate on actual Unicode characters (code points). So when a character falls outside the BMP, the Java program sees it as length 2, whereas Python sees it as length 1.
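
The mismatch is easy to demonstrate from the Python side (a quick sketch; the UTF-16 length is obtained by encoding the character explicitly, since Python itself counts code points):

ch = "🙂"
print(len(ch))                            # 1 -- Python counts code points
print(len(ch.encode("utf-16-le")) // 2)   # 2 -- UTF-16 needs a surrogate pair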

So now the question is: what is the best way to "correct" those offsets before Python uses them, so that the annotation substrings are consistent with what the Java program intended to output?

  • Could you be more explicit about what you are actually seeing as output? Those numbers you give are not the correct Unicode code points. Commented May 23, 2019 at 17:10
  • I didn't provide any unicode code points. Those are character offsets for the string passed to the Java program. I'll try to make that more clear in the text. Commented May 23, 2019 at 17:12
  • Then we'll need the actual data. Not more discussion. We need to see what the program is actually returning as output before we have any chance of guessing how we might read it. Commented May 23, 2019 at 17:13
  • I'm not quite following. I've provided an example of the Java program output. If you want the exact byte sequence (in hex) sent to the Java program for the example text "I feel 🙂 today" it is 492066656c20f09f998220746f646179 in the form of a UTF-8 encoded file that the Java program reads. Commented May 23, 2019 at 17:29
  • This isn’t Java’s fault. It’s the fault of whoever wrote that Java program, who wrongly assumed all characters are BMP characters. There are standard ways in Java to traverse Strings by Unicode code points instead of UTF-16 chars. I recommend letting them know of their mistake. Commented May 23, 2019 at 17:36

2 Answers

5

You could convert the string to a bytearray in UTF-16 encoding, then use the offsets (multiplied by 2, since there are two bytes per UTF-16 code unit) to index that array:

x = "I feel 🙂 today"
y = bytearray(x, "UTF-16LE")

offsets = [(0,1),(2,6),(7,9),(10,15)]

for word in offsets:
  print(str(y[word[0]*2:word[1]*2], 'UTF-16LE'))

Output:

I
feel
🙂
today

Alternatively, you could convert each Python character in the string individually to UTF-16 and count the number of code units it takes. This lets you map indices in terms of code units (what Java reports) to indices in terms of Python characters:

from itertools import accumulate

x = "I feel 🙂 today"
utf16offsets = [(0,1),(2,6),(7,9),(10,15)] # from java program

# map python string indices to an index in terms of utf-16 code units
chrLengths = [len(bytearray(ch, "UTF-16LE"))//2 for ch in x]
utf16indices = [0] + list(accumulate(chrLengths))
# reverse the map so that it maps utf16 indices to python indices
index_map = dict((u, i) for i, u in enumerate(utf16indices))

# convert the offsets from utf16 code-unit indices to python string indices
offsets = [(index_map[o[0]], index_map[o[1]]) for o in utf16offsets]

# now you can just use those indices as normal
for word in offsets:
  print(x[word[0]:word[1]])

Output:

I
feel
🙂
today

The above code is messy and can probably be made clearer, but you get the idea.
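
For reference, here is the same mapping idea wrapped up as a function (just a sketch; the function name and structure are mine, nothing standard):

from itertools import accumulate

def utf16_to_python_offsets(text, utf16_offsets):
    # UTF-16 code units per Python character: 1 for BMP characters, 2 otherwise.
    unit_lengths = [len(ch.encode("utf-16-le")) // 2 for ch in text]
    # UTF-16 code-unit index at the start of each character (plus one past the end).
    utf16_starts = [0] + list(accumulate(unit_lengths))
    # Reverse map: UTF-16 index -> Python string index.
    index_map = {u: i for i, u in enumerate(utf16_starts)}
    return [(index_map[start], index_map[end]) for start, end in utf16_offsets]

text = "I feel 🙂 today"
for start, end in utf16_to_python_offsets(text, [(0, 1), (2, 6), (7, 9), (10, 15)]):
    print(text[start:end])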


2 Comments

I initially thought this was great! However, it forces us to refactor all of our existing python code to operate on those bytes instead of strings... so while that's possible I'd reaaaaaallly like to avoid it... If there was a way to use this to remap those offsets from the UTF-16 offsets to the python unicode string offsets, that would be preferable...
Perfect! And probably much more efficient than my solution using for loops. Thank you!
1

This solves the problem given the proper encoding, which, in our situation, appears to be 'UTF-16BE':

def correct_offsets(input, offsets, encoding):
  # 'offsets' are in UTF-16 code units (from the Java program); the returned
  # offsets are in Python string indices (code points).
  offset_list = [{'old': o, 'new': [o[0],o[1]]} for o in offsets]

  utf16_pos = 0  # UTF-16 code-unit index of the character at position idx
  for idx in range(0, len(input)):
    units = len(input[idx].encode(encoding)) // 2
    if units > 1:
      # Character outside the BMP: every offset past it is one code unit too far.
      for o in offset_list:
        if o['old'][0] > utf16_pos:
          o['new'][0] -= 1
        if o['old'][1] > utf16_pos:
          o['new'][1] -= 1
    utf16_pos += units

  return [o['new'] for o in offset_list]

This may be pretty inefficient though. I gladly welcome any performance improvements.
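
One way this could probably be sped up (an untested sketch, not part of the solution above): record the UTF-16 positions of the non-BMP characters once, then count how many precede each offset with bisect instead of looping over every offset for every character.

from bisect import bisect_left

def correct_offsets_fast(text, offsets, encoding):
    # UTF-16 code-unit positions of the characters outside the BMP.
    astral_positions = []
    utf16_pos = 0
    for ch in text:
        units = len(ch.encode(encoding)) // 2
        if units > 1:
            astral_positions.append(utf16_pos)
        utf16_pos += units

    def shift(offset):
        # Subtract one for each non-BMP character that starts before this offset.
        return offset - bisect_left(astral_positions, offset)

    return [[shift(start), shift(end)] for start, end in offsets]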

