A small stocktaking:
`String` holds Unicode text; it can be normalized (`java.text.Normalizer`).
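A minimal sketch of what normalization changes, using the standard `Normalizer` on an accented character:

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed = "\u00E9"; // é as a single precomposed code point
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2: 'e' plus combining acute accent
        System.out.println(composed.equals(decomposed)); // false

        // Re-composing with NFC makes the two comparable again
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true
    }
}
```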
An `int[]` of code points holds Unicode symbols, one `int` per code point.
`char[]` holds Unicode UTF-16 code units (2 bytes per `char`); sometimes a code point needs 2 `char`s: a surrogate pair.
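A small illustration of the `char` vs. code point difference, using an emoji that requires a surrogate pair:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' plus U+1F600 (grinning face), a surrogate pair

        System.out.println(s.length());                      // 3 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points

        int[] codePoints = s.codePoints().toArray();  // one int per symbol
        System.out.printf("U+%04X%n", codePoints[1]); // U+1F600

        // And back from code points to a String
        String round = new String(codePoints, 0, codePoints.length);
        System.out.println(round.equals(s)); // true
    }
}
```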
`byte[]` is for binary data. Holding Unicode text in UTF-8 is relatively compact when there is much ASCII or Latin-1.
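A quick size comparison (the sample strings are arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class Utf8SizeDemo {
    public static void main(String[] args) {
        String ascii = "hello";      // 5 ASCII characters
        String cjk = "\u4F60\u597D"; // 2 CJK characters (ni hao)

        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);  // 5: 1 byte per char
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);    // 6: 3 bytes per char
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length); // 4: 2 bytes per char
    }
}
```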
Processing might be done on a `ByteBuffer`, `CharBuffer`, or `IntBuffer`.
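For instance, `Charset.encode`/`decode` convert directly between a `CharBuffer` and a `ByteBuffer`, with no intermediate `String` conversion:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class BufferDemo {
    public static void main(String[] args) {
        // Encode a CharBuffer into a UTF-8 ByteBuffer
        CharBuffer chars = CharBuffer.wrap("Gr\u00F6\u00DFe"); // "Größe"
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(chars);
        System.out.println(bytes.remaining()); // 7: ö and ß take 2 bytes each

        // ... and decode it back
        CharBuffer decoded = StandardCharsets.UTF_8.decode(bytes);
        System.out.println(decoded); // Größe
    }
}
```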
When dealing with Asian scripts, `int` code points are probably the most feasible.
Otherwise bytes seem best.
Code points (or `char`s) also make sense when the `Character` class is used for classification: Unicode blocks and scripts, digits in several scripts, emoji, whatever.
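For instance, `Character` works on `int` code points for digits across scripts as well as block and script lookups:

```java
public class ClassifyDemo {
    public static void main(String[] args) {
        int asciiFive = '5';
        int devanagariFive = 0x096B; // DEVANAGARI DIGIT FIVE
        System.out.println(Character.isDigit(asciiFive));        // true
        System.out.println(Character.isDigit(devanagariFive));   // true
        System.out.println(Character.digit(devanagariFive, 10)); // 5

        // Script and block lookups take int code points, including emoji
        int emoji = 0x1F600; // grinning face
        System.out.println(Character.UnicodeScript.of(devanagariFive)); // DEVANAGARI
        System.out.println(Character.UnicodeBlock.of(emoji));           // EMOTICONS
    }
}
```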
Performance-wise, bytes are usually best, as they are the most compact; probably UTF-8.
One cannot entirely avoid memory allocation. `getBytes` should be used with an explicit `Charset`; almost always some kind of conversion happens. Since newer Java versions (compact strings, Java 9+) can keep a `byte` array instead of a `char` array internally for an encoding like Latin-1 (ISO-8859-1), even relying on the internal `char` array would not do. And new arrays are created on every conversion.
What one can do is use fast `ByteBuffer`s.
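A sketch of both points: `getBytes` with an explicit `Charset`, and one reusable direct `ByteBuffer` fed by a `CharsetEncoder`, so that repeated encoding does not allocate a fresh array per call (buffer size and inputs are arbitrary):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class ReuseBufferDemo {
    public static void main(String[] args) {
        // Always name the Charset; never rely on the platform default
        byte[] utf8 = "text".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4

        // One direct buffer, reused across encode calls
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocateDirect(1024); // assumed big enough for each input

        for (String s : new String[] { "alpha", "beta", "gamma" }) {
            out.clear();
            encoder.reset();
            encoder.encode(CharBuffer.wrap(s), out, true);
            encoder.flush(out);
            out.flip();
            System.out.println(s + " -> " + out.remaining() + " bytes");
        }
    }
}
```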
Alternatively, for linguistic analysis, one can use databases, maybe graph databases. At least something that can exploit parallelism.
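Staying inside the JVM, a parallel code point stream already exploits parallelism; a minimal sketch (the sample text is arbitrary):

```java
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelDemo {
    public static void main(String[] args) {
        String text = "mixed text 123 \u0968\u096B \uD83D\uDE00"; // ASCII, Devanagari digits, emoji

        // Classify code points by Unicode script, in parallel
        Map<Character.UnicodeScript, Long> byScript = text.codePoints()
                .parallel()
                .filter(cp -> !Character.isWhitespace(cp))
                .boxed()
                .collect(Collectors.groupingBy(Character.UnicodeScript::of, Collectors.counting()));

        System.out.println(byScript); // e.g. {COMMON=4, LATIN=9, DEVANAGARI=2} (order varies)
    }
}
```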
`charAt` isn't that slow, as it directly returns the value from the `String`'s internal array. Memory-wise it is the most efficient, as it doesn't allocate a new `char[]` or `byte[]`, which is what happens with `toCharArray` or `getBytes`.
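To illustrate, scanning via `charAt` allocates nothing extra, while `toCharArray` and `getBytes` each copy the whole content:

```java
import java.nio.charset.StandardCharsets;

public class CharAtDemo {
    public static void main(String[] args) {
        String s = "hello world";

        // Reads straight out of the String's internal storage
        int spaces = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == ' ') {
                spaces++;
            }
        }
        System.out.println(spaces); // 1

        // Both of these allocate and fill a new array
        char[] chars = s.toCharArray();
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(chars.length + " " + bytes.length); // 11 11
    }
}
```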