
By default, Java stores strings in UTF-16, and my application is using a huge amount of memory. One suggestion we received is to convert from UTF-16 to UTF-8 so that some memory can be saved. Is this true?

If yes, can I convert it like this when I'm fetching it from the DB?

new String(rs.getBytes("MY_COLUMNNAME"), StandardCharsets.UTF_8);

I tried a sample program (found by googling) to check the memory, but I don't see any difference in the size. Am I going about this the right way? Any leads are appreciated. Below is the code snippet I tried:

import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;

public class TestClass {

    String value1;
    String value2;
    String value3;
    String value4;

    public TestClass(String x, String y, String z, String p) {
        this.value1 = x;
        this.value2 = y;
        this.value3 = z;
        this.value4 = p;
    }

    public static long estimateObjectSize(Object obj) {
        long size = 0;

        for (Field field : obj.getClass().getDeclaredFields()) {
            field.setAccessible(true);

            Class<?> type = field.getType();

            if (type.isPrimitive()) {
                size += primitiveSize(type);
            } else {
                size += referenceSize();
            }
        }

        size += objectHeaderSize();

        return size;
    }

    private static long primitiveSize(Class<?> type) {
        if (type == boolean.class || type == byte.class) {
            return 1;
        } else if (type == char.class || type == short.class) {
            return 2;
        } else if (type == int.class || type == float.class) {
            return 4;
        } else if (type == long.class || type == double.class) {
            return 8;
        } else {
            throw new IllegalArgumentException("Unsupported primitive type: " + type);
        }
    }

    private static long referenceSize() {
        return 8;
    }

    private static long objectHeaderSize() {
        return 16;
    }

    public static void main(String[] args) {
        // Case 1: strings decoded from UTF-8 bytes
        TestClass decoded = new TestClass(
                new String("ABC".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8),
                new String("XYZ".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8),
                new String("XYZ".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8),
                new String("XYZ".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
        // Case 2: plain string literals
        TestClass literals = new TestClass("A", "B", "C", "D");

        System.out.println("Estimated size (decoded): " + estimateObjectSize(decoded) + " bytes");
        System.out.println("Estimated size (literals): " + estimateObjectSize(literals) + " bytes");
    }
}
  • Does this answer your question? What is Java 9's new string implementation? Commented Mar 28, 2024 at 18:36
  • I have never known an application to be significantly helped by optimizing its strings. (In the 1990s, there was a project that tried to provide an alternate String class for this. It didn’t last long.) You are probably better off analyzing your program and figuring out how to keep fewer objects in program memory in the first place. Commented Mar 28, 2024 at 22:14
  • Your code to calculate the object size is pointless, as the characters are stored in an array referenced by the string object, not in the string object itself. Further, you’re adding the size of static fields to the object size and assuming uncommon implementation details. The most common configuration is 64-bit with compressed oops and class pointers, where the object header is 12 bytes and the reference size is 4 bytes. If you ever try your method on classes other than String, you should also account for the fields of the superclasses. And there’s no need for setAccessible(true) here. Commented Apr 11, 2024 at 17:40

2 Answers

1
new String(rs.getBytes("MY_COLUMNNAME"), StandardCharsets.UTF_8);

This will make absolutely no difference to how the string is stored in memory.

If you want to store something in UTF-8, you cannot use String at all, but must use byte[] or some other wrapper around a byte array.

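It is at least remotely possible to save memory by doing this, depending on the exact characteristics of the characters in your strings. Since Java 9, a String that contains only Latin-1 characters already uses one byte per character, so a UTF-8 byte array can only come out smaller if most of your strings contain at least one non-Latin-1 character while still consisting mostly of ASCII.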
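As a rough illustration of the "wrapper around a byte array" idea, here is a minimal sketch. The class name `Utf8Text` and its methods are hypothetical (nothing like this ships with the JDK), and the sample text is made up:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper, not a JDK class: keeps text as UTF-8 bytes instead of a String.
final class Utf8Text {
    private final byte[] utf8;

    // Assumes the caller really passes UTF-8 encoded bytes.
    Utf8Text(byte[] utf8Bytes) {
        this.utf8 = utf8Bytes.clone(); // defensive copy so the caller can't mutate our data
    }

    static Utf8Text of(String s) {
        return new Utf8Text(s.getBytes(StandardCharsets.UTF_8));
    }

    int byteLength() {
        return utf8.length;
    }

    // Decoding allocates a regular String again, so call this only when one is needed.
    @Override
    public String toString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Mostly ASCII plus one non-Latin-1 character (the euro sign).
        String mostlyAscii = "order-12345 \u20AC";
        Utf8Text text = Utf8Text.of(mostlyAscii);
        System.out.println("UTF-8 payload bytes:  " + text.byteLength());
        System.out.println("UTF-16 payload bytes: " + mostlyAscii.length() * 2);
        System.out.println("Round trip intact:    " + text.toString().equals(mostlyAscii));
    }
}
```

With JDBC you could probably build this from `rs.getBytes("MY_COLUMNNAME")` directly, but only if the driver hands back UTF-8 bytes for that column, which is an assumption to verify. Every API that expects a String will force a `toString()` call, which gives the saving right back.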


1

By Default Java stores strings in UTF-16, in my application it is using huge memory.

  • This is no longer correct. Since Java 9 (compact strings, JEP 254), Java stores 1 byte per char unless the string contains a character outside Latin-1 (ISO-8859-1). In the by now fairly distant past, it always stored 2 bytes per char. Now it only does that if your string can't be represented with 1 byte per char.
  • Even then, we're talking about at worst a ~2x factor: a humongous, otherwise-ASCII string with a single non-Latin-1 char in it would be smaller in UTF-8 than as a Java String, because that one character forces the whole String into UTF-16 (2 bytes per char), while the UTF-8 version spends extra bytes only on that one character. That's.. rare. And '2x' does not match 'huge'.
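To put numbers on that, here is a small sketch that estimates the character payload under the compact-strings rule (1 byte per char if everything fits in Latin-1, otherwise 2 bytes per char) and compares it with the UTF-8 length. The class and method names are made up, object and array headers are ignored, and it assumes Java 11+ for `String.repeat`:

```java
import java.nio.charset.StandardCharsets;

final class StringPayloadEstimate {
    // Assumed model of compact strings (Java 9+), payload only, headers ignored:
    // 1 byte per char when every char is <= 0xFF (Latin-1), otherwise 2 bytes per char.
    static int compactStringBytes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return s.length() * 2;
            }
        }
        return s.length();
    }

    // Number of bytes the same text needs when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static void report(String label, String s) {
        System.out.println(label + ": String payload ~" + compactStringBytes(s)
                + " bytes, UTF-8 " + utf8Bytes(s) + " bytes");
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(1000);
        report("pure ASCII", ascii);                        // same size either way
        report("ASCII + one euro sign", ascii + "\u20AC");  // the rare ~2x case
        report("Latin-1 accents", "\u00E9".repeat(1000));   // here UTF-8 is the larger one
    }
}
```

The third case is worth noticing: for text full of accented Latin-1 characters, UTF-8 needs two bytes each while the String needs one, so converting would cost memory rather than save it.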

new String(rs.getBytes("MY_COLUMNNAME"), StandardCharsets.UTF_8);

This would accomplish nothing. A String is a String. That charset encoding you specify merely tells the string constructor how to parse the bytes. Not how to store it - it stores it how it stores it, you can't affect it in any way.
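A quick way to convince yourself: decode the same text from two different byte encodings and compare the results. This is only a sketch with a made-up class name and sample value:

```java
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Encodes the text two different ways, decodes both, and checks that the
    // resulting Strings are equal to each other and to the original.
    static boolean sameAfterDecoding(String text) {
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asUtf16 = text.getBytes(StandardCharsets.UTF_16BE);

        // The charset argument only says how to interpret the incoming bytes.
        String fromUtf8 = new String(asUtf8, StandardCharsets.UTF_8);
        String fromUtf16 = new String(asUtf16, StandardCharsets.UTF_16BE);

        return fromUtf8.equals(fromUtf16) && fromUtf8.equals(text);
    }

    public static void main(String[] args) {
        String text = "MY_VALUE";
        System.out.println("UTF-8 input bytes:  " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16 input bytes: " + text.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println("Decoded strings equal: " + sameAfterDecoding(text));
    }
}
```

The input byte arrays differ in length, but once decoded you hold two equal String objects, and the JVM picks their internal representation without consulting the charset you passed in.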

I don't see any difference in the size

Because there isn't. Generally, if you think of one simple trick to massively improve performance, it's snake oil. If something that simple and obvious existed with virtually no downside, it would have been part of Java long ago.

Stop worrying about performance issues. You aren't smart enough to know how a change like this will affect things (JDK engineers have gone on record saying they aren't smart enough either, so pretty much nobody is, me included). Generally it won't matter, and code written 'so that it will be faster' often turns out slower.

The things to keep in mind:

  • Java works on optimizers, and those optimizers are just big pattern matchers. They find patterns of code and then substitute them with known really fast ways to do those known patterns. Therefore, write code like your average Java coder codes, because that is what the pattern matchers look for. Doing a thing that is commonly done in a certain way differently because your way 'feels faster' is hence often slower. If the Java community tends to do it in way X, do it in way X.

  • This isn't an exaggeration: On average, 99% of the CPU is spent running about 1% of the code. Optimizing anything but that 1% (called the 'hot path') is irrelevant. In fact, optimizing the hot path often involves modifying how data/code 'flows in' to the hot path and how it 'flows out', so giving yourself maximal room to modify code is important if you want to optimize the hot path, which is usually the only useful thing to do to make an app faster. Abstractions often 'feel slower', but if you 'optimize' by eliminating them, you make it harder to rearrange things into exactly the shape the hot path wants. Hence, optimizing code that is not on the hot path makes things slower.

  • Identifying the hot path is generally much more difficult than you'd think. So don't try with your eyeballs, you'll get it wrong. Get a profiler: a tool that will tell you, under real load, where most of the time is spent. Only after knowing the hot path should you begin optimizing it. One advantage of using a profiler is that you can rerun it after you make a modification to check that your code is actually faster, because very often it is not, or is even slower. Mere mortals simply cannot intuit this.

  • These rules can be skipped if algorithmic complexity is the issue. A constant-factor difference (such as UTF-16 vs. UTF-8, which even in the optimal case cannot possibly be more than a 2x difference) is by definition not an increase in algorithmic complexity. (Big O notation, if you've ever heard of it, is what describes complexity.)
