3

I have a large text file (about 20 million lines) in which each line has the following format:

<string1>, <string2>

These strings may have leading or trailing whitespace, which I want to remove while reading the file.

I am currently using trim() for this, but since String in Java is immutable, trim() creates a new object for every trim operation. This wastes a lot of memory.

How can I do it better?

6
  • 3
    Please show how you are reading the file and then splitting the strings. Commented Feb 3, 2017 at 11:11
  • 1
    You do realize that any unused Strings are collected, so there's no real waste of memory, just new created objects (which are efficiently collected by the GC). Commented Feb 3, 2017 at 11:16
  • I am not quite sure, but I think using sed could solve the problem. Commented Feb 3, 2017 at 11:16
  • 1
    Show the code that you are using to read in the file; with almost complete certainty, trim() will turn out not to be the main memory bottleneck. Commented Feb 3, 2017 at 11:38
  • Split your String on the comma separator, then append each part to a StringBuilder. That way a new String is not created each time, as you described. Commented Feb 3, 2017 at 11:48

7 Answers

2

I would be surprised if the immutable String class is causing problems; the JVM is very efficient and the result of many years of engineering work.

That said, Java does provide a mutable class for manipulating strings called StringBuilder; see its documentation for details.

If you are working across threads, consider using StringBuffer.

0

You can read your string as a stream of characters, and record the start and end position of each token you want to parse.

This still creates an object per token, but if your tokens are relatively long, the two int fields your object will contain are much smaller than the corresponding string would be.

But before you embark on that journey, you should probably just make sure you don't keep your trimmed strings around longer than needed.
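The offset-recording idea can be sketched roughly as follows. This is only an illustration under assumptions: the names TokenSpans and Span are hypothetical, the input is assumed to already sit in a char[] buffer, and fields are assumed to be separated by commas as in the question. Note that the spans are only meaningful while the shared buffer they point into is kept alive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: records where each trimmed field sits inside a shared char[] buffer. */
final class TokenSpans {

    /** A field described only by its offsets into the shared buffer (two ints, no copy of the text). */
    static final class Span {
        final int start; // inclusive
        final int end;   // exclusive

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    /** Splits buf[from, to) on commas and returns the whitespace-trimmed span of each field. */
    static List<Span> split(char[] buf, int from, int to) {
        List<Span> spans = new ArrayList<>();
        int fieldStart = from;
        for (int i = from; i <= to; i++) {
            // A field ends at a comma or at the end of the range.
            if (i == to || buf[i] == ',') {
                spans.add(trim(buf, fieldStart, i));
                fieldStart = i + 1;
            }
        }
        return spans;
    }

    /** Narrows [start, end) until neither edge is whitespace; no characters are copied. */
    static Span trim(char[] buf, int start, int end) {
        while (start < end && Character.isWhitespace(buf[start])) {
            start++;
        }
        while (start < end && Character.isWhitespace(buf[end - 1])) {
            end--;
        }
        return new Span(start, end);
    }
}
```

A String is only materialized when one is actually needed, for example with new String(buf, span.start, span.length()).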

0

Assuming you have a String containing <string1>, <string2>, and you just want to split it without maybe trimming the parts:

String trimmedBetween(String str, int start, int end) {
  // Skip leading whitespace.
  while (start < end && Character.isWhitespace(str.charAt(start))) {
    ++start;
  }

  // Skip trailing whitespace.
  while (start < end && Character.isWhitespace(str.charAt(end - 1))) {
    --end;
  }

  return str.substring(start, end);
}

(Note this is basically how String.trim() is implemented, just with start and end instead of 0 and length)

Then call like:

int commaPos = str.indexOf(',');
String firstString = trimmedBetween(str, 0, commaPos);
String secondString = trimmedBetween(str, commaPos + 1, str.length());

3 Comments

I do want to trim the parts i.e. the individual strings.
Why would I ever want to use this trim instead of the default one? The goal was to avoid memory waste, but you use the same extra memory (= you return a new string) as the built-in trim()
Because String.trim() only trims from the beginning and end of the string. To use that you have to split the string (creates an array, and two strings), then trim them (up to two more strings). This approach creates exactly two Strings, instead of 4 Strings and an array.
0

As you already noticed, Strings are immutable. So the solution is to not use String, but rather something that is mutable. StringBuffer is a suitable class.

However, StringBuffer does not include a trim method, so you can use something like:

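void trim(StringBuffer sb) {
    // Count leading whitespace, then remove it with a single delete.
    int start = 0;
    while (start < sb.length() && Character.isWhitespace(sb.charAt(start))) {
        start++;
    }
    sb.delete(0, start); // the end index of delete() is exclusive

    // Count trailing whitespace from the back, then remove it.
    int end = 0;
    while (end < sb.length() && Character.isWhitespace(sb.charAt(sb.length() - 1 - end))) {
        end++;
    }
    sb.delete(sb.length() - end, sb.length());
}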

0

If you want to avoid String then you have to handle it yourself using char and StringBuilder, like this:

import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test {
    public static void main(String... args) throws Exception {
        InputStreamReader in = new InputStreamReader(new FileInputStream("<testfile>"), "UTF-8");

        char[] buffer = new char[32768];
        int read = -1;
        int index;
        StringBuilder content = new StringBuilder();
        while ((read = in.read(buffer)) > -1) {
            content.append(buffer, 0, read);
            index = 0;
            while (index > -1) {
                index = content.indexOf("\n");
                if (index > -1) {
                    char[] temp = new char[index];
                    content.getChars(0, index, temp, 0);
                    handleLine(temp);
                    content.replace(0, index + 1, "");
                }
            }
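        }

        // Handle a final line that is not terminated by a newline.
        if (content.length() > 0) {
            char[] temp = new char[content.length()];
            content.getChars(0, content.length(), temp, 0);
            handleLine(temp);
        }

        in.close();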
    }

    private static void handleLine(char[] line) {
        StringBuilder content = new StringBuilder().append(line);
        int start = 0;
        int end = content.length();
        if (end > 0) {
            while (Character.isWhitespace(content.charAt(start))) {
                start++;
                if (end <= start) {
                    break;
                }
            }
            if (start < end) {
                while (Character.isWhitespace(content.charAt(end - 1))) {
                    end--;
                    if (end <= start) {
                        break;
                    }
                }
            }
        }

        System.out.println("***" + content.subSequence(start, end) + "***");
    }
}

0

We could handle this with a regex.

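   {
    // Requires: import java.util.Arrays;
    String str = "abcd, efgh";
    // \s* on both sides of the comma absorbs any amount of whitespace around it;
    // trim() removes whitespace at the very start and end of the line.
    String[] result = str.trim().split("\\s*,\\s*");
    Arrays.asList(result).forEach(s -> System.out.println(s));
   }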
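If the regex route is used on a file of this size, it may help to compile the pattern once, because String.split with a multi-character regex compiles a new Pattern on every call. The following is only a sketch: the class name CommaSplitter is hypothetical, and it assumes each line holds exactly two fields separated by the first comma.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that reuses one compiled pattern for every line. */
final class CommaSplitter {

    // A comma together with any whitespace directly around it.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    /**
     * Splits a line of the form "<string1>, <string2>" into its two trimmed fields.
     * The limit of 2 means only the first comma acts as a separator.
     */
    static String[] split(String line) {
        // trim() handles whitespace at the outer edges; the pattern handles the middle.
        return COMMA.split(line.trim(), 2);
    }
}
```

This still allocates an array and new strings per line, so it addresses convenience and regex compilation cost rather than the allocation concern raised in the question.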

-1

I think you can write the result data directly to a new file.

String originStr = "   xxxxyyyy";
for (int i = 0; i < originStr.length(); i++) {
    if (' ' == originStr.charAt(i)) {
        continue;
    }
    NewFileOutPutStream.write(originStr.charAt(i));
}

2 Comments

If you are using a multi-threaded model, you can split the file into a few logical chunks, and the method above will still work well on each chunk.
Writing a single char at a time will take forever. You need to buffer it.
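A buffered version of the write-to-a-new-file idea might look like the sketch below. It is an assumption-laden illustration, not the answerer's code: the class name TrimToFile and its method names are hypothetical, each line is assumed to contain two fields separated by the first comma, and, unlike the loop in the answer, it removes only leading and trailing whitespace of each field rather than every space.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

/** Hypothetical helper: copies lines from in to out, trimming both comma-separated fields. */
final class TrimToFile {

    static void trimFields(Reader in, Writer out) throws IOException {
        // Buffering avoids paying I/O overhead for every single character.
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                // No separator found: treat the whole line as one field.
                writeTrimmed(writer, line, 0, line.length());
            } else {
                writeTrimmed(writer, line, 0, comma);
                writer.write(',');
                writeTrimmed(writer, line, comma + 1, line.length());
            }
            writer.write('\n');
        }
        writer.flush();
    }

    /** Writes s[start, end) without its leading and trailing whitespace; creates no new String. */
    private static void writeTrimmed(Writer w, String s, int start, int end) throws IOException {
        while (start < end && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        while (start < end && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        w.write(s, start, end - start);
    }
}
```

For real files, the Reader and Writer could come from Files.newBufferedReader and Files.newBufferedWriter in java.nio.file, with both closed by the caller, for example in a try-with-resources block.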
