6

I have a 250 GB .txt file and only 50 GB of free space left on my hard drive. Every line in this file has a long prefix, and I want to delete that prefix to make the file smaller.

At first I wanted to read the file line by line, change each line, and write it into another file.

// read line out of first file
line = line.replace(prefix, "");
// write line into second file

The problem is that I do not have enough space for that.

So how can I delete all the prefixes from my file?

9
  • 1
    Just so you know, it would be line = line.replace(prefix, "");. Commented Jan 15, 2014 at 9:24
  • I know, but thank you. That is not my problem ;) Commented Jan 15, 2014 at 9:26
  • 1
    Yeah, I know; that's why I upvoted :) Commented Jan 15, 2014 at 9:28
  • How large is the text file compressed? I ask, because you could create a ZIP file and save the new file into it. Then remove the old file and unpack the ZIP when done? Commented Jan 15, 2014 at 9:31
  • 1
    Do you absolutely have to do it only in Java? Commented Jan 15, 2014 at 9:32

4 Answers 4

9

Check RandomAccessFile: http://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html

You have to keep track of the position you are reading from and the position you are writing to. Initially both are at the start. Then you read N bytes (one line), shorten it, seek back N bytes and write M bytes (the shortened line). Then you seek forward (N - M) bytes to get back to the position where next line starts. Then you do this over and over again. In the end truncate excess with setLength(long).

You can also do it in batches (like read 4kb, process, write, repeat) to make it more efficient.
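The read position / write position idea, done in batches, could be sketched roughly as below. This is only an illustration under some assumptions: the class name InPlacePrefixStripper, the method stripPrefix and the 64 KB buffer size are made up for the example, the prefix is assumed to be given in the same byte encoding as the file, and lines are assumed to end with a single-byte '\n' (true for ASCII, ISO-Latin-1 and UTF-8, not for UTF-16). Lines that do not start with the prefix are copied unchanged.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class InPlacePrefixStripper {

    /**
     * Removes prefix from the start of every line and compacts the file in place.
     * Returns the new file length in bytes.
     */
    static long stripPrefix(String path, byte[] prefix) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            byte[] in = new byte[64 * 1024];                  // batch size (arbitrary choice)
            byte[] out = new byte[in.length + prefix.length]; // room for a carried-over partial match
            long readPos = 0;        // where the next batch is read from
            long writePos = 0;       // where the next output goes; never ahead of readPos
            boolean inPrefix = true; // true while still at the start of a line
            int matched = 0;         // prefix bytes matched so far on the current line

            while (true) {
                file.seek(readPos);
                int n = file.read(in);
                if (n <= 0) {
                    break; // end of file
                }
                readPos += n;

                int outLen = 0;
                for (int i = 0; i < n; i++) {
                    byte b = in[i];
                    if (inPrefix) {
                        if (matched < prefix.length && b == prefix[matched]) {
                            matched++;
                            if (matched == prefix.length) {
                                inPrefix = false; // whole prefix seen and dropped
                            }
                            continue;
                        }
                        // Mismatch: the line does not start with the prefix,
                        // so put back the bytes that were skipped so far.
                        System.arraycopy(prefix, 0, out, outLen, matched);
                        outLen += matched;
                        inPrefix = false;
                    }
                    out[outLen++] = b;
                    if (b == '\n') {
                        inPrefix = true; // next byte starts a new line
                        matched = 0;
                    }
                }

                file.seek(writePos);
                file.write(out, 0, outLen);
                writePos += outLen;
            }

            // The file ended in the middle of a partial prefix match: keep those bytes.
            if (inPrefix && matched > 0) {
                file.seek(writePos);
                file.write(prefix, 0, matched);
                writePos += matched;
            }

            file.setLength(writePos); // cut off the leftover tail
            return writePos;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: InPlacePrefixStripper <file> <prefix>");
            return;
        }
        long newLength = stripPrefix(args[0], args[1].getBytes(StandardCharsets.UTF_8));
        System.out.println("new length: " + newLength);
    }
}
```

Because every output batch is at most as long as the input consumed so far, the write position never overtakes the read position, so unread data is never overwritten. As with any in-place rewrite, a crash halfway leaves the file half converted, so try it on a small copy first.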

The process is identical in all languages. Some make it easier by hiding the seeking back and forth behind an API.

Of course you have to be absolutely sure that your program works flawlessly, since there is no way to undo this process.

Also, the RandomAccessFile is a bit limited, since it cannot tell you at which position the file is at a given moment. Therefore you have to convert between "decoded strings" and "encoded bytes" as you go. If your file is in UTF-8, a given character in the string can take one or more bytes in the file. So you can't just do seek(string.length()). You have to use seek(string.getBytes(encoding).length) and factor in possible line break conversions (Windows uses two characters for a line break, Unix uses only one). But if you have ASCII, ISO-Latin-1 or a similar trivial character encoding and know which line break characters the file uses, then the problem should be pretty simple.
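To illustrate why string.length() is the wrong offset, here is a tiny sketch comparing the character count with the encoded byte count. The class name ByteLengthDemo and the helper encodedLength are invented for the example, and the sample string is written with Unicode escapes so that it compiles regardless of the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int encodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "h", e with acute accent, "llo", euro sign
        String line = "h\u00e9llo\u20ac";
        System.out.println("chars: " + line.length());
        System.out.println("bytes: " + encodedLength(line));
    }
}
```

The accented letter takes two bytes and the euro sign three, so the byte count is larger than the character count, and seeking by character count would land in the wrong place.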

And as I edit my answer to cover all the corner cases, I think it would be better to read the file using a BufferedReader with the correct character encoding and to open a RandomAccessFile for the writing, provided your OS allows the same file to be opened twice. This way you get complete Unicode support from BufferedReader and you don't have to keep track of the read and write positions. You have to do the writing with RandomAccessFile because opening a Writer on the file may just truncate it (I haven't tried it, though).

Something like this. It works on trivial examples but it has no error checking and I absolutely give no guarantees. Test it on a smaller file first.

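import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;

public static void main(String[] args) throws IOException {
    File f = new File(args[0]);
    // The reader and the writer are opened on the same file.
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(f), "UTF-8")); // Use correct encoding here.
    RandomAccessFile writer = new RandomAccessFile(f, "rw");

    String line = null;
    long totalWritten = 0;
    while ((line = reader.readLine()) != null) {
        // Each rewritten line must not be longer than the original,
        // otherwise the writer would overtake the reader.
        line = line.trim() + "\n"; // Remove your prefix here.

        byte[] b = line.getBytes("UTF-8");
        writer.write(b);
        totalWritten += b.length;
    }

    reader.close();

    // Cut off the leftover bytes after the last rewritten line.
    writer.setLength(totalWritten);
    writer.close();
}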

1 Comment

Actually you can get and set the position of RandomAccessFile via getChannel().position(...) but it still doesn't remove the lack of unicode support...
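As a small illustration of the position query mentioned in this comment, the sketch below reads a few bytes and then asks for the current offset in two ways: through getChannel().position() and through RandomAccessFile's own getFilePointer(). The class name PositionDemo and the helper positionsAfterRead are made up for the example, and the temporary file is only there to make it self-contained. Both calls report a byte offset, so the character versus byte issue is unchanged.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class PositionDemo {

    // Reads bytesToRead bytes, then returns the offset reported by
    // getFilePointer() and by getChannel().position().
    static long[] positionsAfterRead(String path, int bytesToRead) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.readFully(new byte[bytesToRead]);
            return new long[] { file.getFilePointer(), file.getChannel().position() };
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("position-demo", ".txt");
        Files.write(tmp, new byte[10]); // ten zero bytes
        long[] pos = positionsAfterRead(tmp.toString(), 4);
        System.out.println(pos[0] + " " + pos[1]);
        Files.delete(tmp);
    }
}
```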
0

You can use RandomAccessFile. That allows you to overwrite parts of the file. And since there is no copy- or caching-mechanism mentioned in the javadoc this should work without additional disk-space.

So you could overwrite the unwanted parts with spaces.

0

Split the 250 GB file into 5 files of 50 GB each. Then process each file and delete it. This way you will always have 50 GB left on your machine and you will still be able to process the whole 250 GB file.

1 Comment

How do you do that with only 50GB available? This just seems to move the problem.
-1

Since it does not have to be done in Java, I would recommend Python for this:

Save the following as replace.py in the same folder as your text file:

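import fileinput
import sys

# With inplace=True, anything written to stdout goes into the file.
for line in fileinput.input("your-file.txt", inplace=True):
    # line already ends with a newline, so write it without adding another one.
    sys.stdout.write(line.replace("oldstring", "newstring"))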

Replace the two strings with your own and run python replace.py.

3 Comments

I'm no python expert, but doesn't that move the original file to a backup and write to a new file with the original name, requiring 2x space (which OP doesn't have)? How else would it handle the case where the modified line is longer than the original one?
Yes, but he said he has no space for this file on his disk, so basically he needs to do this task on another machine, or am I missing something?
Yes. Random access files. :)
