Java File Replace Lines

Question

I have a 250 GB .txt file and only 50 GB of free space left on my hard drive. Every line in this file has a long prefix, and I want to delete that prefix to make the file smaller.

My first idea was to read the file line by line, change each line and write it to another file.

// read line out of first file
line = line.replace(prefix, "");
// write line into second file

The problem is that I don't have enough space for that.

So how can I delete all the prefixes from my file?

Just so you know, it would be line = line.replace(prefix, "");
– christopher, Commented Jan 15, 2014 at 9:24
How large is the text file when compressed? I ask because you could create a ZIP file and save the new file into it, then remove the old file and unpack the ZIP when done.
– Eric, Commented Jan 15, 2014 at 9:31

Torben · Accepted Answer · 2014-01-15 10:23:41Z

Check RandomAccessFile: http://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html

You have to keep track of the position you are reading from and the position you are writing to. Initially both are at the start of the file. Then you read N bytes (one line), shorten it, seek to the write position and write M bytes (the shortened line). Then you seek back to the read position, where the next line starts. For the first line that means seeking back N bytes, writing M bytes and seeking forward (N - M) bytes; after that the gap between the two positions grows with every line. Repeat this until the end of the file, then truncate the excess with setLength(long).

You can also do it in batches (read 4 KB, process, write, repeat) to make it more efficient.
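The read/write bookkeeping described above could be sketched roughly like this, working in 4 KB batches. This is only a sketch under assumptions that are not in the original answer: every line starts with a prefix of the same fixed byte length, lines end with '\n', and the class and method names (InPlacePrefixStripper, strip) are made up for illustration. It drops the first prefixLen bytes of each line without checking what they contain.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class InPlacePrefixStripper {
    /**
     * Drops the first prefixLen bytes of every '\n'-terminated line, in place,
     * and returns the new file length. Assumes a fixed-length prefix (hypothetical).
     */
    static long strip(String path, int prefixLen) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            byte[] buf = new byte[4096];
            long length = file.length();
            long readPos = 0;   // where the next unread byte is
            long writePos = 0;  // where the next kept byte goes (never ahead of readPos)
            int toSkip = prefixLen; // prefix bytes of the current line still to drop

            while (readPos < length) {
                file.seek(readPos);
                int n = file.read(buf, 0, (int) Math.min(buf.length, length - readPos));
                if (n <= 0) {
                    break;
                }
                readPos += n;

                // Compact the kept bytes to the front of the same buffer.
                int kept = 0;
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {
                        buf[kept++] = b;
                        toSkip = prefixLen; // the next line starts with a fresh prefix
                    } else if (toSkip > 0) {
                        toSkip--;
                    } else {
                        buf[kept++] = b;
                    }
                }

                file.seek(writePos);
                file.write(buf, 0, kept);
                writePos += kept;
            }

            file.setLength(writePos); // truncate the leftover tail
            return writePos;
        }
    }
}
```

Because the kept bytes are never more than the bytes read, the write position cannot overtake the read position. As with the rest of this answer, try it on a small copy first.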

The process is identical in all languages. Some make it easier by hiding the seeking back and forth behind an API.

Of course you have to be absolutely sure that your program works flawlessly, since there is no way to undo this process.

Also, RandomAccessFile is a bit limited, since it cannot tell you which position in the file you are at at a given moment. Therefore you have to convert between "decoded strings" and "encoded bytes" as you go. If your file is in UTF-8, a given character in the string can take one or more bytes in the file. So you can't just do seek(string.length()). You have to use seek(string.getBytes(encoding).length) and factor in possible line break conversions (Windows uses two characters for a line break, Unix uses only one). But if you have ASCII, ISO-Latin-1 or a similarly trivial character encoding and know which line break characters the file uses, the problem should be pretty simple.
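To illustrate the difference between string length and encoded length, here is a small sketch. The class name EncodedLength and the sample string are made up for illustration; the non-ASCII characters are written as Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLength {
    /** Number of bytes the string occupies on disk when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String line = "na\u00efve caf\u00e9"; // two characters outside ASCII
        System.out.println(line.length());    // 10 chars
        System.out.println(utf8Bytes(line));  // 12 bytes: the two accented letters take 2 bytes each
        System.out.println(utf8Bytes("\r\n")); // 2 bytes for a Windows line break
        System.out.println(utf8Bytes("\n"));   // 1 byte for a Unix line break
    }
}
```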

And as I edit my answer to cover all possible corner cases, I think it would be better to read the file using a BufferedReader with the correct character encoding and to open a RandomAccessFile for the writing, provided your OS supports having a file opened twice. This way you get complete Unicode support from BufferedReader and you don't have to keep track of read and write positions. You have to do the writing with RandomAccessFile because opening a Writer on the file may just truncate it (I haven't tried it, though).

Something like this. It works on trivial examples but it has no error checking and I absolutely give no guarantees. Test it on a smaller file first.

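import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class ShrinkInPlace {
    public static void main(String[] args) throws IOException {
        File f = new File(args[0]);
        // Reader and writer are opened on the same file.
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(f), StandardCharsets.UTF_8)); // Use correct encoding here.
        RandomAccessFile writer = new RandomAccessFile(f, "rw");

        String line = null;
        long totalWritten = 0;
        while ((line = reader.readLine()) != null) {
            // The output must never grow past what has been read,
            // or the writer overwrites data the reader has not seen yet.
            line = line.trim() + "\n"; // Remove your prefix here.

            byte[] b = line.getBytes(StandardCharsets.UTF_8);
            writer.write(b);
            totalWritten += b.length;
        }

        reader.close();

        // Cut off the leftover tail of the original content.
        writer.setLength(totalWritten);
        writer.close();
    }
}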
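For the "Remove your prefix here" step, something like the following might do. Note that line.replace(prefix, "") from the question removes every occurrence of the prefix text, not only the one at the start of the line. The class and method names here (PrefixUtil, stripPrefix) are made up for illustration.

```java
final class PrefixUtil {
    /** Removes prefix from the start of line only; lines without it are returned unchanged. */
    static String stripPrefix(String line, String prefix) {
        return line.startsWith(prefix) ? line.substring(prefix.length()) : line;
    }

    public static void main(String[] args) {
        String line = "abc:abc:x";
        System.out.println(stripPrefix(line, "abc:")); // abc:x
        System.out.println(line.replace("abc:", ""));  // x (every occurrence is removed)
    }
}
```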

Actually, you can get and set the position of a RandomAccessFile via getFilePointer() and seek(long), or via getChannel().position(...), but that still doesn't make up for its lack of Unicode support...
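A small sketch of reading the position back, assuming a plain ASCII file with '\n' line breaks. The class and method names (PositionDemo, positions) are made up for illustration. RandomAccessFile.readLine() turns each byte into one char, which is why it is not suitable for multi-byte encodings such as UTF-8.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class PositionDemo {
    /**
     * Reads one line, then returns three values: the file pointer after that line,
     * the channel position at the same moment, and the file pointer after seek(0).
     */
    static long[] positions(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.readLine();
            long afterLine = raf.getFilePointer();
            long viaChannel = raf.getChannel().position(); // same offset as the file pointer
            raf.seek(0);
            return new long[] { afterLine, viaChannel, raf.getFilePointer() };
        }
    }
}
```

For a file containing "abc\ndef\n", this returns 4, 4 and 0.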

treeno · Accepted Answer · 2014-01-15 09:42:04Z


You can use RandomAccessFile. It allows you to overwrite parts of the file, and since the Javadoc mentions no copying or caching mechanism, this should work without additional disk space.

So you could overwrite the unwanted parts with spaces.
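Overwriting with spaces might look roughly like this. It is a sketch under assumptions not stated in the answer: a prefix of fixed byte length at the start of every line, '\n' line breaks, and made-up names (PrefixBlanker, blank).

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class PrefixBlanker {
    /** Overwrites the first prefixLen bytes of every '\n'-terminated line with spaces. */
    static void blank(String path, int prefixLen) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            byte[] buf = new byte[4096];
            long pos = 0;             // start of the chunk currently in buf
            int toBlank = prefixLen;  // prefix bytes of the current line still to overwrite
            int n;
            while ((n = file.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        toBlank = prefixLen; // next line starts with a fresh prefix
                    } else if (toBlank > 0) {
                        buf[i] = ' ';
                        toBlank--;
                    }
                }
                file.seek(pos);        // go back and write the chunk over itself
                file.write(buf, 0, n); // leaves the pointer at the next chunk
                pos += n;
            }
        }
    }
}
```

Note that the file keeps its original size with this approach; the prefixes are only blanked out, not removed.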


user16083509 · Accepted Answer · 2021-12-13 12:40:13Z


Split the 250 GB file into 5 files of 50 GB each. Then process each file and delete it afterwards. This way you will always have 50 GB free on your machine and will still be able to process the whole 250 GB file.


1 Comment

user8681 Over a year ago

How do you do that with only 50GB available? This just seems to move the problem.

Community · Accepted Answer · 2017-05-23 11:45:30Z


Since it does not have to be done in Java, I would recommend Python for this:

Save the following as replace.py in the same folder as your text file:

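import fileinput

# inplace=True redirects print() output into the file being read.
for line in fileinput.input("your-file.txt", inplace=True):
    # line still ends with its newline, so suppress print's own.
    print(line.replace("oldstring", "newstring"), end="")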

Replace the two strings with your own and run python replace.py.

answered Jan 15, 2014 at 9:41 by Martin Seeler, edited May 23, 2017 at 11:45 by Community Bot

3 Comments

Torben Over a year ago

I'm no Python expert, but doesn't that move the original file to a backup and write to a new file with the original name, requiring 2x the space (which the OP doesn't have)? How else would it handle the case where the modified line is longer than the original one?

Martin Seeler Over a year ago

Yes, but he said he has no space for this file on his disk, so basically he needs to do this task on another machine, or am I missing something?

Torben Over a year ago

Yes. Random access files. :)
