var fileList = Directory.GetFiles("./", "split*.dat");
int fileCount = fileList.Length;
int i = 0;
foreach (string path in fileList)
{
    string[] contents = File.ReadAllLines(path); // OutOfMemoryException
    Array.Sort(contents);
    string newpath = path.Replace("split", "sorted");
    File.WriteAllLines(newpath, contents);
    File.Delete(path);
    contents = null;
    GC.Collect();

    SortChunksProgressChanged(this, (double)i / fileCount);
    i++;
}

And for a file that consists of ~20-30 big lines (each line ~20 MB) I get an OutOfMemoryException when I call the ReadAllLines method. Why is this exception raised? And how do I fix it? P.S. I use Mono on macOS.

  • Don't use ReadAllLines then; read it in one line at a time? Although a 20 MB line is crazy :/ Commented Mar 7, 2011 at 12:28
  • I must implement an external sort and show my teacher that it works for files bigger than my RAM. And I think that for a file that consists of only 200-300 lines, each 20 MB, the sorting algorithm will be faster. Commented Mar 7, 2011 at 12:33
  • What sort of data is it? Would your sort be sufficient if you only read the first x characters of every line and sorted on those? Commented Mar 7, 2011 at 12:45
  • 1
    20-30 or 200-300 lines per file? Commented Mar 7, 2011 at 13:07
  • 2
    Well, we are here to tell you yuo are on a dead track. The file has been purposely designed so that this does not work ;) Commented Mar 7, 2011 at 14:14

3 Answers


You should always be very careful about performing operations with potentially unbounded results: in your case, reading a file. As you mention, the file size and/or line length is unbounded.

The answer lies in reading 'enough' of each line to sort, then skipping characters until the next line and reading the next 'enough'. You probably want to build a line-index lookup, so that when you hit an ambiguous sort order you can go back and get more data from the line (Seek to that file position). When you go back, you only need to read the next sortable chunk to disambiguate the conflicting lines.
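The idea can be sketched roughly like this (a minimal illustration, not the full external sort; LineIndexEntry, PrefixIndexer and the fixed prefixLength are hypothetical names I made up, and the sketch assumes a single-byte encoding and '\n' line endings):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class LineIndexEntry
{
    public long Offset;   // file position where the line starts
    public string Prefix; // first few characters, used as the sort key
}

static class PrefixIndexer
{
    // Scan the file once, recording each line's offset and its first
    // `prefixLength` characters. Only the prefixes live in memory, not
    // the 20 MB lines themselves.
    public static List<LineIndexEntry> BuildIndex(string path, int prefixLength)
    {
        var index = new List<LineIndexEntry>();
        using (var stream = File.OpenRead(path))
        {
            long lineStart = 0, pos = 0;
            var prefix = new StringBuilder();
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n')
                {
                    index.Add(new LineIndexEntry { Offset = lineStart, Prefix = prefix.ToString() });
                    prefix.Clear();
                    lineStart = pos;
                }
                else if (prefix.Length < prefixLength && b != '\r')
                {
                    prefix.Append((char)b);
                }
            }
            if (prefix.Length > 0) // last line without a trailing newline
                index.Add(new LineIndexEntry { Offset = lineStart, Prefix = prefix.ToString() });
        }
        index.Sort((x, y) => string.CompareOrdinal(x.Prefix, y.Prefix));
        return index;
    }
}
```

When two prefixes compare equal, you would Seek to Offset + prefixLength on each line and read further chunks until the tie is broken.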

You may need to think about the file encoding; don't go straight to bytes unless you know it is one byte per char.

The built-in sort is not as fast as you'd like.

Side Note:

  1. If you call GC.* you've probably done something wrong
  2. Setting contents = null does not help you
  3. If you are using a foreach and maintaining the index, then you may be better off with a for (int i...) loop for readability
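The side notes applied to the question's loop look roughly like this (ChunkSorter and the progress callback are illustrative stand-ins for the question's class and SortChunksProgressChanged event; ReadAllLines is kept only to mirror the original and remains unsuitable for 20 MB lines):

```csharp
using System;
using System.IO;

static class ChunkSorter
{
    // A plain for loop instead of foreach plus a manual counter,
    // and no GC.Collect or null-ing of locals.
    public static void SortChunks(string[] fileList, Action<double> reportProgress)
    {
        for (int i = 0; i < fileList.Length; i++)
        {
            string path = fileList[i];
            string[] contents = File.ReadAllLines(path); // still fails for huge lines
            Array.Sort(contents);
            File.WriteAllLines(path.Replace("split", "sorted"), contents);
            File.Delete(path);
            reportProgress((double)i / fileList.Length);
        }
    }
}
```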



Okay, let me give you a hint to help you with your homework. Loading the complete file into memory will not work, as you know, because that is ruled out by a precondition of the assignment. You need to find a way to lazily load the data from disk as you go and throw away as much data as possible as soon as you can. Because single lines could be too big, you will have to do this one char at a time.

Try creating a class that represents an abstraction over a line, for instance by wrapping the starting index and ending index of that line. When you let this class implement IComparable<T>, it allows you to compare that line with other lines. Again, the trick is to be able to read characters from the file one at a time. You will need to work with Streams (File.Open) directly.

When you do this, you will be able to write your application code like this:

List<FileLine> lines = GetLines("fileToSort.dat");

lines.Sort();

foreach (var line in lines)
{
    line.AppendToFile("sortedFile.dat");
}

Your task will be to implement GetLines(string path) and create the FileLine class.

Note that I assume that the actual number of lines will be small enough that the List<FileLine> will fit into memory (which means an approximate maximum of 40,000,000 lines). If the number of lines can be higher, you would need an even more flexible approach, but since you are talking about 20 to 30 lines, this should not be a problem.
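A rough sketch of what FileLine and GetLines might look like, under the assumptions above (byte-wise ordinal comparison, '\n' line endings; LineScanner is an illustrative name and this comparison logic is one possible choice, not the only one):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Each instance stores only the start and end offsets of a line;
// comparison streams bytes from disk instead of holding the line in memory.
class FileLine : IComparable<FileLine>
{
    private readonly string path;
    private readonly long start;
    private readonly long end; // exclusive: points at the newline

    public FileLine(string path, long start, long end)
    {
        this.path = path;
        this.start = start;
        this.end = end;
    }

    public int CompareTo(FileLine other)
    {
        using (var a = File.OpenRead(path))
        using (var b = File.OpenRead(other.path))
        {
            a.Seek(start, SeekOrigin.Begin);
            b.Seek(other.start, SeekOrigin.Begin);
            long lenA = end - start, lenB = other.end - other.start;
            for (long i = 0; i < Math.Min(lenA, lenB); i++)
            {
                int ca = a.ReadByte(), cb = b.ReadByte();
                if (ca != cb) return ca - cb;
            }
            return lenA.CompareTo(lenB); // the shorter line sorts first on ties
        }
    }

    public void AppendToFile(string target)
    {
        using (var src = File.OpenRead(path))
        using (var dst = new FileStream(target, FileMode.Append))
        {
            src.Seek(start, SeekOrigin.Begin);
            var buffer = new byte[8192];
            long remaining = end - start;
            while (remaining > 0)
            {
                int read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                dst.Write(buffer, 0, read);
                remaining -= read;
            }
            dst.WriteByte((byte)'\n');
        }
    }
}

static class LineScanner
{
    // Scan once, recording only the offsets of each line.
    public static List<FileLine> GetLines(string path)
    {
        var lines = new List<FileLine>();
        using (var stream = File.OpenRead(path))
        {
            long lineStart = 0, pos = 0;
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                pos++;
                if (b == '\n')
                {
                    lines.Add(new FileLine(path, lineStart, pos - 1));
                    lineStart = pos;
                }
            }
            if (pos > lineStart) // last line without a trailing newline
                lines.Add(new FileLine(path, lineStart, pos));
        }
        return lines;
    }
}
```

Reopening the file on every comparison keeps memory use minimal but is slow; a real implementation would share a stream between comparisons.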



Basically, your approach is flawed. You are violating a constraint of the homework you were given, and that constraint has been put there to make you think more.

As you said:

I must implement external sort and show my teacher that it works for files bigger than my RAM

OK, so how do you think you will ever read the file in? ;) This is there on purpose. ReadAllLines does NOT implement an incremental external sort. As a result, it blows up.

2 Comments

You are right, but the way you phrase your answer is a clear troll-bait.
@skolima: That's how TomTom rolls :-)
