In this code, building the string x throws an OutOfMemoryException. Is there another way to parse all the files without running out of memory? I can't see anything wrong with the code I have tried.

Someone suggested making the program read file by file, rather than reading all the files and putting everything into one string x.

IEnumerable<string> textLines = Directory.GetFiles(@"C:\Users\karansha\Desktop\Unique_Express\", "*.*")
    .Select(filePath => File.ReadLines(filePath))
    .SelectMany(line => line);

string x = string.Join(",", textLines);
List<string> users = new List<string>();
Regex regex = new Regex(@"User:\s*(?<username>.*?)\s");
MatchCollection matches = regex.Matches(x);
foreach (Match match in matches)
{
    var user = match.Groups["username"].Value;
    if (!users.Contains(user)) users.Add(user);
}
int numberOfUsers = users.Count(name => name.Length < 15); 
Console.WriteLine("Unique_Users_Express=" + numberOfUsers);
  • How many files are we talking here? Also, are the files huge? Commented Mar 12, 2013 at 13:22
  • Yes, the files are huge, around 500 MB. Commented Mar 12, 2013 at 13:23
  • Regardless of the size of the files, I would also suggest processing one file after the other... Commented Mar 12, 2013 at 13:23
  • Is that a WinForms app or a web app? Also, can you explain the purpose of this code? Why do you need to read all lines and combine them into one string? Can't you run your match on one text line at a time, so maybe loop through each text line and perform the match check? Commented Mar 12, 2013 at 13:24
  • @ArnoSaxena How can I do this? Can you help me with the code? Thanks. Commented Mar 12, 2013 at 13:24

2 Answers

It seems odd that you would wish to join all the lines of each file together. Assuming usernames don't cross lines, you can do this in a single LINQ query in a much cleaner fashion:

var regex = new Regex(@"User:\s(?<username>[^\s]+)");
var path = @"C:\Users\karansha\Desktop\Unique_Express\";
var users = Directory.GetFiles(path, "*.*")
                     .Select(file => File.ReadLines(file))
                     .SelectMany(lines => lines)
                     .SelectMany(line => regex.Matches(line).Cast<Match>())
                     .Select(match => match.Groups["username"].Value)
                     .Distinct()
                     .ToList();

int numberOfUsers = users.Count(name => name.Length < 15); 
Console.WriteLine("Unique_Users_Express=" + numberOfUsers);

Hopefully each line of the query should be clear. This will process a single line at a time - and so long as you don't have so many users that the simple list of distinct usernames doesn't fit into memory, you should be fine. If you only need the count, you don't even need the call to ToList.

Note that I've adjusted the regular expression after a bit of experimenting - I hope that's okay for you.
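
If only the count is needed, the query can end in Count instead of ToList. The following is a minimal sketch under that assumption, not a definitive implementation. The class UserCounter and its two methods are hypothetical names introduced for illustration, and Directory.EnumerateFiles is swapped in for Directory.GetFiles so that file names are also enumerated lazily.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class UserCounter
{
    // Same adjusted pattern as in the answer above.
    private static readonly Regex UserRegex = new Regex(@"User:\s(?<username>[^\s]+)");

    // Counts the distinct usernames shorter than maxLength in a sequence of lines.
    // Only the set of distinct names is held in memory, never the lines themselves.
    public static int CountDistinctShortUsers(IEnumerable<string> lines, int maxLength)
    {
        return lines
            .SelectMany(line => UserRegex.Matches(line).Cast<Match>())
            .Select(match => match.Groups["username"].Value)
            .Distinct()
            .Count(name => name.Length < maxLength);
    }

    // Streams the lines of every file in a directory, one line at a time.
    public static IEnumerable<string> ReadAllFilesLazily(string directory)
    {
        return Directory.EnumerateFiles(directory, "*.*")
                        .SelectMany(file => File.ReadLines(file));
    }
}
```

It could be called as UserCounter.CountDistinctShortUsers(UserCounter.ReadAllFilesLazily(path), 15), where path is the directory from the question.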

6 Comments

Casting MatchCollection to Match is like selecting the first item?
@Baboon: No, it's just transforming MatchCollection which implements the non-generic IEnumerable to an IEnumerable<Match> by casting each element within it.
So even though File.ReadLines provides an enumerable for the entire file, it's only going to read one line into memory at a time? Would that also be true if you issued that in a foreach?
@MichaelPerrenoud: Yes, exactly. That's the difference between File.ReadLines and File.ReadAllLines.
Oh, so File.ReadLines would then be issuing a yield on every line as you move through the list? You're awesome Jon, thanks a lot, I never come in contact with you and not learn something!
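
The "yield on every line" behaviour discussed in these comments can be illustrated with a hand-written iterator. This is a simplified sketch of how a lazy line reader could work, not the framework's actual implementation of File.ReadLines; the names LazyLines and ReadLinesLazily are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration of a lazy line reader.
public static class LazyLines
{
    // Yields one line at a time; the next line is read from disk only when
    // the caller asks for it (for example, on the next foreach iteration).
    public static IEnumerable<string> ReadLinesLazily(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
```

File.ReadAllLines, by contrast, returns a string[] that already holds every line of the file.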
Try this. Assuming usernames don't span multiple lines, you can parse each line individually and build up the list of unique usernames. I have not changed your code as such, just its logic.

        IEnumerable<string> textLines = Directory.GetFiles(@"C:\Users\karansha\Desktop\Unique_Express\", "*.*")
                                                 .Select(filePath => File.ReadLines(filePath))
                                                 .SelectMany(line => line);

        List<string> users = new List<string>();

        textLines.ToList().ForEach(textLine =>
        {
            Regex regex = new Regex(@"User:\s*(?<username>.*?)\s");
            MatchCollection matches = regex.Matches(textLine);
            foreach (Match match in matches)
            {
                var user = match.Groups["username"].Value;
                if (!users.Contains(user)) users.Add(user);
            }
        });

        int numberOfUsers = users.Count(name => name.Length < 15);
        Console.WriteLine("Unique_Users_Express=" + numberOfUsers);

5 Comments

That still pulls all the lines of text into a single list to start with. Why bother calling ToList and ForEach, when you could just use foreach (var textLine in textLines) and get streaming?
@JonSkeet The OP said that he is getting the OutOfMemoryException at string x =..., which is where he joins all lines from all files together. I do agree with your later point about using a traditional foreach instead of calling ToList().
Yes, the OP is originally failing because of an even worse way of processing the data - but that's no reason to use ToList.
@JonSkeet like I said Jon, I do agree on your statement NOT to use the ToList() :)
I'm not sure what you disagree with in my original comment then. Note that the OP hadn't already tried to fetch all the data before the Join call, so it's entirely possible that just ToList will fail.
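
The streaming foreach suggested in these comments might look like the sketch below. It is one possible shape, not the answerer's code: StreamingUserCollector and CollectUsers are hypothetical names, a HashSet<string> replaces the List<string> plus Contains check, and the adjusted pattern from the other answer is used as an assumption about the log format.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class illustrating the foreach-based approach.
public static class StreamingUserCollector
{
    // Walks the lines one at a time and collects the unique usernames.
    // No ToList call: the input sequence is never materialised.
    public static HashSet<string> CollectUsers(IEnumerable<string> textLines)
    {
        // Assumption: the adjusted pattern from the other answer fits the data.
        var regex = new Regex(@"User:\s(?<username>[^\s]+)");
        var users = new HashSet<string>();

        foreach (var textLine in textLines)
        {
            foreach (Match match in regex.Matches(textLine))
            {
                // HashSet.Add ignores duplicates, so no Contains check is needed.
                users.Add(match.Groups["username"].Value);
            }
        }

        return users;
    }
}
```

Passing the textLines sequence from the answer above into CollectUsers keeps memory use proportional to the number of unique usernames rather than to the total size of the files.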
