
I wrote a C# console application to read a text file and find the most common word in it. My application works for small text files but not large ones. A friend told me about algorithm complexity analysis, something I knew a little about in the past and am still learning. Now I'm trying to figure out what's wrong with my code and why it only works for small text files. Here's my code pasted below:

static void Main(string[] args)
        {
            String line = "";
            String word = "";
            int count = 0;
            int maxCount = 0;
            string[] string1 = new string[0];
            ArrayList words = new ArrayList();

            var path = ConfigurationSettings.AppSettings["inputFilePath"];

            if(path == null)
            {
                throw new Exception();
            }

            var watch = System.Diagnostics.Stopwatch.StartNew();

            try
            {
                //opens text file read mode
                using(StreamReader file = new StreamReader(path))
                {
                    while((line = file.ReadLine()) != null)
                    {
                        //separates and adds each word into an array
                        string1 = line.ToLower().Split(new char[] { ',', '.', ' ' }, StringSplitOptions.RemoveEmptyEntries);

                        //adds the words from the array into a list
                        foreach(String s in string1)
                        {
                            words.Add(s);
                        }
                    }

                    //looks for the most common word in the list
                    for(int i = 0; i < words.Count; i++)
                    {
                        count = 1;

                        //counts each word and stores value to count variable
                        for(int j = i + 1; j < words.Count; j++)
                        {
                            if(words[i].Equals(words[j]))
                            {
                                count++;
                            }
                        }

                        //if count > maxCount, count value stored in maxCount, and corresponding word to word variable
                        if(count > maxCount)
                        {
                            maxCount = count;
                            word = (String)words[i];
                        }
                    }

                    Console.WriteLine("The most common word is " + word + ", with " + maxCount + " occurrences.");
                    file.Close();
                }
            }
            catch(Exception e)
            {
                Console.WriteLine("There was an error opening the file: " + e.Message);
            }

            watch.Stop();
            var elapseMs = watch.ElapsedMilliseconds;

            Console.WriteLine("File processed in " + elapseMs + " milliseconds");
        }
  • This might be better suited to Code Review since it's working code. I would use Dictionary<string, int> to store the words. Commented Feb 5, 2020 at 9:13
  • "but not large text files" This is not a technical description of a problem. You need to go back to first principles and debug your code. Commented Feb 5, 2020 at 9:13
  • Algorithm complexity has little to do with why a program works or not. What's the problem? What does "why does it only work for small text files" mean? I can definitely see inefficient collection usage and variables with too-wide a scope (what's string1 doing up there?), but is that the real problem? Commented Feb 5, 2020 at 9:13
  • What do you mean it does not work for large files - what exactly goes wrong ? Commented Feb 5, 2020 at 9:14
  • For example, string1 is only used inside the first loop, and yet it's declared at the top, which gives it method-wide scope. Its initial value is never used either. This could be a simple var strings = line.Split(...); Commented Feb 5, 2020 at 9:17

1 Answer


We tend to talk about algorithm complexity with reference to Big O Notation.

Your algorithm is O(n²), which is to say that as the number of words in the file increases, the running time grows with the square of that number. Why? Simply because within the loop over the words, you loop through the words again.

Your algorithm can be made much, much better.

In most programming languages, looking up a key in a hash table (a Dictionary in C#) is close to O(1), which is to say it doesn't matter whether there are one or a thousand entries, it takes (roughly) the same amount of time to look one up by key. This is your clue! If you count the words as you see them on the first pass of the file, there is no need for the double loop. All you do is:

  • Loop once through the words
  • If the word exists in the dictionary, increment its value by 1
  • If the word doesn't exist in the dictionary, add it with the word as the key and 1 as the value
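A minimal sketch of those three steps (the sample words here are placeholders standing in for the words split out of the file):

```csharp
using System;
using System.Collections.Generic;

class WordCountSketch
{
    static void Main()
    {
        // Hypothetical input standing in for the words read from the file.
        string[] words = { "the", "cat", "sat", "on", "the", "mat", "the" };

        var counts = new Dictionary<string, int>();
        foreach (string s in words)
        {
            if (counts.ContainsKey(s))
                counts[s]++;      // seen before: increment its count
            else
                counts[s] = 1;    // first occurrence: add with count 1
        }

        Console.WriteLine(counts["the"]); // 3
    }
}
```

Note this is a single pass over the words: each iteration does one O(1) dictionary lookup instead of rescanning the whole list.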

In this way the algorithm is only O(n), which is to say the running time grows linearly with the number of words, not with its square as in your current implementation.

The most common word is then simply the key with the highest value. This requires you to scan through the entire dictionary, but that again is only O(n), so at worst your algorithm becomes O(2n), which simplifies to O(n) and is still a vast improvement on the original.
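That final scan might look like this (the `counts` dictionary here is a hypothetical example of what the first pass could produce):

```csharp
using System;
using System.Collections.Generic;

class MostCommonWord
{
    static void Main()
    {
        // Hypothetical counts as they might look after the first pass.
        var counts = new Dictionary<string, int>
        {
            ["the"] = 3, ["cat"] = 1, ["sat"] = 1
        };

        // One linear scan over the dictionary to find the key with the highest value.
        string word = "";
        int maxCount = 0;
        foreach (var pair in counts)
        {
            if (pair.Value > maxCount)
            {
                maxCount = pair.Value;
                word = pair.Key;
            }
        }

        Console.WriteLine("The most common word is " + word + ", with " + maxCount + " occurrences.");
    }
}
```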


1 Comment

Thank you for the clue. My application now works. I used the Dictionary Class to find the most common string in my list.
