9

I have to count the number of unique words in a text document using Java. First I had to strip the punctuation from all of the words. I used the Scanner class to scan each word in the document and put it in a String ArrayList.

So, the next step is where I'm having the problem! How do I create a method that can count the number of unique Strings in the array?

For example, if the array contains apple, bob, apple, jim, bob; the number of unique values in this array is 3.


public void countWords() {
    try {
        Scanner scan = new Scanner(in);
        while (scan.hasNext()) {
            String words = scan.next();
            // Strings are immutable: replace() returns a new String, so reassign it.
            if (words.contains(".")) {
                words = words.replace(".", "");
            }
            if (words.contains("!")) {
                words = words.replace("!", "");
            }
            if (words.contains(":")) {
                words = words.replace(":", "");
            }
            if (words.contains(",")) {
                words = words.replace(",", "");
            }
            if (words.contains("?")) {
                words = words.replace("?", "");
            }
            if (words.contains("-")) {
                words = words.replace("-", "");
            }
            if (words.contains("‘")) {
                words = words.replace("‘", "");
            }
            wordStore.add(words.toLowerCase());
        }
    } catch (FileNotFoundException e) {
        System.out.println("File Not Found");
    }
    System.out.println("The total number of words is: " + wordStore.size());
}
2
  • Are there any restrictions to what you can or can't use? Commented Oct 4, 2012 at 3:50
  • No, there are no restrictions! Commented Oct 4, 2012 at 3:51

9 Answers

25

Are you allowed to use Set? If so, a HashSet may solve your problem. A HashSet doesn't store duplicates.

Set<String> noDupSet = new HashSet<>();
noDupSet.add(yourString); // a string already in the set is ignored
noDupSet.size();          // number of distinct strings added

The size() method returns the number of unique words.
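As a minimal sketch of how this could fit with an existing list: the class name UniqueWordCount and the helper countUnique are made up for illustration, and wordStore is assumed to be the ArrayList<String> from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueWordCount {

    // Hypothetical helper: returns how many distinct strings the list holds.
    static int countUnique(List<String> words) {
        // Copying the list into a HashSet drops the duplicates.
        Set<String> unique = new HashSet<>(words);
        return unique.size();
    }

    public static void main(String[] args) {
        // Sample data taken from the question.
        List<String> wordStore = new ArrayList<>(
                Arrays.asList("apple", "bob", "apple", "jim", "bob"));
        System.out.println("Unique words: " + countUnique(wordStore));
    }
}
```

For the sample list from the question, countUnique returns 3.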

If you really have to use only ArrayList, then one way to achieve this is:

1) Create a temp ArrayList
2) Iterate original list and retrieve element
3) If tempArrayList doesn't contain element, add element to tempArrayList

3 Comments

Yes, I'm allowed to use HashSet. Can you please show me how to use HashSet?
I don't have to use ArrayList only; I can use anything that works. Can I instantiate a new HashSet and add all the String values from the ArrayList?
Yes, you can. Or you can add the elements directly to the Set, so that you don't even need the ArrayList.
19

Starting from Java 8 you can use Stream:

After you add the elements in your ArrayList:

long n = wordStore.stream().distinct().count();

It converts your ArrayList to a stream and then it counts only the distinct elements.
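A self-contained sketch of the same call, for anyone who wants to try it. The class name DistinctStreamCount and the helper countDistinct are illustrative only, and the sample data comes from the question.

```java
import java.util.Arrays;
import java.util.List;

class DistinctStreamCount {

    // Hypothetical helper wrapping the one-liner above.
    static long countDistinct(List<String> words) {
        // distinct() compares elements with equals(), so "Apple" and "apple"
        // count as different words unless they were lower-cased beforehand.
        return words.stream().distinct().count();
    }

    public static void main(String[] args) {
        List<String> wordStore = Arrays.asList("apple", "bob", "apple", "jim", "bob");
        System.out.println("Unique words: " + countDistinct(wordStore));
    }
}
```

Note that count() returns a long, not an int.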


3

I would advise using a HashSet. It automatically filters out duplicates when you call its add method.


2

Although I believe a set is the easiest solution, you can still use your original solution and just add an if statement to check if value already exists in the list before you do your add.

if (!wordStore.contains(words.toLowerCase()))
    wordStore.add(words.toLowerCase());

Then the number of words in your list is the total number of unique words (i.e., wordStore.size()).

2 Comments

Thanks for your help! Isn't HashSet more efficient, since it doesn't allow duplicate values by default?
Absolutely it should be. However, I wanted to give you an option that wouldn't cause you to change your existing code. Really, you were just missing an "if" statement.
1

This general-purpose solution takes advantage of the fact that the Set abstract data type does not allow duplicates. The Set.add() method is especially useful because it returns a boolean indicating whether the element was actually added. A HashMap tracks the number of occurrences of each element. The algorithm can be adapted to variations of this problem, and it runs in O(n) time.

public static void main(String args[])
{
  String[] strArray = {"abc", "def", "mno", "xyz", "pqr", "xyz", "def"};
  System.out.printf("RAW: %s ; PROCESSED: %s \n",Arrays.toString(strArray), duplicates(strArray).toString());
}

public static HashMap<String, Integer> duplicates(String arr[])
{

    HashSet<String> distinctKeySet = new HashSet<String>();
    HashMap<String, Integer> keyCountMap = new HashMap<String, Integer>();

    for(int i = 0; i < arr.length; i++)
    {
        if(distinctKeySet.add(arr[i]))
            keyCountMap.put(arr[i], 1); // unique value or first occurrence
        else
            keyCountMap.put(arr[i], (Integer)(keyCountMap.get(arr[i])) + 1);
    }     

    return keyCountMap; 
} 

RESULTS:

RAW: [abc, def, mno, xyz, pqr, xyz, def] ; PROCESSED: {pqr=1, abc=1, def=2, xyz=2, mno=1}

3 Comments

Are you actually quoting something? If you're not, don't use quote formatting. If you are quoting something, you need to properly attribute it.
This four-year-old question already has an answer using HashSet for O(1) performance. Your algorithm for counting occurrences of words in a String array does not answer the OP's question (you're not counting unique values in an ArrayList), nor does it improve on the current solution. Maybe you misunderstood the question?
Thanks for the feedback. I apologize for the confusion. I simply wanted to share a solution for counting distinct elements in an array that I thought was interesting/different, and could perhaps be useful to someone else in the future who may be researching solutions to a similar problem. I probably should have added the solution to a more appropriate thread.
0

You can use a Hashtable or HashMap as well. The keys would be your input strings, and each value would be the number of times that string occurs in the input. The number of keys is then the number of unique strings. This takes O(N) time and space.
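A rough sketch of the map-based idea, assuming Java 8 or later for Map.merge. The names WordFrequency and countOccurrences are invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {

    // Hypothetical helper: maps each word to the number of times it appears.
    static Map<String, Integer> countOccurrences(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // Stores 1 on first sight, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> wordStore = Arrays.asList("apple", "bob", "apple", "jim", "bob");
        Map<String, Integer> counts = countOccurrences(wordStore);
        // The number of keys equals the number of unique words.
        System.out.println("Unique words: " + counts.size());
        System.out.println("Occurrences of apple: " + counts.get("apple"));
    }
}
```

If only the unique count is needed, a plain HashSet does the same job with less bookkeeping; the map is worth it when the per-word frequencies matter too.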

Solution 2:

Sort the input list, so that equal strings end up next to each other. Then compare list(i) to list(i+1) and count the number of duplicates; the unique count is the list size minus that number.
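The sorting approach might look something like the sketch below. SortedUniqueCount and countUnique are made-up names, and the code assumes the list contains no null elements. It counts the unique values directly, which is equivalent to subtracting the duplicates from the size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortedUniqueCount {

    // Hypothetical helper: counts distinct strings by sorting a copy of the list.
    static int countUnique(List<String> words) {
        if (words.isEmpty()) {
            return 0;
        }
        // Sort a copy so the caller's list keeps its original order.
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted);

        int unique = 1; // the first element always starts a new run
        for (int i = 1; i < sorted.size(); i++) {
            // A value that differs from its predecessor starts a new run.
            if (!sorted.get(i).equals(sorted.get(i - 1))) {
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> wordStore = Arrays.asList("apple", "bob", "apple", "jim", "bob");
        System.out.println("Unique words: " + countUnique(wordStore));
    }
}
```

Sorting makes this O(N log N) in time, so it is slower than the hash-based options but needs no hash table.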


0

In shorthand, you can do it as follows:

    ArrayList<String> duplicateList = new ArrayList<String>();
    duplicateList.add("one");
    duplicateList.add("two");
    duplicateList.add("one");
    duplicateList.add("three");

    System.out.println(duplicateList); // prints [one, two, one, three]

    HashSet<String> uniqueSet = new HashSet<String>();

    uniqueSet.addAll(duplicateList);
    System.out.println(uniqueSet); // prints the unique values, e.g. [two, one, three] (HashSet order is not guaranteed)

    duplicateList.clear();
    System.out.println(duplicateList);// prints []


    duplicateList.addAll(uniqueSet);
    System.out.println(duplicateList); // prints the unique values, e.g. [two, one, three] (same order as the set)

2 Comments

Personally, I don't understand why I would use your shorthand method. I could just create a loop to add the String values to the HashSet; the HashSet doesn't allow duplicate values by default.
Here I have shown a way to extract the unique values of an ArrayList. I thought the shorthand method was handier to use, but it is up to you to select the best method... :)
0
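import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UniqueinArrayList {

    public static void main(String[] args) {
        List<String> al = new ArrayList<>();
        al.add("Stack");
        al.add("Stack");
        al.add("over");
        al.add("over");
        al.add("flow");
        al.add("flow");
        System.out.println(al); // prints [Stack, Stack, over, over, flow, flow]
        // LinkedHashSet drops duplicates while keeping insertion order.
        Set<String> s = new LinkedHashSet<>(al);
        System.out.println(s); // prints [Stack, over, flow]
        // Join the unique words with single spaces.
        StringBuilder sb = new StringBuilder();
        for (String word : s) {
            sb.append(word).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints Stack over flow
    }

}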


0

3 distinct possible solutions:

  1. Use HashSet as suggested above.

  2. Create a temporary ArrayList and store only the unique elements, as below:

    public static int getUniqueElement(List<String> data) {
        List<String> newList = new ArrayList<>();
        for (String eachWord : data) {
            if (!newList.contains(eachWord)) {
                newList.add(eachWord);
            }
        }
        return newList.size();
    }
    
  3. Java 8 solution

    long count = data.stream().distinct().count();
    

1 Comment

I strongly advise against method 2. It is very inefficient compared to methods 1 and 3, particularly as the size of the list becomes larger. Method 2 is O(n^2) versus methods 1 and 3 which are just O(n). This is because the call to newList.contains is O(n) and that call is itself within a loop which is also O(n), thus making the overall complexity O(n^2).
