2

How do I split a String to extract all the words/terms that occur in it and count how many times each occurs? For example, given String q = "foo bar foo", I want a data structure like {<foo,2>, <bar,1>}. This is the least verbose code I could come up with*. Are there any faults, or less verbose alternatives?

String[] split = q.toString().split("\\s");
Map<String, Integer> terms = new HashMap<String, Integer>();

for (String term : split) {
    if(terms.containsKey(term)){
        terms.put(term, terms.get(term)+1);
    }
}

(haven't compiled it)

1
  • 3
    You're close. Just add an else branch for the case where the term is not in the map yet. Commented Aug 29, 2011 at 8:46
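
For reference, here is a minimal sketch of what that comment suggests: the question's loop with an else branch added. The class and method names (ElseBranchCount, countTerms) are made up for illustration and are not from the original post.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper class, only here so the snippet compiles on its own.
class ElseBranchCount {

    // Counts how often each whitespace-separated term occurs in q.
    static Map<String, Integer> countTerms(String q) {
        Map<String, Integer> terms = new HashMap<String, Integer>();
        for (String term : q.split("\\s")) {
            if (terms.containsKey(term)) {
                // Term seen before: increment its count.
                terms.put(term, terms.get(term) + 1);
            } else {
                // First occurrence: without this branch the map stays empty.
                terms.put(term, 1);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The map should contain foo -> 2 and bar -> 1 (iteration order is not guaranteed).
        System.out.println(countTerms("foo bar foo"));
    }
}
```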

3 Answers

5

Modified code:

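// Split on runs of whitespace so consecutive spaces do not yield empty terms.
String[] split = q.split("\\s+");
Map<String, Integer> terms = new HashMap<String, Integer>();

for (String term : split) {
    // Start from the current count, or 0 if the term has not been seen yet.
    int score = 0;
    if (terms.containsKey(term)) {
        score = terms.get(term);
    }

    terms.put(term, score + 1);
}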

PS: Untested.
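
Since the question also asks for less verbose alternatives: on Java 8 or later, Map.merge can probably replace the containsKey/get/put sequence. The following is only a sketch under that assumption; the names MergeCount and countTerms are hypothetical, and splitting on "\\s+" after trim() is my own choice to avoid empty terms.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper class, only here so the snippet compiles on its own.
class MergeCount {

    // Counts how often each whitespace-separated term occurs in q.
    static Map<String, Integer> countTerms(String q) {
        Map<String, Integer> terms = new HashMap<>();
        for (String term : q.trim().split("\\s+")) {
            // A blank input still yields one empty string from split, so skip it.
            if (!term.isEmpty()) {
                // Inserts 1 for a new term, otherwise adds 1 to the existing count.
                terms.merge(term, 1, Integer::sum);
            }
        }
        return terms;
    }
}
```

Calling MergeCount.countTerms("foo bar foo") should give a map with foo mapped to 2 and bar mapped to 1.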


Comments

0

I would go with the code suggested by Elite Gentleman, but I'm putting this up as a discussion point: what about using StringTokenizer? If scalability or performance were an issue, would the tokenizer perform better? With it you only loop through the string once, as opposed to doing the regex split first and then traversing the resulting array.

Something like this:

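// s is the input string; StringTokenizer (java.util) splits on whitespace by default.
StringTokenizer st = new StringTokenizer(s);
Map<String, Integer> terms = new HashMap<String, Integer>();

while (st.hasMoreTokens()) {
    String term = st.nextToken();
    // Start from the current count, or 0 if the term has not been seen yet.
    int score = 0;
    if (terms.containsKey(term)) {
        score = terms.get(term);
    }

    terms.put(term, score + 1);
}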

I know that StringTokenizer, though not deprecated, is a legacy class according to the Javadoc, and its use is not recommended:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

However, I wonder whether, for simple whitespace-separated tokens like these, it performs better.

Any thoughts?
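
For comparison, the java.util.regex route mentioned in the Javadoc quote can also walk the string in a single pass without building an intermediate array. This is only a sketch: MatcherCount, TOKEN and countTerms are hypothetical names, Map.merge assumes Java 8 or later, and I make no claim about which variant is faster; that would have to be measured.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, only here so the snippet compiles on its own.
class MatcherCount {

    // One or more non-whitespace characters, compiled once and reused.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Counts how often each whitespace-separated term occurs in s.
    static Map<String, Integer> countTerms(String s) {
        Map<String, Integer> terms = new HashMap<>();
        Matcher m = TOKEN.matcher(s);
        // find() advances through the input, yielding one token per iteration.
        while (m.find()) {
            terms.merge(m.group(), 1, Integer::sum);
        }
        return terms;
    }
}
```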

Comments

0

Using Java 8 (note that this example counts characters, not words):

String name = "anandha";
name.chars()                    // returns IntStream
    .mapToObj(ch -> (char) ch)  // returns Stream<Character>
    .collect(Collectors.groupingBy(ch -> ch, Collectors.counting())) // returns Map<Character, Long>
    .forEach((k, v) -> System.out.println(k + " : " + v));

Output:

 a : 3
 d : 1
 h : 1
 n : 2
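
The snippet above counts characters, whereas the question asks about words. A sketch of the same groupingBy/counting idea applied to whitespace-separated terms might look like this; StreamWordCount and countTerms are hypothetical names, and the counts come back as Long rather than Integer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical wrapper class, only here so the snippet compiles on its own.
class StreamWordCount {

    // Groups equal terms together and counts the size of each group.
    static Map<String, Long> countTerms(String q) {
        return Arrays.stream(q.trim().split("\\s+"))
                .filter(term -> !term.isEmpty()) // blank input yields one empty string
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```

With q = "foo bar foo" this should produce a map with foo mapped to 2 and bar mapped to 1.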

Comments
