Java string comparison

Question

I am comparing substrings in two large text files. Very simple, tokenizing into two token containers, comparing with 2 for loops. Performance is disastrous! Does anybody have an advice or idea how to improve performance?

for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);

    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;

        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);

            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                        txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                        (tokenFileA + strLengthA - 1), 
                        (tokenFileB + strLengthB - 1));

                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}

Generally my code reading two large Strings with Java Tokonizer into containers A and B. And then trying to compare substrings.Possision of Substrgs which are existing in both strings to store into a Vector. But performance is awful, also don't really know how to solve it with HashMap.

Can you describe in words or with an example what your code does? — Mark Byers
– Mark Byers, Commented Sep 5, 2010 at 20:23

Community · Accepted Answer · 2017-05-23 12:19:48Z

7

Your main problem is that you go through all txtB for each token in txtA.

You should store informations on token from txtA (in a HashMap for instance) and then in a second loop (but not a nested one) you compare the strings with the existing one in the Map.

On the same topic :

edited May 23, 2017 at 12:19

CommunityBot

11 silver badge

answered Sep 5, 2010 at 20:29

Colin Hebert

93.5k15 gold badges164 silver badges154 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

jackdaniels Over a year ago

Thank you Colin HEBERT, "nested" -> "for(){ for(){} }, ""not nested" -> "for(){}, for(){}" right? Hashmap I am really afraid of.. never code coded it before. Since i know in HashMap I have to use HashSet and here redundant tokens become removed!? Ok, I don't need them, but I need their positions. Can you tell me pls, if I can store and retrieve token positions with HashMap?

Colin Hebert Over a year ago

It's exacly this for the nested/not nested. If you want to keep the positions, you can do this HashMap<String, List<Integer>> so you can have for each word a list of its position. Or better, instead of Integer your own structure with filename, position and other informations.

jackdaniels Over a year ago

Huh... think I implemented your suggestion Colin. But somehow unable to get Hashmap paramters.. Can you have a look pls, programming code is here <pastebin.com/wScB5RSy>

jackdaniels Over a year ago

Wow! Your code is just beatifull!! How many years programming expirience are behind this style!? It seems to be a HaschJoin, that "meriton" suggested me before. Right?

Colin Hebert Over a year ago

It is kind of like the "HashJoin" of @meriton. But I kept your code, so it doesn't remove punctuation and doesn't compare with lower case words.

|

meriton · Accepted Answer · 2010-09-05 20:29:25Z

2

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?

answered Sep 5, 2010 at 20:29

meriton

70.9k14 gold badges117 silver badges190 bronze badges

5 Comments

jackdaniels Over a year ago

Hello Meriton, Thank You for helping as well. Yes I did it, but don't wanna more. Performance was also ok with small strings, another reason with nested loops I was able to store strPositions(of same substrings) (almost) sorted in a Vector.

jackdaniels Over a year ago

Yes I got it already there is no way around Hashing and Mapping.. I have to lear it..:-( Can you tell me pls, how can I do a HashJoin in Java? Didn't find any java related example -especially HashJoin in google... And if do HashJoin how can I store subStr-positons, these are necessary to store.

jackdaniels Over a year ago

Please tell me also, why it is required to lowercase String? How can I create a "map from (lowercased) strText to t"? This sentence I didn’t really understood.. Thanks in advance. Please tell me also, why it is required to lowercase String? How can I create a "map from (lowercased) strText to t"? This sentence I didn’t really undertood..

James Over a year ago

lowercase is so that your program will consider "Token" and "token" to be the same word.

jackdaniels Over a year ago

Yes, I know what lowercase does. But what kind benefits I will get if using in (Hash)map related context? ( ... Interger!=integer... )

James · Accepted Answer · 2010-09-05 20:29:50Z

0

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.

answered Sep 5, 2010 at 20:29

James

9,3173 gold badges37 silver badges50 bronze badges

2 Comments

jackdaniels Over a year ago

Thanks James, which data structure would you suggest to use?

jackdaniels Over a year ago

More comments, with pleasure.. I am reading txt files with help of java tokenizer into a string and then trying to search substring of DocA in DocB. Doing this in 2 cases. 1st case substr-length is constat, 2nd case substr-length vary, for this reason i added "if (strLengthA >= dp.getMinStrLength())" to reduce iteretion for very short substrings.

roni bar yanai · Accepted Answer · 2010-09-05 21:06:31Z

0

A said, this is an issue of complexity and you're algorithm runs in O(n^2) instead of O(n) using hash.

For second order improvements try to call less to functions, for example you can get the size once

sizeB = txtB.TokenContainer.size();

Depeneds on the size, you may call the container once to get an array of strings to save the getStr....

Roni

answered Sep 5, 2010 at 21:06

roni bar yanai

1,48210 silver badges12 bronze badges

1 Comment

jackdaniels Over a year ago

Thanks Roni, I was not sure if calling functions will take some performance. But of course, especially "txtB.TokenContainer.size();" programm calls every n-times, absolutly unnecessary.

Collectives™ on Stack Overflow

Java string comparison

4 Answers 4

10 Comments

5 Comments

2 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

10 Comments

5 Comments

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related