2

I am comparing substrings in two large text files. Very simple, tokenizing into two token containers, comparing with 2 for loops. Performance is disastrous! Does anybody have an advice or idea how to improve performance?

for (int s = 0; s < txtA.TokenContainer.size(); s++) {
    String strTxtA = txtA.getSubStr(s);
    strLengthA = txtA.getNumToken(s);

    if (strLengthA >= dp.getMinStrLength()) {
        int tokenFileB = 1;

        for (int t = 0; t < txtB.TokenContainer.size(); t++) {
            String strTxtB = txtB.getSubStr(t);
            strLengthB = txtB.getNumToken(t);

            if (strTxtA.equalsIgnoreCase(strTxtB)) {
                try {
                    subStrTemp = new SubStrTemp(
                        txtA.ID, txtB.ID, tokenFileA, tokenFileB,
                        (tokenFileA + strLengthA - 1), 
                        (tokenFileB + strLengthB - 1));

                    if (subStrContainer.contains(subStrTemp) == false) {
                        subStrContainer.addElement(subStrTemp);
                    }
                } catch (Exception ex) {
                    logger.error("error");
                }
            }
            tokenFileB += strLengthB;
        }
        tokenFileA += strLengthA;
    }
}

Generally my code reading two large Strings with Java Tokonizer into containers A and B. And then trying to compare substrings.Possision of Substrgs which are existing in both strings to store into a Vector. But performance is awful, also don't really know how to solve it with HashMap.

1
  • 2
    Can you describe in words or with an example what your code does? Commented Sep 5, 2010 at 20:23

4 Answers 4

7

Your main problem is that you go through all txtB for each token in txtA.

You should store informations on token from txtA (in a HashMap for instance) and then in a second loop (but not a nested one) you compare the strings with the existing one in the Map.


On the same topic :

Sign up to request clarification or add additional context in comments.

10 Comments

Thank you Colin HEBERT, "nested" -> "for(){ for(){} }, ""not nested" -> "for(){}, for(){}" right? Hashmap I am really afraid of.. never code coded it before. Since i know in HashMap I have to use HashSet and here redundant tokens become removed!? Ok, I don't need them, but I need their positions. Can you tell me pls, if I can store and retrieve token positions with HashMap?
It's exacly this for the nested/not nested. If you want to keep the positions, you can do this HashMap<String, List<Integer>> so you can have for each word a list of its position. Or better, instead of Integer your own structure with filename, position and other informations.
Huh... think I implemented your suggestion Colin. But somehow unable to get Hashmap paramters.. Can you have a look pls, programming code is here <pastebin.com/wScB5RSy>
Wow! Your code is just beatifull!! How many years programming expirience are behind this style!? It seems to be a HaschJoin, that "meriton" suggested me before. Right?
It is kind of like the "HashJoin" of @meriton. But I kept your code, so it doesn't remove punctuation and doesn't compare with lower case words.
|
2

You are doing a join with nested loops? Yes, that is O(n^2). What about doing a hash join instead? That is, create a map from (lowercased) strText to t and do lookups with this map rather than iterating over the token container?

5 Comments

Hello Meriton, Thank You for helping as well. Yes I did it, but don't wanna more. Performance was also ok with small strings, another reason with nested loops I was able to store strPositions(of same substrings) (almost) sorted in a Vector.
Yes I got it already there is no way around Hashing and Mapping.. I have to lear it..:-( Can you tell me pls, how can I do a HashJoin in Java? Didn't find any java related example -especially HashJoin in google... And if do HashJoin how can I store subStr-positons, these are necessary to store.
Please tell me also, why it is required to lowercase String? How can I create a "map from (lowercased) strText to t"? This sentence I didn’t really understood.. Thanks in advance. Please tell me also, why it is required to lowercase String? How can I create a "map from (lowercased) strText to t"? This sentence I didn’t really undertood..
lowercase is so that your program will consider "Token" and "token" to be the same word.
Yes, I know what lowercase does. But what kind benefits I will get if using in (Hash)map related context? ( ... Interger!=integer... )
0

Put the tokens of fileA into a trie data structure. Then when tokenising fileB you can check quite quickly if these tokens are in the trie. A few code comments would help.

2 Comments

Thanks James, which data structure would you suggest to use?
More comments, with pleasure.. I am reading txt files with help of java tokenizer into a string and then trying to search substring of DocA in DocB. Doing this in 2 cases. 1st case substr-length is constat, 2nd case substr-length vary, for this reason i added "if (strLengthA >= dp.getMinStrLength())" to reduce iteretion for very short substrings.
0

A said, this is an issue of complexity and you're algorithm runs in O(n^2) instead of O(n) using hash.

For second order improvements try to call less to functions, for example you can get the size once

sizeB = txtB.TokenContainer.size();

Depeneds on the size, you may call the container once to get an array of strings to save the getStr....

Roni

1 Comment

Thanks Roni, I was not sure if calling functions will take some performance. But of course, especially "txtB.TokenContainer.size();" programm calls every n-times, absolutly unnecessary.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.