Low performance in non-sequential iteration over int Array in JAVA

Question

I have the following function:

public void scanText(char[] T){
    int q=0;
    for(int i=0;i<T.length;i++){
        q = transFunc[preCompRow[q]+T[i]];
        if(q==pattern.length){
            System.out.println("match found at position: "+(i-pattern.length+2));
        }
    }
}

This function scans a char Array searching for matches of a given pattern, which is stored as a finite automata. The transition function of the automata is stored in the variable called transFunc.

I am testing this function in a text with 8 millions of characters and using 800000 patterns. The thing is the accession of the array preCompRow[q] (which is an int[]) is very slow. The performance is greatly improved if I delete the preCompRow[q] of the code. I think this might be because in every loop the q variable has a different non-sequential value (2, 56, 18, 9 ..).

Is there any better way to access to an array in a non-sequential manner?

Thanks in advance!

I don't think that the java access to an array is the slow thing. ¿How is your ram? maybe you have all in ram and the OS is swapping a lot, so it goes to the disk to get your array position. Try it with a few characters, to see the performance without using a lot of ram. — leoxs
– leoxs, Commented Jun 11, 2017 at 1:37
@Andreas I searched 4000 different patterns over a char array with 8324701 characters and it took 143 seconds. Then I run the same test but I change transFunc[preCompRow[q]+T[i]] for transFunc[T[i]] and it took only 9 seconds. @leoxs I didn't see any swap activity during the execution of both tests. I am using 2 GB for the heap — AAA
– AAA, Commented Jun 11, 2017 at 2:00
I you really need performance, you could code that in C and use it as a module in your java project. If that can run in parallel, thinks about using threads, or better, doing that on GPU. — leoxs
– leoxs, Commented Jun 11, 2017 at 2:06
I also thought the bottleneck was in +, but I tested several cases transFunc[T[i]], transFunc[1+T[I]], transFunc[T[I]] and all had reasonable performance, but every time I add preCompRow[q] inside the index the whole thing goes wrong. When I fix q to some value, things goes really fast, for example transFunc[preCompRow[1]+T[I]]. My conclusion was the problem was the value of q which in every pass of the loop can take a value between 1 - # characters of the pattern — AAA
– AAA, Commented Jun 11, 2017 at 2:26

Stephen C · Accepted Answer · 2017-06-11 02:52:29Z

1

One possible explanation is that your code is seeing poor memory performance due to poor locality in its memory access patterns.

The role of the memory caches in a modern computer is to deal with the speed mismatch between processor instruction times (less than 1 ns) and main memory (5 to 10 ns or more). They work best when your code gets a cache hit most time it fetches from memory.

A modern Intel chipset caches memory in blocks of 64 bytes, and loads from main memory in burst mode. (That corresponds to 16 int values.) The L1 cache on (say) an I7 processor is 2MB.

If your application is able to access the data in a large array (roughly) sequentially, then 7 out of 8 accesses will be a cache hits. If the access pattern is non-sequential and the "working set" of is a large multiple of the cache size, then you may end up with a cache miss on each memory access.

If memory access locality is the root of yoiur problems, then your option are limited:

redesign your algorithm so that locality of memory references is better
buy hardware with larger caches
(maybe) redesign your algorithm to use GPUs or some other strategy to reduce the memory traffic

Recoding your existing in C or C++ may give a performance improvement, but the same memory locality problems will bite you there as well.

I am not aware of any tools that can be used to measure cache performance in Java applications.

edited Jun 11, 2017 at 2:52

answered Jun 11, 2017 at 2:47

Stephen C

723k95 gold badges849 silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Andreas Over a year ago

I think you're on the right track. Since T values are char, they are likely limited to ASCII, i.e. within a numeric range of 32 and 126. transFunc[T[i]] will then only access those indexes of transFunc, and that entire range (760 bytes in total) will quickly be in the L1 cache. --- For transFunc[preCompRow[q]+T[i]], we don't know range of preCompRow values, i.e. size of transFunc, but if that is large, it will take a lot longer to load it into L1 cache, and may even exceed size of L1 cache, greatly increasing cache miss ratio, even for the same index values.

AAA Over a year ago

Yes, I think the problem is related to cache. preComRow is just an int array of the size of the pattern. Rows are precomputed because I am using a 2D array for the transition function but expressed as a 1D array (I thought I was having problem with JAVA 2D arrays emulation), then an array of the form arr[i][j] is expressed as arr[(I*steps) + j ], I precomputed this to avoid the computation in the searching loop. The thing is I know the range of preCompRow.

AAA Over a year ago

I forgot to mention, the patterns I am searching for are very short, only 50 characters long

Stephen C Over a year ago

It would help us if you created a MCVE for your problem so that we can observe and then diagnose the "slow accession" behavior. Without a proper MCVE, I don't think it will be possible to do anything more than guess.

Collectives™ on Stack Overflow

Low performance in non-sequential iteration over int Array in JAVA

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related