
Preamble: I was watching the first lecture of this MIT course on YouTube, and around the 22:20 mark the instructor presents the results of a performance comparison of the same algorithm in Java and C. He says the C implementation is 2x faster than the Java implementation, so I tried it at home to see with my own eyes. The fact is that the Java implementation turned out to be faster than the C one, and I'm seeking help to understand why.

The problem implementations

I have these two implementations of the same algorithm:

C

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define n 1024
double A[n][n];
double B[n][n];
double C[n][n];

float tdiff(struct timeval *start,
            struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) +
           1e-6 * (end->tv_usec - start->tv_usec);
}

int main(int argc, const char *argv[])
{
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < n; ++j)
        {
            A[i][j] = (double)rand() / (double)RAND_MAX;
            B[i][j] = (double)rand() / (double)RAND_MAX;
            C[i][j] = 0;
        }
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);

    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < n; ++j)
        {
            for (int k = 0; k < n; ++k)
            {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    gettimeofday(&end, NULL);
    printf("%0.6f\n", tdiff(&start, &end));

    double d = 0;
    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < n; ++j)
        {
            d += C[i][j];
        }
    }
    printf("Fim %0.6f\n", d);
    return 0;
}

Java

import java.util.Random;

public class TesteMatrixMultiply {

    public static final int n = 1024;
    static double[][] A = new double[n][n];
    static double[][] B = new double[n][n];
    static double[][] C = new double[n][n];

    public static void main(String[] args)
    {
        Random r = new Random();

        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                A[i][j] = r.nextDouble();
                B[i][j] = r.nextDouble();
                C[i][j] = 0;
            }
        }

        long start = System.nanoTime();

        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                for (int k = 0; k < n; ++k)
                {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }

        long stop = System.nanoTime();
        double tdiff = (stop - start) * 1e-9;
        System.out.println(tdiff);

        double d = 0;
        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                d += C[i][j];
            }
        }

        System.out.printf("Fim %.6f\n", d);

    }

}

I ran the two implementations under the Linux perf tool so I could collect some info.
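The counters below come from perf stat's detailed mode, along the lines of (-d adds the L1-dcache and LLC counters; the exact flags may have differed):

perf stat -d ./teste-matrix-multiply
perf stat -d java TesteMatrixMultiply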

C execution

16.867058
Fim 268364458.846206

 Performance counter stats for './teste-matrix-multiply':

      16896,010532      task-clock (msec)         #    1,000 CPUs utilized          
                76      context-switches          #    0,004 K/sec                  
                 2      cpu-migrations            #    0,000 K/sec                  
             6.196      page-faults               #    0,367 K/sec                  
    38.943.668.123      cycles                    #    2,305 GHz                      (49,98%)
    27.050.004.884      instructions              #    0,69  insn per cycle           (62,51%)
     2.194.116.427      branches                  #  129,860 M/sec                    (62,51%)
         1.323.195      branch-misses             #    0,06% of all branches          (62,50%)
    11.889.108.593      L1-dcache-loads           #  703,664 M/sec                    (62,51%)
     1.089.638.099      L1-dcache-load-misses     #    9,17% of all L1-dcache hits    (62,52%)
     1.082.875.343      LLC-loads                 #   64,091 M/sec                    (49,99%)
       835.764.813      LLC-load-misses           #   77,18% of all LL-cache hits     (49,99%)

      16,898358467 seconds time elapsed

Java execution

6.551244991000001
Fim 268384388.815752

 Performance counter stats for 'java TesteMatrixMultiply':

       6754,506236      task-clock (msec)         #    1,013 CPUs utilized          
               581      context-switches          #    0,086 K/sec                  
                27      cpu-migrations            #    0,004 K/sec                  
            12.938      page-faults               #    0,002 M/sec                  
    23.531.901.353      cycles                    #    3,484 GHz                      (49,66%)
    19.009.664.164      instructions              #    0,81  insn per cycle           (62,09%)
     3.342.430.354      branches                  #  494,845 M/sec                    (62,07%)
         3.524.682      branch-misses             #    0,11% of all branches          (62,32%)
     7.752.842.493      L1-dcache-loads           # 1147,803 M/sec                    (62,73%)
     2.676.041.100      L1-dcache-load-misses     #   34,52% of all L1-dcache hits    (62,88%)
     1.067.905.566      LLC-loads                 #  158,103 M/sec                    (50,34%)
        55.152.268      LLC-load-misses           #    5,16% of all LL-cache hits     (50,00%)

       6,669662205 seconds time elapsed

Compilations

The Java implementation was compiled with:

library: zulu11.35.13-ca-jdk11.0.5-linux_x64
call: javac TesteMatrixMultiply.java

The C implementation was compiled with:

library: clang+llvm-10.0.0-x86_64-linux-gnu-ubuntu-18.04
call: clang  teste-matrix-multiply.c -o teste-matrix-multiply-clang

My CPU

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               142
Model name:          Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Stepping:            9
CPU MHz:             2100.157
CPU max MHz:         3500,0000
CPU min MHz:         400,0000
BogoMIPS:            5808.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

It's clear that the C execution suffers a lot from cache misses, but I can't understand why the Java execution doesn't have the same problem.

Update 1

I really didn't know about the -O3 option before. I recompiled with it and performance improved a lot, but the cache misses are still high.
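The recompilation call was essentially the same as before, with -O3 added:

call: clang -O3 teste-matrix-multiply.c -o teste-matrix-multiply-clang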

7.736069
Fim 268364458.846206

 Performance counter stats for './teste-matrix-multiply':

       7801,043965      task-clock (msec)         #    1,000 CPUs utilized          
                10      context-switches          #    0,001 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             6.199      page-faults               #    0,795 K/sec                  
    13.733.804.981      cycles                    #    1,761 GHz                      (49,96%)
     5.545.160.735      instructions              #    0,40  insn per cycle           (62,47%)
       578.468.520      branches                  #   74,153 M/sec                    (62,47%)
         1.231.327      branch-misses             #    0,21% of all branches          (62,47%)
     2.193.986.457      L1-dcache-loads           #  281,243 M/sec                    (62,53%)
     1.132.220.799      L1-dcache-load-misses     #   51,61% of all L1-dcache hits    (62,55%)
     1.158.935.692      LLC-loads                 #  148,562 M/sec                    (50,04%)
       683.217.415      LLC-load-misses           #   58,95% of all LL-cache hits     (49,98%)

       7,801911690 seconds time elapsed

It is still about 1 second behind Java, even without accounting for JVM warmup.

I know it depends on my specific hardware, but I'd like to be able to explain why I'm getting these numbers. What tools or diagnostic commands should I look at? Imagine it's not my notebook but a production server, and I need to understand what is happening. What should I do? Optimizing the code is not an option in this exercise.
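To make that concrete, I imagine commands along these lines, to see where the misses actually land in the code (I'm not sure this is the right approach, hence the question):

perf record -e LLC-load-misses ./teste-matrix-multiply
perf report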

Comments

  • What optimization flags are you using for the C version? Commented Jun 19, 2020 at 4:34
  • Are the int sizes the same number of bits for both programs? Commented Jun 19, 2020 at 5:38
  • As a side note, for performance-critical code, Java developers would consider using flat arrays of n*n size instead of two-dimensional arrays, due to the “array of array” nature of Java’s multi-dimensional arrays. Further, they’d load the array references into local variables instead of re-reading them from static variables on each access. And the innermost operation can be replaced by Math.fma(…) (see the sketch after this list). Commented Jun 19, 2020 at 6:51
  • @Shawn I updated the question to show how I compiled the code. I didn't know about the -O* options; I have just learned about them. -O3 really improved the performance. Commented Jun 20, 2020 at 1:59
  • The C version can also use fma(), but to make best use of it, you have to tell the compiler to use the relevant instructions with -march=native or -mfma (and it might be able to optimize the current code to use them if you do that). Commented Jun 20, 2020 at 2:09
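A minimal sketch of what the flat-array suggestion above might look like, keeping the same n and timing as the original; the class name and variable names are illustrative, not from the course code:

import java.util.Random;

public class TesteMatrixMultiplyFlat {

    public static final int n = 1024;

    public static void main(String[] args) {
        Random r = new Random();

        // Flat n*n arrays avoid Java's "array of arrays" indirection,
        // and the references live in locals rather than static fields.
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        double[] c = new double[n * n];

        for (int i = 0; i < n * n; ++i) {
            a[i] = r.nextDouble();
            b[i] = r.nextDouble();
        }

        long start = System.nanoTime();

        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int k = 0; k < n; ++k) {
                    // Math.fma (Java 9+) computes a*b + c with a single rounding.
                    sum = Math.fma(a[i * n + k], b[k * n + j], sum);
                }
                c[i * n + j] = sum;
            }
        }

        long stop = System.nanoTime();
        System.out.println((stop - start) * 1e-9);

        double d = 0;
        for (int i = 0; i < n * n; ++i) {
            d += c[i];
        }
        System.out.printf("Fim %.6f\n", d);
    }
}

Compiled and run the same way as the original (javac, then java), it should produce the same kind of output, only with a different data layout.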

1 Answer


I'm unable to reproduce your results. On my system, I get results similar to the video.

Are you sure you're compiling the C code with optimization (e.g. -O3)?


Java output:

17.902893012
Fim 268642903.459048

Without optimization, the C output (from gcc) is:

22.350832
Fim 268364458.846206

The C code output (with -O3):

5.077404
Fim 268364458.846206

However, clang on my system produces slower code than gcc here (roughly 2x slower with -O3)...

Without optimization (from clang):

19.541418
Fim 268364458.846206

With -O3 (from clang):

10.505201
Fim 268364458.846206

My system is Linux, Fedora 29 (uname -a):

Linux myhost 5.3.11-100.fc29.x86_64 #1 SMP Tue Nov 12 20:41:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

gcc version is:

gcc (GCC) 8.3.1 20190223 (Red Hat 8.3.1-2)

clang version is:

clang version 7.0.1 (Fedora 7.0.1-6.fc29)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

javac version is:

javac 1.8.0_232

java version is:

openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

I'm on a 4-core Linux machine with hyperthreading, so 8 logical cores. Here is the first part of my /proc/cpuinfo:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
stepping    : 5
microcode   : 0x1d
cpu MHz     : 2786.179
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 5307.00
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

2 Comments

  • Note that the OP is using JDK 11, which is faster than JDK 8.
  • I updated the question with the -O3 option. Thanks for your tests, they are useful, but the objective of my question is to learn how to analyze the executions on a specific machine, not to produce a benchmark useful for everyone.
