21

Why is this Python NumPy code,

import numpy as np
import time

k_max = 40000
N = 10000

data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)

t1 = time.time()
for k in xrange(1,k_max+1):
    cos_k = np.cos(k*data[0,:])
    sin_k = np.sin(k*data[0,:])
    coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
    coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()

print('Time:')
print(t2-t1)

faster than the following C++ code?

#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>

using namespace std;

// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;

int main()
{
    time_t start, stop;
    double diff;
    // table with data
    double data1[ N ];
    double data2[ N ];
    // table of results
    double coefs1[ k_max ];
    double coefs2[ k_max ];
    // main loop
    time( & start );
    for( unsigned int j = 1; j<N; j++ )
    {
        for( unsigned int i = 0; i<k_max; i++ )
        {
            coefs1[ i ] += data2[ j-1 ]*(cos((i+1)*data1[ j-1 ]) - cos((i+1)*data1[ j ]));
            coefs2[ i ] += data2[ j-1 ]*(sin((i+1)*data1[ j-1 ]) - sin((i+1)*data1[ j ]));
        }
    }
    // end of main loop
    time( & stop );
    // speed result
    diff = difftime( stop, start );
    cout << "Time: " << diff << " seconds";
    return 0;
}

The first one shows: "Time: 8 seconds" while the second: "Time: 11 seconds"

I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?

Version 2 of the code

I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.

Version 3 of the code

I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).

14
  • 2
    Probably because you're using an array of pointers to dynamically allocated arrays, instead of a single block of memory. Commented Dec 28, 2016 at 16:17
  • 4
    I get Time: 1 sekund on my machine. Did you run the C++ code in debug mode by any chance? Commented Dec 28, 2016 at 16:25
  • 2
    @Mateusz in your numpy example, all of the data are 0's I believe (I assume that is what .zeros does). What happens if you set all of your data in the C++ app to 0 first? Commented Dec 28, 2016 at 16:38
  • 3
    @BoPersson I also run this code (c++) and result was also 1 s, where python code had the same time as stated - so debug mode is probably the case. Commented Dec 28, 2016 at 17:07
  • 2
    @Gwidryj I don't have python installed on my machine, so, to test it, I chose ideone. I can understand why python version got time limit exceeded (given the timeframes mentioned in the question), but since C++ code finished in less than a second, it was clearly faster than python version. And that was the whole point of my comment. Commented Dec 28, 2016 at 21:29

5 Answers 5

65

I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).

The answer is: your C++ code is not slower than your Python code when properly compiled.

I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.

I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main, to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled data variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles (dtype=np.float64), because there are doubles in C++ example. My pure C version of your code (it's similar):

#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
    int z;
    for( z = 0; z < N; z++ )
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for( i = 0; i < k_max; i++ )
    {
        for( j = 0; j < N-1; j++ )
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();

    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    return coefs1[0];
}

For k_max = 100000, N = 10000 results where following:

  • Python 70.284362 s
  • C++ 69.133199 s
  • C 61.638186 s

Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.

For k_max = 1000000, N = 1000 we have:

  • Python 115.42766 s
  • C++ 70.781380 s

For k_max = 1000000, N = 100:

  • Python 52.86826 s
  • C++ 7.050597 s

So the difference increases with fraction k_max/N, but python is not faster even for N much bigger than k_max, e. g. k_max = 100, N = 100000:

  • Python 0.651587 s
  • C++ 0.568518 s

Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.

Python script:

import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end-start))

    return y1, y2, res

if __name__ == '__main__':
    main()

# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s

C script:

#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, time;
    int i, j;
    for( i = 0; i < x_len; i++ )
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        res = 0.0;
        for( i = 0; i < x_len; i++ )
        {
            res += x[i];
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", time);

    return y1[0], y2[0], res;
}

// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s

Python results:

  • cos: 22.7199969292 s
  • multi: 0.841291189194 s
  • sum: 1.15971088409 s

C results:

  • cos: 20.910590 s
  • multi: 0.633281 s
  • sum: 1.153001 s

As you can see NumPy is incredibly fast, but always a bit slower than pure C.

Sign up to request clarification or add additional context in comments.

8 Comments

Sorry, but I have to downvote--the first code you have in this answer is completely broken (has undefined behavior, and at least in my testing dies with a segfault) and your final C code doesn't even seem to attempt to do the same things as originally intended.
I can successfully run it (despite returning double in place of int), but anyway, I don't understand your comment. In my opinion it does exactly the same, main loops are identical. Could you explain why my "final C code doesn't even seem to attempt to do the same things"?
@JerryCoffin I think I explained it in a paragraph above, but let others judge this.
@analytical_prat The general rule is to use the easiest, most expressive, and highest-level language that still runs fast enough. It'll be easier for you as it produces smaller code with fewer chances to have bugs with higher-level operations that further removes possibilities for bugs (e.g. doing a scalar times a matrix versus a for loop to achieve the same thing). Going straight to C++ is a sign you might be doing premature optimization (or perhaps you know that language the best). I suspect C++ template expressions are much faster for nontrivial cases unlike a scalar times a matrix.
The benchmark is flawed: coefs2, y1 and y2 are written but not read/used. Thus, an optimizing compiler can just remove the loop as they do not have any side effect (this is called the "as if rule" in C/C++). In fact, some compilers like Clang actually do that! For example, Clang 13 with -O3 do not compute y2. ICC do even more optimizations: it does not compute both y1 and y2. The OP code already have this issue though with both coefs1, coefs2... (hopefully mainstream compilers are not so clever to optimize this part yet).
|
13

I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...

Numpy is using complex Linear Algebra libraries ! Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.

I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.

enter image description here

C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).

Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from @Jfs, and we observe: "There is no difference between C++ and numpy on my machine."

Some more reference:

Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

1 Comment

one thing to note is that even seemingly simple stuff like array.max is actually extremely optimized (assembly) code - written directly by INTEL/AMD/APPLE So as a general rule, as long as the arrays are large naive c code will be slower.
7

On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).

I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:

#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>

const unsigned int k_max = 40000;
const unsigned int N = 10000;

template <class T>
class matrix2 {
    std::vector<T> data;
    size_t cols;
    size_t rows;
public:
    matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
    T &operator()(size_t y, size_t x) {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }

    T operator()(size_t y, size_t x) const {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }
};

int main() {
    matrix2<double> data(N, 2);
    matrix2<double> coeffs(k_max, 2);

    using namespace std::chrono;

    auto start = high_resolution_clock::now();

    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }

    auto end = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}

This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.

The next obvious step would be to use multiple cores. To do that in C++, we can add this line:

#pragma omp parallel for

...before the outer for loop:

#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
    for (int j = 0; j < N - 1; j++) {
        coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
        coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
    }
}

With -openmp added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).

13 Comments

Well... this is interesting. Your version uses much more c++ structures, but main calculations are the same. I would not expect c++ code (with unnecessary usage of its structures) to be faster than pure c (and here it is faster in the reference to the python version). So I ran your version and original ones and averaged over 10 runs. Results: original c++: 10.8 s, original python 8.5 s, your c++ version: 32.8 s. So, as expected, I can't reproduce improvement of yours.
@Gwidryj: Clearly you're doing something seriously wrong (probably compiling without optimization, but you haven't told us enough for anybody to diagnose your problem with any real certainty). As to using a struct slowing anything down--sorry, but that shouldn't be expected at all.
I am using gcc version 4.9.2 (Debian 4.9.2-10). Instruction to compile both c++ scripts: g++ -std=c++11 main.cpp -o main. I also checked that without initializing data with some finite values your code is ~3 times slower, but with initializing it with non-zeros only ~1.3 times slower.
@JerryCoffin Yes, by finite values I meant not zeros.
@Gwidryj: Zero is literally the furthest possible point from infinity.
|
3

I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs, hence should switch your for-loops. If this is the case, you should have the following:

#include <iostream>
#include <cmath>
#include <time.h>

const int k_max = 40000;
const int N = 10000;

double cos_k, sin_k;

int main(int argc, char const *argv[])
{
    time_t start, stop;
    double data[2][N];
    double coefs[k_max][2];

    time(&start);

    for(int i=0; i<k_max; ++i)
    {
        for(int j=0; j<N; ++j)
        {
            coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
            coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
        }
    }
    // End of main loop

    time(&stop);

    // Speed result
    double diff = difftime(stop, start);
    std::cout << "Time: " << diff << " seconds" << std::endl;

    return 0;
}

Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3, while Python code runs at 7.816 seconds.

4 Comments

I have tried the same thinking that locality of reference is the case here. I tried for many values of k_max and N, with zeros and non-zero values in data, but results were always practically the same, regardless the order of loops.
"I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs" - this is not true, there is no way switching for-loops can change the result. In both cases you run through the same sets of (i, j) pairs, just in different order.
@Gwidryj, Sure, it results in the same computation time. It was my observation in order to fully respect the formulae found on paper. What you didnt do it was the optimization flag: g++ main.cpp -o main -O3 .
@Gwidryj, or use -O2. Check what works better for you.
2

The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is about as fast as C++. There are some cases when C++ outperforms C due to the language offering optimizations that are impossible in C. Any other speed differences between C and C++ will come down to the compiler you use and the nature of the code being compiled. Comparing languages for speed is a complex topic.

When someone uses NumPy, they set the problem up by writing Python code. That Python code executes at a crawl compared to tuned C or C++ code. However, since the majority of the execution time takes place crunching the numbers in the C-written NumPy libraries, the extra time to set the problem up in Python is irrelevant. If you wrote the same thing entirely in C/C++, your total execution time might be 1% faster as a guess, because whether you wrote that code that sets up the problem in C/C++ or Python, the C-written library will account for the majority of the execution time as it performs the requested calculations. For this reason, people enjoy the niceties of Python while benefiting from the speed of C where it matters.

As an example, let's say you setup a calculation where you want to multiply two square matrices with 1 million entries each. The Python code will execute slowly but in a tiny amount of total time. Then, the C libraries will perform that tremendous matrix multiplication, accounting for the majority of the execution time.

If you are comparing pure Python to tuned C or C++, expect Python to be anywhere between 10 to 1000 times slower than C or C++, depending on what task your program is doing. Pure Python will also tend to use more memory as well. So say you wrote that huge matrix multiplication in only Python rather than using NumPy. The execution time would be tremendous compared to using C/C++ or using NumPy.

look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. And if I'm not mistaken, the top Python solutions end up calling compiled C or C++ libraries to do the majority of the heavy lifting, so the test isn't really comparing pure Python to C++. This same thing happens in other languages e.g. the top Java solutions tend to call code written in C/C++ in some of the problems.

There are many reasons why C/C++ are faster than pure Python.

  • Python is an interpreted language whereas C/C++ compiles into machine code. This extra step of interpretation adds to the execution time. Additionally, the code executes on a virtual machine, which must take some CPU and RAM resources as well. C/C++ code executes directly on the hardware.
  • Python has a garbage collector that must run an algorithm to analyze the state of the program at intervals to prove a chunk of allocated memory no longer needs to exist, and only then will the program release that memory. C/C++ is different. All the allocations and deallocations are done at certain points in the machine code, avoiding running a garbage collector algorithm altogether. This will impact the performance of programs that allocate and deallocate a lot of memory more than programs that don't. To gain an idea of why a garbage collector is slower, every object has to have a reference count. The algorithm does bookkeeping to understand where an object could still be used to keep it around in case it is used. Additionally, the algorithm has to handle cycles like if I had objects A, B, and C with B holding a reference to C, C holding a reference to B, and A holding a reference to B. If everything in the program holds no reference to A, the garbage collector has to figure out that it can deallocate the memory for A, B, and C.
  • There are no primitives in Python. Everything, including the lowly integer, is a full object with metadata and methods. In C/C++, an int is a tiny chunk of 1s and 0s directly representing the number. It is much faster to pass around and use this barebones representation of an integer rather than deal with objects that contain that representation. Let's say you want to check if two integers are equal. In C/C++, machine code does a quick, single assembly instruction to compare two integers located here and there. In Python, a method from the Integer class is called. This entails Python needs to look up the type of the object you're actually referencing, look up the name of the property on that object, check that property actually is a method, compile the method if it hasn't been jit'd, then finally it can jump to the code in the comparison method. if jit isn't enabled then the program starts executing that method's code line by line. This is also related to Python being a dynamic while C/C++ is static. Dynamic roughly means things are looked up at runtime whereas static means things are figured out at compile time and the answer is written directly into the machine code.
  • Everything isn't just an object but also a pointer to an object. In C/C++, you can have a chunk of memory directly housing primitives or objects, so you could have the bits of 10 integers side by side in a solid chunk of memory. In Python, you have 10 pointers side by side in memory, and you have one layer of indirection to access each integer, jumping to a random part of the heap to figure out the value of each integer. This matters for two reasons: First, indirection is slower than having none. Secondly, most computer architectures use spatial locality, meaning when you access a part of memory that is RAM, chunks of nearby memory are loaded up into the incredibly fast CPU cache alongside it. We are talking about a read to RAM being 100 times slower than a read from level 1 cache. The designers of the hardware do that, because often, when you access location 0 in memory, you access location 1 and then 2 and then 3 in memory next. So something like iterating over n integers in C/C++ is about as fast as you can iterate over n things as possible whereas you will only iterate quickly over the n pointers quickly in Python. You will then access random chunks of memory in the heap that hold the integer in question, and the n integers may not even be stored next to each other in the heap.
  • C/C++ has many features that map more closely to the typical operations found in an assembly language. This proximity to the actual hardware enables a programmer to write tuned code to an absurd degree. Python, on the other hand, is pretty much completely divorced from the hardware, meaning a programmer can do no such manual optimizations even if they want speed. This is a good point to say that you don't use Python for speed. You use it for quick development and hopefully fewer bugs.
  • In the case where C++ has higher-level abstractions, you get a boatload of options in the exact way the compiler should implement those abstractions. This control allows for writing faster code if desired. Python instead tries to make the use of its abstractions as convenient and simple as possible.
  • With C/C++, the compiler can perform advanced analysis and optimization when producing the machine code, because waiting a while for your code to compile is an accepted cost to the programmer's time when dealing with these languages. Since Python interprets the code line by line, it simply cannot perform similar optimizations, because "compilation" time adds to execution time! If you were to bring in a ton of heavy analysis and optimizations into the Python interpreter, your program would freeze for minutes or more before it began executing the tuned Python bytecode.
  • C/C++ has a ton of inconvenient undefined behavior in the name of speed. If you mess up, that is on you. The program will blissfully continue executing its machine code, and its output will become undefined. It will likely soon crash, but it could just as easily continue running for days, writing incorrect results to your database. Python is different. It is much more convenient, doing all sorts of checks to report an error as soon as one comes up. Checking things is slower than not checking things.

3 Comments

It is not true that C++ does more optimisation than pure C. This does not make sense and shows some doubts about the languages: compilers make improvements on your code and the compilers for c and cpp are usually the same (clang, gcc...).
@MarineGalantin C++ has templates. Take a look at the speed of std::sort in C++ compared to qsort in C. The language allows for more optimizations. Here, the compiler can inline the comparison function since it knows the types in use. The C solution cannot. Things aren't as clear as I thought though. If you look at The Benchmark Game website, C++ beats C severely in some tests but C beats C++ severely in others. Some are draws. The ones where C wins hands down seem to be parallelized whereas the C++ solution wasn't. C++ wins when they are both parallelized.
That benchmark website shows the complexity of comparing languages if nothing else. Some solutions are parallelized well, others parallelized poorly, others with manual use of SIMD instructions, and all sorts of coding tricks that aren't exactly standard programs written in one language compared to standard programs written in another, relying only on compilers for optimizations. Sometimes, the algorithms used are different too.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.