
I have pretty much the same code in Python and C. Python example:

import numpy
nbr_values = 8192
n_iter = 100000

a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
    a = numpy.sin(a)

C example:

#include <stdio.h>
#include <math.h>
int main(void)
{
  int i, j;
  int nbr_values = 8192;
  int n_iter = 100000;
  double x;
  for (j = 0; j < nbr_values; j++) {
    x = 1;
    for (i = 0; i < n_iter; i++)
      x = sin(x);
  }
  return 0;
}

Something strange happened when I ran both examples:

$ time python numpy_test.py 
real    0m5.967s
user    0m5.932s
sys     0m0.012s

$ g++ sin.c
$ time ./a.out 
real    0m13.371s
user    0m13.301s
sys     0m0.008s

It looks like Python/NumPy is twice as fast as C. Is there a mistake in the experiment above? How can you explain it?

P.S. I have Ubuntu 12.04, 8 GB RAM, and a Core i5, by the way.

  • did you compile your C code with optimizations? (-O2 or -O3) Commented Jan 22, 2013 at 19:58
  • Looks like no. Try gcc -O2 a.c Commented Jan 22, 2013 at 19:59
  • It isn't 'basically the same code' either. Commented Jan 22, 2013 at 20:03
  • With -O3 the C version is about 18000 times faster on my machine - probably because it optimises ALL of the loops away... ;) Commented Jan 22, 2013 at 20:03
  • @szx No, it only sets up the return value of main() and exits, never even calling sin. Commented Jan 22, 2013 at 20:09

3 Answers


First, turn on optimization. Second, subtleties matter: your C code is definitely not 'basically the same'.

Here is equivalent C code:

sinary2.c:

#include <math.h>
#include <stdlib.h>

float *sin_array(const float *input, size_t elements)
{
    size_t i;
    float *output = malloc(sizeof(float) * elements);
    for (i = 0; i < elements; ++i) {
        output[i] = sin(input[i]);
    }
    return output;
}

sinary.c:

#include <math.h>
#include <stdlib.h>

extern float *sin_array(const float *input, size_t elements);

int main(void)
{
    int i;
    int nbr_values = 8192;
    int n_iter = 100000;
    float *x = malloc(sizeof(float) * nbr_values);  
    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i=0; i<n_iter; i++) {
        float *newary = sin_array(x, nbr_values);
        free(x);
        x = newary;
    }
    return 0;
}

Results:

$ time python foo.py 

real    0m5.986s
user    0m5.783s
sys 0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out 

real    0m5.204s
user    0m4.995s
sys 0m0.208s

The reason the program has to be split in two is to fool the optimizer a bit. Otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files doesn't give the compiler visibility into the possible side-effects of sin_array when it's compiling main and so it has to assume that it actually has some and repeatedly call it.
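
If you'd rather keep everything in one file, another way to stop the optimizer from throwing the work away is to actually consume the result, for example by printing a checksum of the final array. Here is a minimal sketch of that approach, reusing the question's sizes (note that because every element starts at 1.0, a sufficiently aggressive compiler could in principle still fold the whole computation away, which is why the two-file version above is the more robust fence):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float sum = 0.0f;
    float *x = malloc(sizeof(float) * nbr_values);

    for (j = 0; j < nbr_values; ++j)
        x[j] = 1.0f;

    for (i = 0; i < n_iter; ++i)
        for (j = 0; j < nbr_values; ++j)
            x[j] = sinf(x[j]);

    /* Using the result here is what keeps the loops alive. */
    for (j = 0; j < nbr_values; ++j)
        sum += x[j];
    printf("checksum: %f\n", sum);

    free(x);
    return 0;
}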

Your original program is not at all equivalent for several reasons. One is that you have nested loops in the C version and you don't in Python. Another is that you are working with arrays of values in the Python version and not in the C version. Another is that you are creating and discarding arrays in the Python version and not in the C version. And lastly you are using float in the Python version and double in the C version.

Simply calling the sin function the appropriate number of times does not make for an equivalent test.

Also, the optimizer is a really big deal for C. Comparing unoptimized C code against anything else when you care about speed is simply the wrong thing to do. At the same time, you need to be mindful: the C optimizer is very sophisticated, and if you're testing something that really doesn't do anything, it may well notice and simply not do anything at all, resulting in a program that's ridiculously fast.


10 Comments

I think it would be instructive to understand which features in particular make the original C code so much slower than your version.
@NPE - I think it's actually just the optimizer. Almost everything else I did would likely have made the code slower. :-) I'll check though. Yeah, definitely the optimizer is what did it. Without the optimizer my code is MUCH slower.
If we eliminate the datatype difference (on my box, it has little effect on performance), it's not unreasonable to expect sinf() to be the dominant cost. Yet there's something about the OP's code that's very costly. From looking at the source and the disassembly, it's not obvious to me what it might be...
@NPE: That's a tricky question, because the optimizer will make mincemeat of the OP's original code. My guess is that the loop maintenance code is what takes the time. I think the only real way to tell is to look at the assembly output and make some educated guesses. That's actually the chief thing that makes numpy fast: it moves all the loop maintenance code into C.
I just ran oprofile on the original code as supplied above (with a printf("%f\n", x) at the end), and it spends about 65% of the time in "sin".

Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, that are generally derived for accuracy.

You are also not comparing apples with apples, as you are using double in C and float32 (float) in Python. If we change the Python code to calculate float64 instead, the time increases by about 2.5 seconds on my machine, making it roughly match the correctly optimized C version.
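
To see the same type effect from the C side, here is a rough sketch (timings will of course vary with machine and libm) that times single-precision sinf against double-precision sin on arrays of the same sizes as in the question:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 8192
#define ITER 100000

int main(void)
{
    float  *xf = malloc(sizeof(float) * N);
    double *xd = malloc(sizeof(double) * N);
    int i, j;
    clock_t t;

    for (j = 0; j < N; ++j) { xf[j] = 1.0f; xd[j] = 1.0; }

    t = clock();
    for (i = 0; i < ITER; ++i)
        for (j = 0; j < N; ++j)
            xf[j] = sinf(xf[j]);          /* single precision */
    printf("float:  %.2fs (xf[0] = %f)\n",
           (double)(clock() - t) / CLOCKS_PER_SEC, (double)xf[0]);

    t = clock();
    for (i = 0; i < ITER; ++i)
        for (j = 0; j < N; ++j)
            xd[j] = sin(xd[j]);           /* double precision */
    printf("double: %.2fs (xd[0] = %f)\n",
           (double)(clock() - t) / CLOCKS_PER_SEC, xd[0]);

    free(xf);
    free(xd);
    return 0;
}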

If the whole test were made to do something more complicated that requires more control structures (if/else, do/while, etc.), then you would probably see even less difference between C and Python, because the C compiler can't really make "sin" any faster - unless you implement a better "sin" function.
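
For illustration only, here is a sketch of what a "better" (faster but less accurate) sin could look like: a short polynomial that is only reasonable for small arguments. A real replacement needs careful argument reduction and a proper minimax polynomial, which, as the comment below points out, is exactly where optimized implementations tend to go wrong:

#include <math.h>
#include <stdio.h>

/* Crude 7th-order Taylor approximation of sin; only usable for roughly
   |x| <= 1, and deliberately skips argument reduction. Sketch only. */
static float sin_approx(float x)
{
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f
                  - x2 * (1.0f / 120.0f
                  - x2 * (1.0f / 5040.0f))));
}

int main(void)
{
    printf("sinf(0.5)       = %.7f\n", sinf(0.5f));
    printf("sin_approx(0.5) = %.7f\n", sin_approx(0.5f));
    return 0;
}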

Never mind the fact that your code isn't quite the same on both sides... ;)

1 Comment

I question whether numpy's sin is even correct. Most 'optimized' trig implementations get argument reduction horribly wrong.

You seem to be doing the same operation in C 8192 x 100000 times but only 100000 times in Python (I haven't used numpy before, so I may misunderstand the code). Why are you using an array in the Python case? (Again, I'm not used to numpy, so perhaps the dereferencing is implicit.) If you wish to use an array, be careful: doubles have a performance hit in terms of caching and optimised vectorisation. You're using different types between the two implementations (float vs double), but given the algorithm I don't think it matters.

The main reason for a lot of anomalous performance benchmark results pitting C against Python-this and Python-that is simply that the C implementation is often poor.

https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en

If you notice, the guy writes C to process an array of doubles (without using the restrict or const keywords where he could have), builds with optimisation, and then forces the compiler to use SIMD rather than AVX. In short, the compiler is using an inefficient instruction set for doubles and the wrong type of registers too, if performance is what he wanted; you can be sure numba and numpy will be using as many bells and whistles as possible, and will be shipped with very efficient C and C++ libraries to begin with. In short, if you want speed with C you have to think about it; you may even have to disassemble the code, and perhaps disable optimisation and use compiler intrinsics instead. C gives you the tools to do it, so don't expect the compiler to do all the work for you. If you want that degree of freedom, use Cython, Numba, Numpy, Scipy etc. They're very fast, but you won't be able to eke out every last bit of performance from the machine - to do that, use C, C++ or newer versions of FORTRAN.
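
As a concrete illustration of the restrict/const point (a hypothetical helper, not taken from the linked article): giving the compiler the guarantee that the arrays don't alias and that the inputs are read-only is what lets it auto-vectorize a loop like this with something like gcc -O3 -march=native:

#include <stdio.h>
#include <stddef.h>

/* const says the inputs are read-only; restrict says the three arrays
   never overlap. Both make auto-vectorization much easier. */
void scale_add(double *restrict out,
               const double *restrict a,
               const double *restrict b,
               double alpha, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        out[i] = alpha * a[i] + b[i];
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, out[4];
    scale_add(out, a, b, 2.0, 4);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}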

Here is a very good article on these very points (I'd use SciPy):

https://www.scipy.org/scipylib/faq.html

Comments
