
I have read this blog, which shows how an algorithm got a 250x speed-up by using numpy. I have tried to improve the following code with numpy, but I couldn't make it work:

    for i in nodes[1:]:
        for lb in range(2, diameter+1):
            not_valid_colors = set()
            valid_colors = set()

            for j in nodes:
                if j == i:
                    break

                if distances[i-1, j-1] >= lb:
                    not_valid_colors.add(c[j, lb])
                else:
                    valid_colors.add(c[j, lb])

                c[i, lb] = choose_color(not_valid_colors, valid_colors)

    return c

Explanation

The code above is part of an algorithm used to calculate the self-similar dimension of a graph. It works basically by constructing dual graphs G', in which a node is connected to every other node whose distance from it is greater than or equal to a given value (Lb), and then computing the graph coloring on those dual networks.

The algorithm description is the following (a small worked sketch follows the list):

  1. Assign a unique id from 1 to N to all network nodes, without assigning any colors yet.
  2. For all Lb values, assign the color value 0 to the node with id = 1, i.e. C[1][Lb] = 0.
  3. Set the id value i = 2. Repeat the following until i = N.
    • a) Calculate the distance l_ij from i to all the nodes in the network with id j less than i.
    • b) Set Lb = 1
    • c) Select one of the unused colors C[j][l_ij] from all nodes j < i for which l_ij ≥ Lb. This becomes the color C[i][Lb] of node i for the given Lb value.
    • d) Increase Lb by one and repeat (c) until Lb = Lb_max.
    • e) Increase i by 1.
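
To make the steps concrete, here is a small self-contained sketch (not my actual code; the 4-node path graph and the greedy "smallest unused color" choice are just for illustration) that transcribes the steps literally:

    import numpy as np

    # shortest-path distances for the path graph 1-2-3-4 (node ids are 1-based)
    distances = np.array([[0, 1, 2, 3],
                          [1, 0, 1, 2],
                          [2, 1, 0, 1],
                          [3, 2, 1, 0]])

    N = distances.shape[0]
    diameter = int(distances.max())                   # Lb_max
    c = np.zeros((N + 1, diameter + 1), dtype=int)    # c[i, Lb]; row/column 0 unused

    for i in range(2, N + 1):                         # step 3: i = 2 .. N
        for lb in range(1, diameter + 1):             # steps (b)-(d): every Lb
            # colors of nodes j < i that are "far" from i (l_ij >= Lb) are forbidden
            forbidden = {c[j, lb] for j in range(1, i) if distances[i - 1, j - 1] >= lb}
            color = 0                                 # step (c): smallest color not forbidden
            while color in forbidden:
                color += 1
            c[i, lb] = color

    print(c[1:, 1:])                                  # one row per node, one column per Lb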

I wrote it in Python, but it takes more than a minute when I try to use it with small networks of 100 nodes and p = 0.9.

As I'm still new to Python and numpy, I have not found a way to improve its efficiency.

Is it possible to remove the loops by using numpy.where to find where the paths are longer than the given Lb? I tried to implement it, but it didn't work...

Comments

  • But there are a lot more numpy-knowledgeable posters on SO than on CR. CR also tends to be very picky about question format. 'Vectorizing' is a common topic on SO numpy. Commented Oct 15, 2015 at 17:26
  • numpy can speed things up a lot IF your problem is essentially parallel in nature: do the same thing for all elements, regardless of order. But if the problem is serial, where the action on the i'th element depends on the prior action on the (i-1)'th, then numpy usually does not help. Commented Oct 15, 2015 at 17:28
  • Agree with @hpaulj on asking performance and vectorizing related questions here. Also, to the OP: it would be easier to vectorize the code using lists, or better yet numpy arrays, instead of sets. Commented Oct 15, 2015 at 17:31
  • Please add a minimal example of nodes and distances and your attempt to vectorize with numpy. Commented Oct 15, 2015 at 17:35
  • Using for j in nodes[:i] is more pythonic than an explicit if ... then break. Commented Oct 15, 2015 at 17:40

2 Answers


Vectorized operations with numpy arrays are fast because the actual calculations are done by underlying libraries such as BLAS and LAPACK, without Python overhead. With loop-intensive operations, you will not see those benefits.

You usually have to figure out a way to vectorize operations (usually possible with a smart use of array slicing). Some operations are inherently loop-intensive, however, and sometimes it is not easy to vectorize them (which seems to be the case for your code).
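
As an illustration of what slicing can buy you here: the inner loop over j in your code could probably be replaced with boolean masking on one row of the distance matrix. A rough, untested sketch (assuming distances and c are numpy arrays indexed exactly as in your code):

    import numpy as np

    # hypothetical helper replacing only the inner j-loop, for a given i and lb
    def split_colors(distances, c, i, lb):
        d_row = distances[i - 1, :i - 1]        # distances from node i to all nodes j < i
        far = d_row >= lb                       # boolean mask: which of those nodes are "far"
        prev = c[1:i, lb]                       # colors already assigned to nodes 1 .. i-1
        return set(prev[far]), set(prev[~far])  # (not_valid_colors, valid_colors)

The outer loops over i and Lb remain serial, because the color of node i depends on colors already assigned to nodes 1..i-1; that serial dependence is what makes full vectorization difficult here.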

In those cases, you can first try Numba, which generates optimized machine code from a Python function without requiring modifications to the code: you just decorate the function and it compiles it automatically for you. I do not have a lot of experience with it, and have not tried using it for complicated functions.
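
For instance, a Numba-compiled loop in the same spirit as your code might look roughly like this (only a sketch, not benchmarked; it assumes distances and c are 2-D numpy integer arrays indexed as in your code, and it inlines a simple "smallest color not used by a far node" rule in place of your choose_color, which may not match your exact coloring rule):

    import numpy as np
    from numba import njit

    @njit
    def assign_colors(distances, c, diameter):
        n = distances.shape[0]
        for i in range(2, n + 1):
            for lb in range(2, diameter + 1):
                # mark colors already used by earlier nodes at distance >= lb
                used = np.zeros(n + 1, dtype=np.uint8)
                for j in range(1, i):
                    if distances[i - 1, j - 1] >= lb:
                        used[c[j, lb]] = 1
                # pick the smallest color not marked as used
                color = 0
                while used[color]:
                    color += 1
                c[i, lb] = color
        return c

Keep in mind the first call includes compilation time; repeated calls reuse the compiled code.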

If that does not work, then you can use Cython, which automatically converts Python-like code (with typed variables) into efficient C code and generates a Python extension module that you can import and use from Python. That will usually give you at least one order of magnitude (often two orders of magnitude) of speedup for loop-intensive operations. I generally find Cython easy to use since, unlike pure C, you can access numpy arrays directly in Cython code.

I recommend using the Anaconda Python distribution, since it lets you install these packages easily. I'm sorry I don't have a specific answer for your code.


4 Comments

Thank you for the explanation and suggestions, I'll try both options and let you know the results!
You do not have to use numpy to use Cython. In fact it may be more effective with regular Python list operations - check its docs.
I never said you have to. :)
Thank you again @joon, I have tested Cython and the performance was better. I am sure it can still be improved, so I opened another question as a follow-up on how to improve the Cython code; you can see it at stackoverflow.com/questions/33160102

If you want to go to numpy, you can just change the lists into arrays; for example, distances[i-1][j-1] becomes distances[i-1, j-1] once you declare distances as a numpy array, and the same applies to c[i][lb]. You should think a bit more about valid_colors and not_valid_colors, because you cannot append to numpy arrays: they have a fixed length, so you would have to fix a maximum size beforehand. Another idea is that, once everything is in numpy, you can cythonize your code (http://docs.cython.org/src/tutorial/cython_tutorial.html), which means all your loops will become very fast. In any case, even if you don't want Cython, if you look at the blog you will see that distances is declared as an array in main().
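
For example (a rough sketch with made-up sizes), the two sets could become preallocated boolean arrays indexed by color, which is friendlier for both numpy and Cython:

    import numpy as np

    max_colors = 100                              # the number of colors is bounded by the number of nodes
    not_valid = np.zeros(max_colors, dtype=bool)  # not_valid[color]: color used by some "far" node
    valid = np.zeros(max_colors, dtype=bool)

    # inside the j-loop, instead of set.add():
    #     if distances[i-1, j-1] >= lb:
    #         not_valid[c[j, lb]] = True
    #     else:
    #         valid[c[j, lb]] = True
    #
    # and before moving to the next lb, reset both with:
    #     not_valid[:] = False
    #     valid[:] = False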

