I know this subject is well discussed, but I've come across a case where I don't really understand why the recursive method is slower than a method using reduce, lambda and xrange.

def factorial2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial2(x-1, rest*x)


def factorial3(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))

I know Python doesn't optimize tail recursion, so the question isn't about that. To my understanding, a generator should still generate n numbers using the +1 operator. So technically, fact(n) should add a number n times, just like the recursive one. The lambda in the reduce will be called n times, just like the recursive method... So since we don't have tail call optimization in either case, stack frames will be created/destroyed and returned from n times. And an if in the generator should check when to raise a StopIteration exception.
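To sanity-check that count, here is a small sketch (Python 3 syntax, where reduce lives in functools; the counted_mul name is mine) that counts how many times reduce actually invokes the reducing function:

```python
from functools import reduce  # reduce is a builtin in Python 2

calls = [0]  # mutable cell so the function below can count its invocations

def counted_mul(a, b):
    calls[0] += 1
    return a * b

result = reduce(counted_mul, range(1, 6))  # 1*2*3*4*5
print(result, calls[0])  # -> 120 4, i.e. n-1 multiplications for n numbers
```

So the reduce version does the same number of multiplications; the difference has to come from per-step overhead, not from the arithmetic itself.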

This makes me wonder why the recursive method is still slower than the other one, since the recursive one uses simple arithmetic and doesn't use generators.

In a test, I replaced rest*x by x in the recursive method, and the time spent dropped to be on par with the method using reduce.

Here are my timings for fact(400), 1000 times:

factorial3 : 1.22370505333

factorial2 : 1.79896998405

Edit:

Making the method count from 1 to n instead of n to 1 doesn't help either, so it's not overhead from the -1.

Also, can we make the recursive method faster? I tried multiple things, like global variables that I can change... using a mutable context by placing variables in cells that I can modify, like an array, and keeping the recursive method without parameters; passing the method used for recursion as a parameter so we don't have to "dereference" it in our scope...?! But nothing makes it faster.

I'll point out that I have a version of fact that uses a for loop and is much faster than both of these methods, so there is clearly room for improvement, but I wouldn't expect anything faster than the for loop.
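For reference, the for-loop version mentioned above presumably looks something like this (my reconstruction, not the asker's actual code):

```python
def factorial_loop(x):
    # A single frame and a simple loop: no per-step function-call overhead.
    result = 1
    for i in range(2, x + 1):
        result *= i
    return result
```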

  • I have a feeling that it's range being optimized. Replacing range/xrange with my own generator makes things much worse. Chances are that range is implemented in C and not in Python, which would explain why the recursive method cannot beat any method using range. Also, results could be much different using another interpreter, like PyPy. Commented Feb 9, 2017 at 8:01
  • How much worse does your own generator make it? I tested it as well and while it was worse than Python's xrange, it was still much better than the recursive solution. Commented Feb 9, 2017 at 8:09
  • 1
    While the timing difference is probably significant, the two runtimes are still within the same order of magnitude. So range/xrange doing the adding/iterating in C vs. factorial2 or your own generator doing it in Python might well explain the difference. Additionally, without tail recursion, factorial2 has to build up the recursive call stack (which means iterative memory allocation), and each iteration has a branch, which might hurt optimizations such as pipelining. Commented Feb 9, 2017 at 8:10
  • @das-g You think that calling a lambda multiple times in an iterative way would make Python reuse the stack frame it just destroyed? I guess that could be less heavy for the VM than creating 100 stack frames and then destroying 100 stack frames. Because if you reuse memory, you don't have to allocate as much. Commented Feb 9, 2017 at 8:14
  • 3
    Your recursion variant is unnecessarily slowed down by the second parameter: return x * factorial2(x-1) is 10% faster. Your reduce variant is slowed down by the lambda: operator.mul saves 20%. Commented Feb 9, 2017 at 8:21
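The two micro-optimizations from the last comment can be sketched like this (Python 3 syntax; the function names are mine, and the exact savings will vary by interpreter):

```python
import operator
from functools import reduce  # builtin in Python 2

def factorial_recursive_plain(x):
    # Single-argument recursion: no extra named parameter to resolve per call.
    return 1 if x <= 1 else x * factorial_recursive_plain(x - 1)

def factorial_reduce_mul(x):
    # operator.mul runs at C level, avoiding a Python-level lambda call per step.
    return 1 if x <= 1 else reduce(operator.mul, range(1, x + 1))
```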

1 Answer


The slowness of the recursive version comes from the need to resolve the named (argument) variables on each call. I have provided a different recursive implementation that has only one argument, and it works slightly faster.

$ cat fact.py 
def factorial_recursive1(x):
    if x <= 1:
        return 1
    else:
        return factorial_recursive1(x-1)*x

def factorial_recursive2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial_recursive2(x-1, rest*x)

def factorial_reduce(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))

# Ignore the rest of the code for now, we'll get back to it later in the answer
def range_prod(a, b):
    if a + 1 < b:
        c = (a+b)//2
        return range_prod(a, c) * range_prod(c, b)
    else:
        return a
def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n+1)

$ ipython -i fact.py 
In [1]: %timeit factorial_recursive1(400)
10000 loops, best of 3: 79.3 µs per loop
In [2]: %timeit factorial_recursive2(400)
10000 loops, best of 3: 90.9 µs per loop
In [3]: %timeit factorial_reduce(400)
10000 loops, best of 3: 61 µs per loop

Since very large numbers are involved in your example, I initially suspected that the performance difference might be due to the order of multiplication. Multiplying a large partial product by the next number on every iteration takes time proportional to the number of digits/bits in the product, so the time complexity of such a method is O(n²), where n is the number of bits in the final product. It is better instead to use a divide-and-conquer technique, where the final result is obtained as a product of two approximately equally long values, each of which is computed recursively in the same manner. So I implemented that version too (see factorial_divide_and_conquer(n) in the code above). As you can see below, it still loses to the reduce()-based version for small arguments (due to the same problem with named parameters) but outperforms it for large arguments.

In [4]: %timeit factorial_divide_and_conquer(400)
10000 loops, best of 3: 90.5 µs per loop
In [5]: %timeit factorial_divide_and_conquer(4000)
1000 loops, best of 3: 1.46 ms per loop
In [6]: %timeit factorial_reduce(4000)
100 loops, best of 3: 3.09 ms per loop
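As a quick sanity check (my addition, reproducing the functions above so the snippet is self-contained), the divide-and-conquer version agrees with math.factorial, and its recursion splits the range at the midpoint:

```python
import math

def range_prod(a, b):
    # Product over the half-open range [a, b), split at the midpoint.
    if a + 1 < b:
        c = (a + b) // 2
        return range_prod(a, c) * range_prod(c, b)
    else:
        return a

def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n + 1)

# e.g. range_prod(1, 9) is computed as (1*2*3*4) * (5*6*7*8)
assert factorial_divide_and_conquer(400) == math.factorial(400)
```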

UPDATE

Trying to run the factorial_recursive?() versions with x=4000 hits the default recursion limit, so the limit must be increased:

In [7]: sys.setrecursionlimit(4100)
In [8]: %timeit factorial_recursive1(4000)
100 loops, best of 3: 3.36 ms per loop
In [9]: %timeit factorial_recursive2(4000)
100 loops, best of 3: 7.02 ms per loop

12 Comments

Nice, I didn't think about doing that, as I'm used to making recursive functions tail-call optimizable. I realize that in Python it's pointless to do so. Making the recursive function a tree process not only works faster but also keeps the call stack much less deep, avoiding the maximum recursion depth. You could still reach it with very large numbers, but that's a clear improvement on my version. Also, as pointed out in the comments, I'd guess that you need to allocate less memory as the stack is less deep, resulting in a speed improvement.
Can you add timings for factorial_recursive1 and factorial_recursive2 with x=4000 as well?
@StefanPochmann I can't - they exceed the recursion depth limit
@Leon Can't you increase it? import sys; sys.setrecursionlimit(4100)
@guidot you meant O(log(n)²). I considered that while writing the answer, but I think that my version is easier to comprehend.