
I was migrating a production system to async when I realized the synchronous version is 20x faster than the async version. I was able to create a very simple example that demonstrates this in a repeatable way:

Asynchronous Version

import asyncio, time

data = {}

async def process_usage(key):
    data[key] = key

async def main():
    await asyncio.gather(*(process_usage(key) for key in range(0,1000000)))

s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")

This takes 19 seconds. The code loops through 1M keys and builds a dictionary, data, where each key maps to itself.

$ python3.7 async_test.py
Took 19.08 seconds.

Synchronous Version

import time

data = {}

def process_usage(key):
    data[key] = key

def main():
    for key in range(0,1000000):
        process_usage(key)

s = time.perf_counter()
results = main()
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")

This takes 0.17 seconds! And it does exactly the same thing as the version above.

$ python3.7 test.py
Took 0.17 seconds.

Asynchronous Version with create_task

import asyncio, time

data = {}

async def process_usage(key):
    data[key] = key

async def main():
    for key in range(0, 1000000):
        # note: these tasks are never awaited; asyncio.run() cancels
        # any still pending once main() returns
        asyncio.create_task(process_usage(key))

s = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - s
print(f"Took {elapsed:0.2f} seconds.")

This version brings it down to 11 seconds.

$ python3.7 async_test2.py
Took 11.91 seconds.

Why does this happen?

In my production code I will have a blocking call in process_usage where I save the value of key to a redis database.
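If I simulate that blocking call with asyncio.sleep (just a stand-in for the real redis write, over a much smaller batch of keys), the picture changes and the async version pulls ahead:

```python
import asyncio, time

async def process_usage(key):
    # asyncio.sleep stands in for the real redis write (an assumption);
    # each "write" costs 10 ms of I/O latency
    await asyncio.sleep(0.01)

async def main():
    # 100 overlapped 10 ms waits
    await asyncio.gather(*(process_usage(key) for key in range(100)))

s = time.perf_counter()
asyncio.run(main())
elapsed = time.perf_counter() - s
# finishes in far less than the ~1 s a sequential version would need
print(f"Took {elapsed:0.2f} seconds.")
```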

  • Well for one, your asynchronous code has to generate a function call with 1 million arguments, which will require loading that into memory. Whereas your synchronous code just uses the efficient range() iterator. Commented May 7, 2019 at 16:01
  • @KyleWillmon I'm new to async; is there a better way to do this? In production I also have to loop through 1M keys, but from a database, not the range function. Commented May 7, 2019 at 16:04
  • As far as I know, you're always going to need quite a bit of overhead to keep track of 1 million coroutines. However, 19 seconds does seem excessive for this trivial example. Perhaps someone else can explain more about that. Commented May 7, 2019 at 16:14
  • Why would you expect asyncio to be faster here? You're doing completely CPU-bound work. Commented May 7, 2019 at 17:23
  • But then your benchmark has no bearing on what you care about. Commented May 7, 2019 at 17:28

2 Answers


When comparing these benchmarks, note that the asynchronous version is, well, asynchronous: asyncio spends considerable effort ensuring that the coroutines you submit can run concurrently. In your particular case they don't actually run concurrently because process_usage doesn't await anything, but asyncio doesn't know that in advance. The synchronous version, on the other hand, makes no such provisions: it just runs everything sequentially, hitting the interpreter's happy path.

A more reasonable comparison would be for the synchronous version to try to parallelize things in the way idiomatic for synchronous code: by using threads. Of course, you won't be able to create a separate thread for each process_usage because, unlike asyncio with its tasks, the OS won't allow you to create a million threads. But you can create a thread pool and feed it tasks:

import concurrent.futures

def main():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for key in range(0, 1000000):
            executor.submit(process_usage, key)
        # exiting the "with" block automatically waits
        # for all submitted futures to finish

On my system this takes ~17s, whereas the asyncio version takes ~18s. (The faster asyncio version takes ~13s.)

If the speed gain of asyncio is so small, one could ask why bother with asyncio? The difference is that with asyncio, assuming idiomatic code and IO-bound coroutines, you have at your disposal a virtually unlimited number of tasks that in a very real sense execute concurrently. You can create tens of thousands of asynchronous connections at the same time, and asyncio will happily juggle them all at once, using a high-quality poller and a scalable coroutine scheduler. With a thread pool the number of tasks executed in parallel is always limited by the number of threads in the pool, typically in the hundreds at most.
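Unlimited concurrency is not always desirable, though: if the backend can be overwhelmed, asyncio.Semaphore is the idiomatic way to cap the number of in-flight tasks. A minimal sketch, where the limit of 10 and the asyncio.sleep standing in for real I/O are assumptions:

```python
import asyncio

LIMIT = 10  # arbitrary cap on in-flight tasks (an assumption)
peak = 0    # highest observed number of concurrently running tasks
active = 0

async def process_usage(key, sem):
    global active, peak
    async with sem:            # at most LIMIT coroutines get past this point at once
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0)  # stand-in for the real I/O call
        active -= 1

async def main():
    sem = asyncio.Semaphore(LIMIT)
    await asyncio.gather(*(process_usage(k, sem) for k in range(1000)))

asyncio.run(main())
print(f"peak concurrency: {peak}")
```

The 1000 tasks all exist at once, but the semaphore ensures no more than 10 are past the `async with` at any moment.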

Even toy examples have value, for learning if nothing else. If you are using such microbenchmarks to make decisions, I suggest investing some more effort to give the examples more realism. The coroutine in the asyncio example should contain at least one await, and the sync example should use threads to emulate the same amount of parallelism you obtain with async. If you adjust both to match your actual use case, then the benchmark actually puts you in a position to make a (more) informed decision.


3 Comments

Thanks, this helped me better understand why this happens. My actual production function does an async write to redis with aioredis but I now understand the source of the overhead.
@Jonathan It would be interesting to examine your original problem in more detail. It's far from clear why parallel asyncio connections to redis would fare slower than the same number of sequential connections, except redis itself getting overwhelmed and underperforming. Perhaps the performance of your code would be best improved through judicious use of semaphores or a queue feeding a fixed number of workers. Creating a huge number of concurrent tasks is possible in asyncio, but it doesn't mean that it's the optimal approach for every problem.
It basically reads usage data for each key from a dict and, if usage was above e.g. 1000, writes the key to a rate_limit redis db. It's really that simple. The synchronous version takes 1s; I'm trying out a mix of synchronous for that and async for the rest of the script (to do a number of batch writes to dynamodb). Hope that helps.

Why does this happen?

TL;DR

Because using asyncio doesn't by itself speed up code. You need multiple network I/O-bound operations running concurrently to see a difference from the synchronous version.

Detailed

asyncio is not magic that speeds up arbitrary code. With or without asyncio, your code is still run by a CPU with limited performance.

asyncio is a way to manage multiple execution flows (coroutines) in a nice, clear way. Multiple execution flows allow you to start the next I/O-related operation (such as a request to a database) before waiting for the previous one to complete. Please read this answer for a more detailed explanation.
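That overlap is easy to see with two emulated I/O waits (asyncio.sleep standing in for network calls): run concurrently, they take roughly as long as the slowest one, not the sum:

```python
import asyncio, time

async def fake_io(delay):
    # stand-in for a network request taking `delay` seconds
    await asyncio.sleep(delay)

async def sequential():
    # second wait starts only after the first finishes: ~0.2 s total
    await fake_io(0.1)
    await fake_io(0.1)

async def concurrent():
    # second wait starts before the first finishes: ~0.1 s total
    await asyncio.gather(fake_io(0.1), fake_io(0.1))

s = time.perf_counter()
asyncio.run(sequential())
seq = time.perf_counter() - s

s = time.perf_counter()
asyncio.run(concurrent())
conc = time.perf_counter() - s

print(f"sequential: {seq:0.2f}s, concurrent: {conc:0.2f}s")
```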

Please also read this answer for explanation when it makes sense to use asyncio.

Once you start using asyncio the right way, the overhead of using it should be much lower than the benefits you get from parallelizing I/O operations.

