Whenever we write backend systems, we use frameworks that manage the async runtime for us. These frameworks use various techniques to ensure our service can scale to many concurrent clients and process requests in parallel. This blog details the inner workings of these runtimes and how they are able to process multiple requests asynchronously.
Why do we want concurrency?
Imagine we have a backend server running on a single-core machine. Ideally, we would want the server to accept connections from multiple clients, handle their requests, and respond as soon as each request completes, without interruptions. If we do not enable concurrency, i.e. if the requests are not multiplexed onto the single core, all the connections and request handling end up happening sequentially, which is not great for our clients. Especially when doing IO work, our single-core machine spends most of its time waiting for the IO operation to complete instead of serving other incoming requests. Therefore, in order to use the compute efficiently and serve requests in parallel, concurrency is a requirement.
How can we take advantage of concurrency to process requests in parallel?
In general, there are two ways to schedule IO work on the CPU so that it can be processed concurrently: 1) a thread per request, or a fixed number of threads (a thread pool), performing blocking IO, and 2) non-blocking IO using an event loop.
1. Blocking IO with thread pool
Blocking IO means that the thread doing the work is blocked on the IO operation and cannot proceed until it completes. Take the following Python code, which reads a record from a MongoDB collection:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["testdb"]
users = db["users"]

user_id = 1
# find_one blocks the calling thread until MongoDB responds
user = users.find_one({"id": user_id})
print("User:", user)
This is the synchronous version of reading data from a collection: calling find_one blocks the thread from proceeding further, i.e. it cannot perform any other task, such as handling another request. However, if we want to perform these operations in parallel, we can schedule each operation on a separate thread so that they can happen at the same time. Check the following code:
from pymongo import MongoClient
from concurrent.futures import ThreadPoolExecutor, as_completed

client = MongoClient("mongodb://localhost:27017")
db = client["testdb"]
users = db["users"]

def read_user(user_id):
    # Each call blocks its own worker thread until MongoDB responds
    user = users.find_one({"id": user_id})
    print(f"User {user_id}:", user)
    return user

user_ids = [1, 2, 3, 4, 5, 6, 7, 8]

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(read_user, user_id) for user_id in user_ids]
    for future in as_completed(futures):
        _ = future.result()
Here, the ThreadPoolExecutor runs each read operation on an OS thread. The OS threads are then scheduled onto the CPU by the kernel. Each OS thread shares the compute and is allocated a slice of time on the CPU to execute its action, i.e. call MongoDB to read a record. Similar to the example above, any request coming into our backend service can be submitted to a ThreadPoolExecutor so that it can be scheduled onto the CPU and executed in parallel, as the sketch below illustrates. Increasing the number of workers increases the number of active OS threads and, as a result, the throughput of your service.
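To make the backend analogy concrete, here is a minimal sketch of a TCP server that hands every accepted connection to a ThreadPoolExecutor. The echo-style handle_connection handler, the port, and the worker count are illustrative placeholders; a real service would parse the request and make the blocking MongoDB call inside the handler.

import socket
from concurrent.futures import ThreadPoolExecutor

def handle_connection(conn, addr):
    # Placeholder handler: a real service would parse the request here
    # and perform the blocking MongoDB call on this worker thread.
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"echo: " + data)

def serve(host="127.0.0.1", port=8080, workers=5):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind((host, port))
            server.listen()
            while True:
                # accept() runs on the main thread; each connection is
                # then handled concurrently by a worker thread from the pool
                conn, addr = server.accept()
                executor.submit(handle_connection, conn, addr)

serve()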
This setup, however, does have some significant downsides:
- It leads to an increase in the memory footprint of the service as we add more OS threads to improve throughput. Each OS thread has a default stack size of around 8MB (on Linux; it depends on the OS) and also requires additional storage for its metadata. If you are using thread-local data, this can further increase memory usage as you scale the number of threads to handle more concurrency.
- Increasing the number of OS threads leads to more context switching when scheduling these threads on the CPU. This is a problem because every time a context switch occurs, the state of the currently scheduled thread must be saved from the CPU registers and the state of the new thread must be loaded into them.
- With high-latency IO operations, the OS threads are simply blocked, wasting memory resources.
2. Non-blocking IO
As the name suggests, in non-blocking IO the thread is not blocked on an IO operation; it can continue to serve other requests and execute other async operations. This is made possible by the event loop, a pattern that leverages a kernel interface like epoll/kqueue/io_uring to let a system wait for and react to IO events without blocking. Take the same example as above, i.e. retrieving a record from a MongoDB collection, but leveraging the event loop to make it non-blocking:
import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def read_user(user_id):
    client = AsyncIOMotorClient("mongodb://localhost:27017")
    db = client["testdb"]
    users = db["users"]
    # The coroutine suspends here; the event loop is free to run other tasks
    user = await users.find_one({"id": user_id})
    print("User:", user)

asyncio.run(read_user(1))
Python uses asyncio, a library that allows you to write coroutines using the async/await syntax. These coroutines are managed by the event loop and are multiplexed onto a single OS thread. In the above code, read_user is a coroutine that is scheduled onto the event loop. As soon as execution reaches the await statement, the coroutine yields control back to the event loop. The event loop can now process other tasks, such as serving other requests from clients or making additional MongoDB calls. Once the IO operation completes, the kernel interface (epoll/kqueue/io_uring) notifies the event loop, which resumes the coroutine, i.e. executes the code after the await until the next await statement is reached, and the process repeats (check out this blog on how the event loop works in JS).
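To mirror the thread-pool example, here is a sketch that multiplexes the same eight reads onto one event loop with asyncio.gather; the connection string, database, and collection names are the same illustrative values used earlier.

import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def read_user(users, user_id):
    # Suspends at await; the loop runs the other read_user tasks meanwhile
    user = await users.find_one({"id": user_id})
    print(f"User {user_id}:", user)
    return user

async def main():
    client = AsyncIOMotorClient("mongodb://localhost:27017")
    users = client["testdb"]["users"]
    # All eight reads are in flight concurrently on a single OS thread
    await asyncio.gather(*(read_user(users, uid) for uid in range(1, 9)))

asyncio.run(main())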
In this event-driven architecture, the memory footprint of your system is lower, since you need only one OS thread to execute tasks concurrently, and because all async requests are multiplexed onto the same thread, there is no context-switching overhead. However, the major problems with single-threaded event loop systems are:
- Any kind of CPU-bound work halts the entire event loop. For example, if you are de-serializing a big JSON object, the event loop cannot check the status of other coroutines or process other requests until it finishes (see the sketch after this list).
- It is harder to take advantage of multi-core machines, e.g. due to Python's GIL (Global Interpreter Lock).
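As a quick illustration of the first problem, the following sketch uses an arbitrary busy-loop standing in for heavy JSON de-serialization: the heartbeat coroutine stops ticking while the synchronous, CPU-bound coroutine holds the loop.

import asyncio
import time

async def heartbeat():
    # Should tick every 100 ms if the event loop is free
    for _ in range(5):
        print("tick", time.perf_counter())
        await asyncio.sleep(0.1)

async def cpu_bound():
    # Synchronous, CPU-heavy work: there is no await, so the loop cannot
    # run heartbeat() again until this coroutine finishes
    sum(i * i for i in range(20_000_000))
    print("cpu-bound work done")

async def main():
    await asyncio.gather(heartbeat(), cpu_bound())

asyncio.run(main())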
These two problems are mostly language specific, e.g. to Python and JavaScript. Other languages like Rust use a slightly different mechanism to overcome them: green threads. These are threads that are managed by the language's runtime instead of the OS, so their memory footprint is much smaller (a few KB) and they are scheduled by the runtime. The green threads are multiplexed onto OS threads, and each core runs a single OS thread. Therefore, adding more cores to your machine translates directly into more throughput, and CPU-bound work doesn't bring the event loop to a complete halt. (One thing to keep in mind: the event loop is not really designed for CPU-bound work, so using it primarily for that is a misuse; besides, CPU-bound work is generally synchronous, in which case you probably don't need an event loop.) Rust's most popular runtime, Tokio, uses an event loop per OS thread and schedules the green threads across the OS thread pool. This provides a significant improvement in throughput while maintaining a small memory footprint.
Conclusion
Choosing between thread pools and event loops for concurrent I/O depends on your specific requirements and constraints. Thread pools excel in scenarios where you need to handle mixed workloads (both I/O and CPU-bound tasks) and want to leverage multiple cores effectively, but they come with higher memory overhead and context switching costs. Event loops shine for I/O-heavy workloads with their minimal memory footprint and efficient single-threaded execution, but struggle with CPU-bound tasks and multi-core utilization in languages like Python and JavaScript. Modern runtimes like Tokio represent the evolution of these patterns, combining the best of both worlds through green threads and multi-threaded event loops to achieve high throughput while maintaining memory efficiency.