# Common threading models

## Connections own threads or processes exclusively

In this model, a thread/process handles all messages from a connection and does not quit or do other jobs until the connection is closed. As the number of connections increases, the resources occupied by threads/processes and the costs of context switches become more and more overwhelming, making servers perform poorly. This situation is summarized as the [C10K](http://en.wikipedia.org/wiki/C10k_problem) problem, which was common in early web servers but is rarely present today.
## Single-threaded reactor
Event-loop libraries such as [libevent](http://libevent.org/) and [libev](http://software.schmorp.de/pkg/libev.html) are typical examples. In this model an event dispatcher is responsible for waiting on different kinds of events and calling the corresponding event handler **in-place** when an event occurs. After all pending handlers are called, the dispatcher waits for more events, which forms a "loop". Essentially this model multiplexes (interleaves) code written in different handlers onto one system thread. An event loop can only utilize one core, so this kind of program is suitable only when it is IO-bound or each handler runs in a short and deterministic time (as in HTTP servers); otherwise one long-running callback blocks the whole program and causes high latencies. In practice this kind of program is not suitable for many developers working together, because a single person adding inappropriate blocking code may significantly slow down the reactivity of all other code. Since event handlers don't run simultaneously, race conditions between callbacks are relatively simple, and in some scenarios locks are not needed. These programs are often scaled by deploying more processes.
How single-threaded reactors work and related problems are demonstrated below (the Chinese characters in red read: "Uncontrollable! unless the service is specialized"):
![img](../images/threading_overview_1.png)
## N:1 threading library
Also known as [fibers](http://en.wikipedia.org/wiki/Fiber_(computer_science)). Typical examples are [GNU Pth](http://www.gnu.org/software/pth/pth-manual.html) and [StateThreads](http://state-threads.sourceforge.net/index.html). This model maps N user threads onto a single system thread (LWP); only one user thread runs at a time, and the running user thread does not switch to another until a blocking primitive is called (cooperative scheduling). N:1 threading libraries are equal to single-threaded reactors in capability, except that callbacks are replaced by contexts (stacks, registers, signals) and running a callback becomes jumping to a context. Like event-loop libraries, an N:1 threading library cannot utilize multiple CPU cores, so it is only suitable for specialized applications. However, a single system thread is friendlier to CPU caches, and with support for signal masks removed, context switches between user threads can be done very fast (100~200ns). In general, N:1 threading libraries perform as well as event-loop libraries and are also scaled by deploying more processes.
## Multi-threaded reactor
Kylin and [boost::asio](http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio.html) are typical examples. One or several threads each run an event dispatcher; when an event occurs, the event handler is queued into a worker thread to run. This model extends the single-threaded reactor intuitively and is able to make use of multiple CPU cores. Since sharing memory addresses makes interactions between threads much cheaper, the worker threads are able to balance loads with each other frequently; by contrast, multiple single-threaded reactors basically depend on the front-end servers to distribute traffic. A well-implemented multi-threaded reactor is likely to utilize CPU cores more evenly than multiple single-threaded reactors on the same machine. However, due to [cache coherence](atomic_instructions.md#cacheline), multi-threaded reactors are unlikely to achieve linear scalability on CPU cores. In particular scenarios, a badly implemented multi-threaded reactor running on 24 cores is even slower than a well-tuned single-threaded reactor. Because a multi-threaded reactor has multiple worker threads, one blocking event handler may not delay other handlers; as a result, event handlers are not required to be non-blocking unless all worker threads are blocked, in which case the overall progress is affected. In fact, most RPC frameworks are implemented in this model with event handlers that may block, for example synchronously waiting for RPCs to downstream servers. The reactor pattern also has a proactor variant, which replaces the event dispatcher with asynchronous IO; boost::asio is a proactor on [Windows](http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx).
How multi-threaded reactors work and problems related are demonstrated below:
![img](../images/threading_overview_2.png)
# What else can we improve?
## Extensibility is not good enough
Ideally the capabilities of the reactor model are maximized when all source code is programmed in an event-driven manner, but in reality, because of the difficulty of coding and maintaining callbacks, users are likely to mix usages: synchronous IO is often issued in callbacks, blocking worker threads from processing other requests. A request often goes through dozens of services, making worker threads spend a lot of time waiting for responses from downstream servers. Users have to launch hundreds of threads to maintain enough throughput, which imposes intensive pressure on scheduling and lowers the efficiency of TLS-related code. Tasks are often pushed into a queue protected by a global mutex and condition variable, which performs poorly when many threads are contending for it. A better approach is to deploy more task queues and adjust the scheduling algorithm to reduce global contention: each system thread has its own runqueue, and one or more schedulers dispatch user threads to different runqueues. A system thread runs user threads from its own runqueue before considering other runqueues, which is more complicated but more scalable than the global mutex+condition solution. This model also makes it easier to support NUMA.
## Asynchronous programming is difficult
Flow control in asynchronous programming is difficult even for experts. Any suspending operation, such as sleeping for a while or waiting for something to finish, implies that users have to save states explicitly and restore them in callbacks. Asynchronous code is often written as state machines. A few suspensions are troublesome but still manageable. The problem is that once a suspension occurs inside a condition, a loop or a sub-function, it's almost impossible to write a state machine that can be understood and maintained by many people, although the scenario is quite common in distributed systems, where a node often needs to interact with multiple other nodes simultaneously. In addition, if the wakeup can be triggered by more than one event (e.g. either the fd has data or the timeout is reached), the suspension and resumption are prone to race conditions, which require good multi-threaded programming skills to solve. Syntactic sugar (such as lambdas) makes coding less troublesome but does not reduce the difficulty.
## [RAII](http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization) cannot be used in asynchronous programming

RAII cannot be fully utilized in asynchronous programming. A common workaround is to use shared pointers, which seems convenient but also makes the ownership of memory elusive. If memory is leaked, it's difficult to locate the code that forgot to release it; if a segment fault happens, where the double-free occurred is equally unknown. Code with a lot of reference counting is hard to keep at good quality and may waste a lot of time on debugging memory-related issues. If references are even counted manually, keeping the quality of the code is harder still and maintainers become reluctant to modify it. The absence of RAII also makes synchronization primitives more error-prone: `lock_guard` cannot be used, and a lock acquired outside a callback has to be released inside the callback, which is very error-prone in practice.
## Multi-core scalability
When an event dispatcher passes a task to a worker thread, the user code probably jumps from one CPU core to another and may need to wait for synchronization of the relevant cachelines, which is not very fast and may take several microseconds. It would be better if the worker could run directly on the CPU core where the event dispatcher runs: at this time scale most systems do not have intensive event flows, so running an existing task as soon as possible has higher priority than getting new events from the dispatcher. Similarly, it's better to wake up a user thread blocking on an RPC on the same CPU core where the response is received.
# M:N threading library
This model maps M user threads onto N system threads (LWP). An M:N threading library is able to decide when and where to run a piece of code and when to end its execution, which is more flexible at scheduling than multi-threaded reactors. However, full-featured M:N threading libraries are difficult to implement and remain an active research topic. The M:N threading libraries we talk about here are specialized for building online services, in which case some requirements can be simplified, namely no (complete) preemption and no priorities. M:N threading libraries can be implemented either in userland or in the OS kernel. New programming languages prefer implementations in userland, such as GHC threads and goroutines, since they can add brand-new keywords and intercept threading-related APIs. Implementations for existing languages often have to modify the OS kernel, such as [Windows UMS](https://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx) and Google SwitchTo (which is 1:1; however, M:N effects can be achieved on top of it). Compared to N:1 threading libraries, M:N threading libraries are closer to system threads in usage and need locks or message passing to ensure thread safety. Let's see how the problems above are addressed in this model:

- Although an M:N threading library and a multi-threaded reactor are equivalent in capability, coding in a synchronous manner is significantly easier than in an event-driven manner, and most people can learn synchronous programming quickly.
- There is no need to split a function into several callbacks, and RAII can be used.
- When switching from user thread A to user thread B, B may be run on the core on which A was running while A is moved to another core, so that the latency-sensitive B is less affected by cache misses.