[中文版](../cn/atomic_instructions.md)

We know that locks are extensively used in multi-threaded programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying shared data. When a lock becomes a bottleneck, we try to work around it with atomic instructions. But writing correct code with atomic instructions is difficult in general, and it is even harder to understand race conditions, [ABA problems](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier). This article introduces some basics of atomic instructions (under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)). Since [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) are formally introduced in C++11, we use the C++11 APIs directly.

As the name implies, an atomic instruction cannot be divided into sub-instructions. For example, `x.fetch_add(n)` atomically adds n to x; no intermediate state is observable **to software**. Common atomic instructions are listed below:
| Atomic Instructions (type of x is std::atomic\<int\>) | Descriptions |
| ------------------------------------------------------ | ------------ |
| x.exchange(n) | set x to n and return the value just before the modification |
| x.compare_exchange_strong(expected_ref, desired) | if x is equal to expected_ref, set x to desired and return true; otherwise write the current value of x into expected_ref and return false |
| x.compare_exchange_weak(expected_ref, desired) | may fail [spuriously](http://en.wikipedia.org/wiki/Spurious_wakeup) compared to compare_exchange_strong |
| x.fetch_add(n), x.fetch_sub(n) | do x += n, x -= n atomically; return the value just before the modification |
You can already use these instructions to count things atomically, such as the number of operations performed by multiple threads (a minimal sketch follows the list below). However, two problems may arise:

- The operation is not as fast as expected.
- Even if multi-threaded accesses to some resources are controlled by a few atomic instructions that seem to be correct, the program still has a good chance to crash.
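Here is the counting case as a compilable sketch (names and numbers are illustrative, not from brpc): four threads bump a shared `std::atomic<int>`, and no increment is ever lost.

```c++
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> op_count(0);          // shared by all threads
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([&op_count] {
            for (int j = 0; j < 100000; ++j) {
                op_count.fetch_add(1);     // atomic += 1
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    printf("op_count=%d\n", op_count.load());  // always prints 400000
    return 0;
}
```

Correctness is easy to get here; the next section explains why such a counter is nevertheless slower than it looks.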
# Cacheline
An atomic instruction is fast when there are no contentions or it is accessed by only one thread. "Contentions" happen when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs extensively use caches and divide them into multiple levels to get high performance at a low price. The Intel E5-2620 widely used in Baidu has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is very fast for one core to write data into its own L1 cache (4 cycles, ~2ns), making the data in its L1 cache visible to other cores is not, because the cachelines touched by the data need to be synchronized to those cores. This process is atomic and transparent to software, and no instructions can be interleaved in between. Applications have to wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes much longer than writing the local cache. It involves a complicated hardware algorithm and makes atomic instructions slow under high contention: a single fetch_add may take more than 700ns on an E5-2620 when a few threads contend heavily on the instruction. Accesses to memory that is frequently shared and modified by multiple threads are generally not fast. For example, even if a critical section looks small, the spinlock protecting it may still not perform well, because the instructions used in spinlocks, such as exchange and fetch_add, need to wait for the latest cachelines. It is not surprising to see one or two instructions taking several microseconds.
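To make that concrete, here is a minimal test-and-set spinlock (an illustrative sketch, not the spinlock used in brpc). Every `exchange` has to pull the latest cacheline of the flag, which is exactly what hurts under heavy contention:

```c++
#include <atomic>

class SpinLock {
public:
    void lock() {
        // exchange returns the previous value: spin until we flipped 0 -> 1
        while (_locked.exchange(1)) {
            // busy-wait; each retry may force another cacheline transfer
        }
    }
    void unlock() {
        _locked.store(0);
    }

private:
    std::atomic<int> _locked{0};
};
```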
In order to improve performance, we need to avoid synchronizing cachelines frequently, which not only affects the performance of the atomic instruction itself, but also the overall performance of the program. The most effective solution is straightforward: **avoid sharing as much as possible**. Avoiding contentions from the very beginning is the best strategy:

- A program relying on a global multiple-producer-multiple-consumer (MPMC) queue is hard to scale well on many CPU cores, because the throughput of the queue is limited by delays of cache coherence rather than the number of cores. It is better to use multiple SPMC or MPSC queues, or even SPSC queues, to avoid contentions from the very beginning.
- Another example is counters. If all threads modify a counter frequently, performance will be poor because all cores are busy synchronizing the same cacheline. If the counter is only used for printing logs periodically or other low-priority purposes, we can let each thread modify its own thread-local variable and combine all thread-local data before reading, yielding [much better performance](bvar.md).
A related programming trap is false sharing: accesses to infrequently updated or even read-only variables are significantly slowed down because other variables in the same cacheline are frequently updated. Variables used in multi-threaded environments should be grouped by access frequency and pattern, and variables that are frequently modified by other threads should be put into separate cachelines. To align a variable or struct by cacheline, include `<butil/macros.h>` and tag it with the macro `BAIDU_CACHELINE_ALIGNMENT`; grep the source code of brpc for examples.
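A sketch combining the two ideas above, per-thread counters plus cacheline alignment. It is illustrative and far simpler than bvar; plain `alignas(64)` stands in for `BAIDU_CACHELINE_ALIGNMENT`, and the names are made up:

```c++
#include <atomic>

struct alignas(64) PerThreadCounter {
    std::atomic<long> value{0};   // one counter per cacheline: no false sharing
};

const int kMaxThreads = 64;
PerThreadCounter g_counters[kMaxThreads];

// Hot path, called by thread `tid`: touches only that thread's own cacheline.
inline void add_count(int tid, long n) {
    g_counters[tid].value.fetch_add(n, std::memory_order_relaxed);
}

// Cold path, called occasionally (e.g. when printing a log line).
inline long read_count() {
    long sum = 0;
    for (int i = 0; i < kMaxThreads; ++i) {
        sum += g_counters[i].value.load(std::memory_order_relaxed);
    }
    return sum;
}
```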
# Memory fence
Atomic counting alone cannot synchronize accesses to resources: simple structures that seem correct, such as a [spinlock](https://en.wikipedia.org/wiki/Spinlock) or [reference counting](https://en.wikipedia.org/wiki/Reference_counting), may crash as well. The key is **instruction reordering**, which may change the order of reads and writes: instructions behind may be moved to the front if there are no dependencies between them. Both the [compiler](http://preshing.com/20120625/memory-ordering-at-compile-time/) and the [CPU](https://en.wikipedia.org/wiki/Out-of-order_execution) may reorder.

The motivation is natural: the CPU wants to fill every cycle with instructions and execute as many instructions as possible within a given time. As the previous section says, an instruction loading memory may cost hundreds of nanoseconds waiting for cacheline synchronization. An efficient way to synchronize multiple cachelines is to move them together rather than one by one, so modifications to multiple variables made by one thread may become visible to another thread in a different order. On the other hand, different threads need different data, so synchronizing on demand is reasonable and may also change the order between cachelines.

If the first variable plays the role of a switch controlling accesses to the following variables, a problem arises: when these variables are synchronized to other CPU cores, the new values may become visible in a different order, and the switch may not be the first one updated. Other threads may then believe that the following variables are still valid while they are actually deleted or not yet initialized, causing undefined behavior. For example:
```c++
// Thread1
// ready was initialized to false
p.init();
ready = true;
```

```c++
// Thread2
if (ready) {
    p.bar();
}
```
From a human perspective, this code is correct: thread2 accesses `p` only when `ready` is true, which, according to the logic in thread1, happens only after `p` is initialized. But the code may not run as expected on multi-core machines:
- `ready = true` in thread1 may be reordered before `p.init()` by the compiler or the CPU, making thread2 see an uninitialized `p` while `ready` is true. The same reordering may happen in thread2 as well: some instructions in `p.bar()` may be reordered before the check of `ready`.
- Even if the above reordering does not happen, the cachelines of `ready` and `p` may be synchronized independently to the CPU core running thread2, so thread2 may still see an uninitialized `p` when `ready` is true.
Note: on x86/x64, `load` has acquire semantics and `store` has release semantics by default, so the code above may run correctly provided that reordering by the compiler is prevented.
With this simple example, you may get a glimpse of the complexity of programming with atomic instructions. To solve the reordering issue, the CPU and the compiler offer [memory fences](http://en.wikipedia.org/wiki/Memory_barrier) that let programmers decide the visibility order between instructions. boost and C++11 conclude memory fences into the following memory orders:
| memory order | Description |
| ------------ | ----------- |
| memory_order_relaxed | no synchronization or ordering constraints are imposed on other reads or writes; only the atomicity of this operation is guaranteed |
| memory_order_consume | no reads or writes in the current thread that depend on the value currently loaded can be reordered before this load |
| memory_order_acquire | no reads or writes in the current thread can be reordered before this load |
| memory_order_release | no reads or writes in the current thread can be reordered after this store |
| memory_order_acq_rel | both an acquire and a release: no reads or writes in the current thread can be reordered before the load or after the store |
| memory_order_seq_cst | any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order |
Using these memory orders, the example above can be fixed as follows:
```c++
// Thread1
// ready is a std::atomic<bool> initialized to false
p.init();
ready.store(true, std::memory_order_release);
```

```c++
// Thread2
if (ready.load(std::memory_order_acquire)) {
    p.bar();
}
```
The acquire load in thread2 matches the release store in thread1: once thread2 sees `ready` being set to true, it also sees all memory operations that happened before the release store in thread1.
Note that memory fences do not guarantee visibility: even if thread2 reads `ready` just after thread1 sets it to true, thread2 is not guaranteed to see the new value, because cache synchronization takes time. Memory fences guarantee the order of visibility: "if I see the new value of a, I must see the new value of b as well".
A related problem: how does a program know whether a value is new or old? Generally there are two cases:
- The value is special. In the example above, `ready=true` is a special value: once thread2 sees true, it knows `p` is initialized. Both observing the special value and not observing it mean something.
- Increasing-only values. In some situations there are no special values, and we can use instructions like `fetch_add` to increase a variable monotonically. As long as the value range is large enough, new values differ from old ones for a long period of time, so we can distinguish them from each other (see the sketch after this list).
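A small sketch of the increasing-only idea, with made-up names: every update bumps a version that never goes backwards, so a reader can tell whether anything has changed since the version it already holds.

```c++
#include <atomic>

std::atomic<long> g_version(0);   // monotonically increasing, never reused

// Writer side: bump the version after each logical update is published.
void publish_update() {
    g_version.fetch_add(1, std::memory_order_release);
}

// Reader side: returns true iff something changed since *last_seen,
// and remembers the version it just observed.
bool changed_since(long* last_seen) {
    const long v = g_version.load(std::memory_order_acquire);
    if (v == *last_seen) {
        return false;
    }
    *last_seen = v;
    return true;
}
```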
More examples can be found in [boost.atomic](http://www.boost.org/doc/libs/1_56_0/doc/html/atomic/usage_examples.html). Official descriptions of std::atomic can be found [here](http://en.cppreference.com/w/cpp/atomic/atomic).
# wait-free & lock-free
Atomic instructions can provide two important properties: [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) and [lock-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom). Wait-free means that no matter how the OS schedules threads, every thread is doing useful work; lock-free, which is weaker than wait-free, means that no matter how the OS schedules threads, at least one thread is doing useful work. If locks are used, the thread holding the lock may be scheduled out by the OS, in which case all threads waiting for the lock are blocked, so code using locks is neither lock-free nor wait-free. To guarantee that a task is always done within a given time, critical paths in real-time OSes are at least lock-free. Miscellaneous online services inside Baidu also pose serious restrictions on running time: if the critical path in brpc is wait-free or lock-free, many services benefit from better and more stable quality of service. In fact, both read (in the sense of event dispatching) and write in brpc are wait-free; see [IO](io.md) for more.
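As a small illustration (a sketch, not code from brpc), here is a lock-free update of a shared maximum. Some thread always makes progress, so the algorithm is lock-free, but a particular thread may retry indefinitely if others keep winning the CAS, so it is not wait-free:

```c++
#include <atomic>

std::atomic<int> g_max(0);

// Record `value` into g_max if it is larger than the current maximum.
void update_max(int value) {
    int cur = g_max.load(std::memory_order_relaxed);
    while (cur < value &&
           !g_max.compare_exchange_weak(cur, value, std::memory_order_relaxed)) {
        // On failure (including spurious failures) `cur` is reloaded with the
        // latest value of g_max, so we simply re-check and retry.
    }
}
```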
Note that it is common to think that wait-free or lock-free algorithms are faster, which may not be true, because:
- More complex race conditions and ABA problems must be handled in lock-free and wait-free algorithms, so the code is often much more complicated than the equivalent one using locks. More code, more running time.
- Mutexes deal with contention by backoff: when contention happens, a waiting thread takes another path to avoid contending fiercely. Threads that fail to lock a mutex are put to sleep, avoiding extra cacheline synchronization and letting the thread holding the mutex complete its task (or even several following tasks) exclusively, which may increase the overall throughput.
Low performance caused by mutexes is due either to overly large critical sections (which limit concurrency) or to overly heavy contention (where the overhead of context switches becomes dominant and adaptive mutexes are worth considering). The real value of lock-free/wait-free algorithms is that they guarantee progress of one thread or all threads, rather than absolute high performance. Of course lock-free/wait-free algorithms may perform better in some situations: if an algorithm can be implemented by just one or two atomic instructions, it is probably faster than one using a mutex, since the mutex itself is also implemented with atomic instructions.
With the growth of the number of business products, the access pattern to downstream services becomes increasingly complicated and often contains multiple simultaneous RPCs or subsequent asynchronous ones. These can easily introduce very tricky bugs in multi-threaded environments, of which users may not even be aware, and which are difficult to debug and reproduce. Moreover, implementations may not provide full support for various access patterns, in which case you have to write your own. Take semi-synchronous RPC as an example, which means waiting for multiple asynchronous RPCs to complete. A common implementation for synchronous access issues multiple requests asynchronously and waits for their completion, while the implementation for asynchronous access uses a callback with a counter: each time an asynchronous RPC finishes, the counter is decremented, and when it reaches zero the callback is called (a sketch of such a counter appears after the list below). Now let's analyze their weaknesses:
- The code is inconsistent between the synchronous pattern and the asynchronous one, and it is difficult for users to move from one pattern to the other. From the design point of view, the inconsistency suggests a loss of essence.
- Cancellation is generally not supported. It is not easy to cancel a single RPC in time correctly, let alone a combination of accesses. Most implementations do not support cancellation of a combo access, yet it is a must for speed-up techniques such as backup requests.
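A hedged sketch (not brpc's actual API) of the counter-based callback mentioned above: whichever RPC finishes last, and only that one, runs the final callback.

```c++
#include <atomic>
#include <functional>
#include <utility>

class JoinCallback {
public:
    JoinCallback(int nrpc, std::function<void()> done)
        : _left(nrpc), _done(std::move(done)) {}

    // Called once by each asynchronous RPC when it finishes, from any thread.
    void OneRpcDone() {
        if (_left.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            _done();  // this call saw the counter drop to zero: all RPCs ended
        }
    }

private:
    std::atomic<int> _left;
    std::function<void()> _done;
};
```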
In the production environment, we gradually increase the number of instances of the 4-partition scheme while terminating instances of the 3-partition scheme. `DynamicPartitionChannel` divides the traffic according to the capacities of all partitions dynamically. When the capacity of the 3-partition scheme drops to 0, we have smoothly migrated all servers from the 3-partition scheme to the 4-partition one without changing the client's code.
There are three mechanisms to operate IO in general:

- Blocking IO: once an IO operation is issued, the calling thread is blocked until the IO finishes. This is a kind of synchronous IO, such as the default behaviors of POSIX [read](http://linux.die.net/man/2/read) and [write](http://linux.die.net/man/2/write).
- Non-blocking IO: if there is nothing to read or too much to write, APIs that would otherwise block return immediately with an error code. Non-blocking IO is often used together with IO multiplexing ([poll](http://linux.die.net/man/2/poll), [select](http://linux.die.net/man/2/select), [epoll](http://linux.die.net/man/4/epoll) in Linux or [kqueue](https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2) in BSD), as sketched after this list.
- Asynchronous IO: start a read or write operation with a callback, which is invoked when the IO is done, such as [OVERLAPPED](https://msdn.microsoft.com/en-us/library/windows/desktop/ms684342(v=vs.85).aspx) + [IOCP](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx) in Windows. Native AIO in Linux only supports files.
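A sketch of the non-blocking side in Linux (illustrative helpers, not brpc code): the fd is switched to `O_NONBLOCK`, and a read that returns `EAGAIN` simply means there is nothing to read for now, at which point the caller goes back to poll/select/epoll and waits.

```c++
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

// Switch a file descriptor to non-blocking mode.
int make_non_blocking(int fd) {
    const int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Try to read once. Returns bytes read, 0 on EOF, -1 on a real error,
// and -2 when there is nothing to read for now (the caller should go
// back to the event loop and wait for the next readable event).
ssize_t read_once(int fd, char* buf, size_t len) {
    const ssize_t n = read(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return -2;
    }
    return n;
}
```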
Non-blocking IO is usually used for increasing IO concurrency in Linux. When the IO concurrency is low, non-blocking IO is not necessarily more efficient than blocking IO, which is handled completely by the kernel: system calls like read/write are highly optimized and probably more efficient. But when IO concurrency increases, the drawback of blocking one thread per IO arises: the kernel keeps switching between threads to get useful work done, a CPU core may only do a little bit of work before being replaced by another thread, and CPU caches are not fully utilized. In addition, a large number of threads decreases the performance of code depending on thread-local variables, such as tcmalloc; once malloc slows down, the overall performance of the program decreases as well. In contrast, non-blocking IO is typically composed of a relatively small number of event-dispatching threads and worker threads (running user code), which are reused by different tasks (in other words, part of the scheduling work is moved to userland). Event dispatchers and workers run on different CPU cores simultaneously without frequent switches inside the kernel. There is no need to have many threads, so the use of thread-local variables is also more adequate. All these factors make non-blocking IO faster than blocking IO. But non-blocking IO has its own problems, one of which is more system calls, such as [epoll_ctl](http://man7.org/linux/man-pages/man2/epoll_ctl.2.html). Since epoll is implemented as a red-black tree, epoll_ctl is not a very fast operation, especially in multi-threaded environments; implementations heavily dependent on epoll_ctl are often confronted with multi-core scalability issues. Non-blocking IO also has to solve a lot of multi-threading problems, producing more complex code than blocking IO.
# Receiving messages
A message is a bounded piece of binary data read from a connection, which may be a request from upstream clients or a response from downstream servers. brpc uses one or several [EventDispatcher](https://github.com/brpc/brpc/blob/master/src/brpc/event_dispatcher.cpp)s (referred to as EDISP) to wait for events on file descriptors. Unlike the common "IO threads", EDISP is not responsible for reading or writing. The problem with IO threads is that one thread can only read one fd at a given time, so other reads are delayed when many fds in one IO thread are busy. Multi-tenancy, complicated load balancing and [Streaming RPC](streaming_rpc.md) worsen the problem. Under high workloads, occasional long delays on one fd may slow down reads from all other fds in the same IO thread, causing longer tail latencies.
Because of a [bug](https://patchwork.kernel.org/patch/1970231/) in epoll (at the time of developing brpc) and the overhead of epoll_ctl, edge-triggered mode is used in EDISP. After receiving an event, an atomic variable associated with the fd is incremented by one. If the variable was zero before the increment, a bthread is started to handle the data from the fd. The pthread worker in which EDISP runs yields to the newly created bthread to make it start reading ASAP and get better cache locality; the bthread in which EDISP runs will be stolen by another pthread and keeps running. This is the work stealing mechanism used in bthreads. To understand exactly how that atomic variable works, you can read [atomic instructions](atomic_instructions.md) first and then check [Socket::StartInputEvent](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions on dispatching events of one fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
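A hedged, heavily simplified sketch of that dispatching idea (not the actual code in Socket::StartInputEvent; names and the direct function call are made up): every edge-triggered event bumps a per-fd counter, only the transition from 0 to 1 starts a handler, and the handler keeps draining until it manages to reset the counter to zero.

```c++
#include <atomic>

struct FdState {
    int fd;
    std::atomic<int> nevents{0};
};

// Placeholder for "read the fd and process messages until read() returns
// EAGAIN"; details are omitted in this sketch.
void drain(FdState* s) { (void)s; }

void handle_fd(FdState* s);

// Called by the event dispatcher for every edge-triggered event on the fd.
void on_event(FdState* s) {
    if (s->nevents.fetch_add(1, std::memory_order_acq_rel) == 0) {
        handle_fd(s);  // we were first; brpc would start a bthread here instead
    }
    // Otherwise a handler is already running and will notice the new count.
}

void handle_fd(FdState* s) {
    int seen = 1;
    while (true) {
        drain(s);
        // Quit only if no new event arrived while draining. On failure `seen`
        // is refreshed with the latest count and we drain once more.
        if (s->nevents.compare_exchange_strong(seen, 0,
                                               std::memory_order_acq_rel)) {
            return;
        }
    }
}
```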
[InputMessenger](https://github.com/brpc/brpc/blob/master/src/brpc/input_messenger.h) cuts messages and uses customizable callbacks to handle different formats of data. The `Parse` callback cuts messages from the binary data and has a relatively stable running time; the `Process` callback parses messages further (such as parsing by protobuf) and calls users' callbacks, whose running time varies. If n (n > 1) messages are read from the fd, InputMessenger launches n-1 bthreads to handle the first n-1 messages respectively and processes the last message in place. InputMessenger tries protocols one by one; since one connection usually carries only one type of message, InputMessenger remembers the current protocol to avoid trying other protocols next time.
It can be seen that messages from different fds, or even from the same fd, are processed concurrently in brpc, which makes brpc good at handling large messages and reduces long tails when processing messages from different sources under high workloads.
Note that the actual number of commands processed per second by redis-server is 10 times the QPS value, which is about 400K. When thread_num is 50 or higher, the CPU usage of redis-server reaches its limit. Since redis-server runs in [single-threaded reactor mode](threading_overview.md#单线程reactor), 99.9% of one core is the maximum CPU it can use.
Now start a client on the same machine to send requests to redis-server synchronously using 50 bthreads through a connection pool.
A thread or process handles all the messages from one fd and does not quit until the connection is closed. When the number of connections increases, the resources occupied by threads/processes and the cost of context switches grow steadily, causing poor performance. This is the origin of the [C10K](http://en.wikipedia.org/wiki/C10k_problem) problem. These two methods (one thread or one process per connection) were common in early web servers and are rarely used today.
## Single-threaded reactor
Event-loop libraries such as [libevent](http://libevent.org/) and [libev](http://software.schmorp.de/pkg/libev.html) are typical examples of the [reactor pattern](http://en.wikipedia.org/wiki/Reactor_pattern). Usually an event dispatcher waits for different kinds of events and calls the corresponding event handler in place when an event happens; after the handler returns, the dispatcher waits for more events, hence the "loop". Essentially, all handlers are executed in one system thread in the order of occurrence. One event loop can use only one core, so this kind of program is either IO-bound or has handlers with short and predictable running time (such as HTTP servers); otherwise one callback blocks the whole program and causes high latencies. In practice this kind of program is not suitable for development by many people, because performance may degrade significantly if not enough attention is paid. The extensibility of event-loop programs depends on multiple processes.
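The skeleton of such a loop, as a minimal epoll-based sketch (illustrative, not code from libevent or libev; `handle_event` is a made-up placeholder):

```c++
#include <sys/epoll.h>

// Placeholder for running the handler registered for this event in place.
void handle_event(const epoll_event& ev) { (void)ev; }

void run_event_loop(int epfd) {
    epoll_event events[64];
    while (true) {
        const int n = epoll_wait(epfd, events, 64, -1);  // wait for events
        for (int i = 0; i < n; ++i) {
            // Handlers run sequentially in this single thread; a slow handler
            // delays every handler queued behind it.
            handle_event(events[i]);
        }
    }
}
```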
## N:1 thread library
Generally, N user threads are mapped into one system thread (LWP), and only one user thread runs at a time, such as [GNU Pth](http://www.gnu.org/software/pth/pth-manual.html) and [StateThreads](http://state-threads.sourceforge.net/index.html). When a blocking function is called, the current user thread is scheduled out. This is also known as [Fiber](http://en.wikipedia.org/wiki/Fiber_(computer_science)). An N:1 thread library is equivalent to a single-threaded reactor, except that event callbacks are replaced by independent stacks and registers, and running a callback becomes jumping to the corresponding context. Since all user code runs in one system thread, N:1 thread libraries do not produce complex race conditions, and some scenarios do not need locks. Like event-loop libraries, an N:1 thread library can use only one core, so it cannot exploit multi-core performance and is only suitable for specific scenarios. On the other hand, it reduces jumps between different cores and, by giving up the independent signal mask, makes context switches fast (100~200ns). Generally, the performance of N:1 thread libraries is as good as event loops, and their extensibility also depends on multiple processes.
## Multi-threaded reactor
Kylin and [boost::asio](http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio.html) are typical examples. Generally the event dispatcher runs in one or several threads and schedules event handlers to worker threads after events happen. Since SMP machines are widely used in Baidu, a structure like this that uses multiple cores is more suitable, and exchanging messages between threads is simpler than between processes, so it often makes the multi-core load more uniform. However, due to cache coherence restrictions, the multi-threaded reactor model does not scale linearly with the number of cores. In particular scenarios, a rough multi-threaded reactor running on a 24-core machine is not even faster than a dedicated single-threaded reactor implementation. The reactor has a proactor variant, which uses asynchronous IO instead of an event dispatcher; boost::asio is a proactor on [Windows](http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx).