We know that locks are extensively used in multi-threaded programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying shared data. When a lock becomes a bottleneck, we try to work around it by using atomic instructions. But it is difficult to write correct code with atomic instructions in general, and it is even harder to understand race conditions, [ABA problems](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier). This article introduces the basics of atomic instructions (under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)). Since [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) were formally introduced in C++11, we use the C++11 APIs directly.
As the name implies, an atomic instruction cannot be divided into smaller sub-instructions. For example, `x.fetch_add(n)` atomically adds n to x, and no intermediate state is observable **to software**. Common atomic instructions are listed below (a brief usage sketch follows the list):
...
...
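As a concrete illustration of atomic counting with the C++11 API, here is a minimal example; the thread count and iteration count are arbitrary:

```c++
// Two threads increment one counter without a lock; the final value is
// always 200000 because fetch_add is an indivisible read-modify-write.
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> counter(0);
    auto add_100k = [&counter] {
        for (int i = 0; i < 100000; ++i) {
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t1(add_100k);
    std::thread t2(add_100k);
    t1.join();
    t2.join();
    printf("counter=%d\n", counter.load());  // always 200000
    return 0;
}
```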
A related programming trap is false sharing: accesses to infrequently updated or even read-only variables are significantly slowed down because other variables in the same cacheline are frequently modified, forcing the CPU to keep synchronizing the whole cacheline between cores.
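For illustration, a minimal sketch of the usual remedy, assuming 64-byte cachelines: align each frequently-updated variable so that it occupies its own cacheline.

```c++
// Each per-thread counter gets its own cacheline, so updates from one thread
// do not invalidate the cachelines read by the others.
#include <atomic>
#include <thread>

struct alignas(64) PaddedCounter {       // one counter per 64-byte cacheline
    std::atomic<long> value{0};
};

PaddedCounter counters[4];               // without alignas, these counters
                                         // would share cachelines and contend

int main() {
    std::thread threads[4];
    for (int i = 0; i < 4; ++i) {
        threads[i] = std::thread([i] {
            for (int j = 0; j < 1000000; ++j) {
                counters[i].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) t.join();
    return 0;
}
```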
# Memory fence
Atomic counting alone cannot synchronize accesses to resources: simple structures such as a [spinlock](https://en.wikipedia.org/wiki/Spinlock) or [reference counting](https://en.wikipedia.org/wiki/Reference_counting) that seem correct may still crash. The key is **instruction reordering**: reads and writes may be reordered, and later instructions may be moved ahead of earlier ones when there are no dependencies between them. Both the [compiler](http://preshing.com/20120625/memory-ordering-at-compile-time/) and the [CPU](https://en.wikipedia.org/wiki/Out-of-order_execution) may reorder instructions.
The motivation is natural: the CPU wants to fill every cycle and execute as many instructions as possible within a given time. As the section above says, an instruction that loads memory may cost hundreds of nanoseconds while the cacheline is being synchronized. An efficient way to synchronize multiple cachelines is to move them simultaneously rather than one by one. Thus modifications to multiple variables by one thread may become visible to another thread in a different order. On the other hand, different threads need different data, so synchronizing on demand is reasonable and may also change the order between cachelines.
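To make the spinlock mentioned above concrete, here is a minimal sketch built on C++11 atomics. It illustrates why the memory order matters rather than the implementation of any particular lock.

```c++
// test_and_set(acquire) keeps the critical section from floating above the
// lock, and clear(release) keeps it from sinking below the unlock. With
// relaxed ordering instead, the compiler/CPU could reorder the protected
// accesses outside the lock and the "correct-looking" spinlock would break.
#include <atomic>

class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin until the previous holder clears the flag
        }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};
```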
...
...
Note that it is common to think that wait-free or lock-free algorithms are faster than those using locks, but that is not necessarily true, because:
- More complex race conditions and ABA problems must be handled in lock-free and wait-free algorithms, so the code is often much more complicated than the equivalent code using locks. More code, more running time.
- A mutex handles contention by backing off: when contention happens, another path is chosen to avoid it temporarily. Threads that fail to lock a mutex are put to sleep, letting the thread holding the mutex complete the current task, or even several following tasks, exclusively, which may increase the overall throughput.
Low performance caused by a mutex comes either from critical sections that are too large (which limit concurrency) or from contention that is too heavy (the overhead of context switches becomes dominant). The real value of lock-free/wait-free algorithms is that they guarantee progress of one thread or of all threads, rather than absolutely high performance. Of course lock-free/wait-free algorithms perform better in some situations: if an algorithm is implemented with just one or two atomic instructions, it is probably faster than one using a mutex, which is itself implemented with atomic instructions.
With the growth of the number of business products, the access pattern to downstream services becomes increasingly complicated, often involving multiple simultaneous RPCs or subsequent asynchronous ones. However, these can easily introduce very tricky bugs in a multi-threaded environment, of which users may not even be aware, and which are difficult to debug and reproduce. Moreover, implementations may not fully support all access patterns, in which case you have to write your own. Take semi-synchronous RPC as an example, which means waiting for multiple asynchronous RPCs to complete. A common implementation for synchronous access issues multiple requests asynchronously and waits for their completion, while the implementation for asynchronous access uses a callback with a counter: each time an asynchronous RPC finishes, the counter is decremented, and when it reaches zero the callback is called (see the sketch after the list below). Now let's analyze their weaknesses:
- The code is inconsistent between the synchronous pattern and the asynchronous one, and it's difficult for users to move from one pattern to the other. From a design point of view, the inconsistency suggests a loss of essence.
- Cancellation is generally not supported. It's not easy to cancel a single RPC in time correctly, let alone a combination of accesses. Most implementations do not support cancellation of a combo access. However, it is a must for some speed-up techniques such as backup requests.
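For illustration, here is a minimal sketch of the counter-based callback described above. The class and method names are hypothetical and do not come from any specific RPC library.

```c++
// Each finished sub-RPC decrements the counter; the decrement that takes the
// counter from 1 to 0 invokes the user callback exactly once.
#include <atomic>
#include <functional>

class JoinCallback {
public:
    JoinCallback(int sub_rpc_count, std::function<void()> on_all_done)
        : pending_(sub_rpc_count), on_all_done_(std::move(on_all_done)) {}

    // Called from each sub-RPC's completion, possibly on different threads.
    void OnSubRpcDone() {
        // fetch_sub returns the previous value.
        if (pending_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            on_all_done_();
        }
    }

private:
    std::atomic<int> pending_;
    std::function<void()> on_all_done_;
};
```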
In the production environment, we gradually increase the number of instances on the 4-partition scheme while terminating instances on the 3-partition scheme. `DynamicPartitionChannel` divides the traffic according to the capacity of all partitions dynamically. When the capacity of the 3-partition scheme drops to 0, we have smoothly migrated all servers from the 3-partition scheme to the 4-partition one without changing the client's code.
Note that the actual number of commands processed per second by redis-server is 10 times the QPS value, namely about 400K. When thread_num is 50 or higher, the CPU usage of redis-server reaches its limit. Since redis-server runs in [single-threaded reactor mode](threading_overview.md#单线程reactor), 99.9% of one core is the maximum CPU it can use.
Now start a client on the same machine that sends requests to redis-server synchronously from 50 bthreads through a connection pool.
A thread/process handles all the messages from one fd and quits when the connection is closed. When the number of connections increases, the resources occupied by threads/processes and the cost of context switching grow larger and larger, resulting in poor performance, which is the source of the [C10K](http://en.wikipedia.org/wiki/C10k_problem) problem. These two methods were common in early web servers and are rarely used today.
## Single-threaded reactor
Event-loop libraries such as [libevent](http://libevent.org/) and [libev](http://software.schmorp.de/pkg/libev.html) are typical examples of the [reactor pattern](http://en.wikipedia.org/wiki/Reactor_pattern). Usually an event dispatcher is responsible for waiting for different kinds of events and calling the event handler in place when an event happens. After the handler returns, the dispatcher waits for more events, hence the "loop". Essentially all handler functions are executed in one system thread, in the order in which events occur. One event loop can use only one core, so this kind of program is either IO-bound or each handler has a short and fixed running time (such as an HTTP server); otherwise one callback blocks the whole program and causes high latencies. In practice this kind of program is not suitable for projects with many developers involved, because performance may degrade significantly if not enough attention is paid. The extensibility of an event-loop program depends on multiple processes.
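For illustration, a minimal sketch of a single-threaded event loop built directly on epoll. It only shows the dispatch-then-handle-in-place structure described above, not how libevent or libev are implemented.

```c++
// One thread waits for readable fds and runs each registered handler in situ.
// A slow handler blocks every other fd registered in this loop.
#include <sys/epoll.h>
#include <unistd.h>
#include <functional>
#include <unordered_map>

class EventLoop {
public:
    EventLoop() : epfd_(epoll_create1(0)) {}
    ~EventLoop() { close(epfd_); }

    // Register a handler that is called when `fd` becomes readable.
    void AddReadHandler(int fd, std::function<void()> handler) {
        handlers_[fd] = std::move(handler);
        epoll_event ev = {};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        epoll_ctl(epfd_, EPOLL_CTL_ADD, fd, &ev);
    }

    // The "loop": wait for events, run handlers sequentially, repeat.
    void Run() {
        epoll_event events[64];
        while (true) {
            const int n = epoll_wait(epfd_, events, 64, -1);
            for (int i = 0; i < n; ++i) {
                handlers_[events[i].data.fd]();  // handler runs in place
            }
        }
    }

private:
    int epfd_;
    std::unordered_map<int, std::function<void()>> handlers_;
};
```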
...
...
## N:1 thread library
Generally, N user threads are mapped onto one system thread (LWP) and only one user thread can run at a time, as in [GNU Pth](http://www.gnu.org/software/pth/pth-manual.html) and [StateThreads](http://state-threads.sourceforge.net/index.html). When a blocking function is called, the current user thread yields. This is also known as a [Fiber](http://en.wikipedia.org/wiki/Fiber_(computer_science)). An N:1 thread library is equivalent to a single-threaded reactor: the event callback is replaced by an independent stack and registers, and running a callback becomes jumping to the corresponding context. Since all the logic runs in one system thread, an N:1 thread library does not produce complex race conditions, and some scenarios do not require a lock. Because only one core can be used, just like an event-loop library, an N:1 thread library cannot exploit multi-core performance and is only suitable for some specific scenarios. But it also reduces jumps between different cores, and together with giving up the independent signal mask, a context switch can be done quickly (100~200ns). Generally, the performance of an N:1 thread library is as good as an event loop and its extensibility also depends on multiple processes.
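For illustration, a minimal sketch of "jumping to the corresponding context" with an independent stack, using POSIX ucontext. Real N:1 libraries such as GNU Pth and StateThreads use their own faster switching mechanisms.

```c++
// Switch between the main context and one fiber-like context with its own
// stack; each swapcontext saves the current registers and jumps to the other.
#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[64 * 1024];       // the fiber's independent stack

static void fiber_body() {
    printf("in fiber\n");
    swapcontext(&fiber_ctx, &main_ctx);   // "yield" back to main
    printf("fiber resumed\n");
    // returning ends the fiber; uc_link brings control back to main_ctx
}

int main() {
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = fiber_stack;
    fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
    fiber_ctx.uc_link = &main_ctx;        // where to go when fiber_body returns
    makecontext(&fiber_ctx, fiber_body, 0);

    swapcontext(&main_ctx, &fiber_ctx);   // jump into the fiber
    printf("back in main\n");
    swapcontext(&main_ctx, &fiber_ctx);   // resume the fiber
    printf("fiber finished\n");
    return 0;
}
```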
## Multi-threaded reactor
Kylin and [boost::asio](http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio.html) are typical examples. Generally the event dispatcher is run by one or several threads and schedules the event handler to a worker thread after an event happens. Since SMP machines are widely used in Baidu, a structure like this that uses multiple cores is more suitable, and exchanging messages between threads is simpler than between processes, so it often makes the multi-core load more uniform. However, due to cache coherence restrictions, the multi-threaded reactor model does not scale linearly with the number of cores. In particular scenarios, a rough multi-threaded reactor running on a 24-core machine is not even faster than a single-threaded reactor with a dedicated implementation. Reactor has a proactor variant, which uses asynchronous IO to replace the event dispatcher. boost::asio is a proactor on [Windows](http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx).
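For illustration, a minimal sketch of the dispatch-to-worker structure: a dispatcher thread hands ready handlers to a pool of workers instead of running them in place. This only shows the structure, not Kylin's or boost::asio's design.

```c++
// A dispatcher calls Schedule() when an event is ready; worker threads pull
// handlers from a shared queue and run them, spreading the load over cores.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(int n) {
        for (int i = 0; i < n; ++i) {
            workers_.emplace_back([this] {
                while (true) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lk(mu_);
                        cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // the event handler runs on a worker thread
                }
            });
        }
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    // Called by the dispatcher thread when an event becomes ready.
    void Schedule(std::function<void()> handler) {
        { std::lock_guard<std::mutex> lk(mu_); tasks_.push(std::move(handler)); }
        cv_.notify_one();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};
```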