Commit 684b4bd9 authored by zyearn's avatar zyearn

complete translating atomic_instructions.md

parent 3a61f5ea
......@@ -29,7 +29,7 @@
# Memory fence
仅靠原子累加实现不了对资源的访问控制,即使简单如[spinlock](https://en.wikipedia.org/wiki/Spinlock)[引用计数](https://en.wikipedia.org/wiki/Reference_counting),看上去正确的代码也可能会crash。这里的关键在于**重排指令**导致了读写一致性的变化。只要没有依赖,代码中在后面的指令(包括访存)就可能跑到前面去,[编译器](http://preshing.com/20120625/memory-ordering-at-compile-time/)[CPU](https://en.wikipedia.org/wiki/Out-of-order_execution)都会这么做。这么做的动机非常自然,CPU要尽量塞满每个cycle,在单位时间内运行尽量多的指令。一个核心访问自己独有的cache是很快的,所以它能很好地管理好一致性问题。当软件依次写入a,b,c后,它能以a,b,c的顺序依次读到,哪怕在CPU层面是完全并发运行的。当代码只运行于单线程中时,重排对软件是透明的。但在多核环境中,这就不成立了。如上节中提到的,访存在等待cacheline同步时要花费数百纳秒,最高效地自然是同时同步多个cacheline,而不是一个个做。一个线程在代码中对多个变量的依次修改,可能会以不同的次序同步到另一个线程所在的核心上,CPU也许永远无法保证这个顺序如同TCP那样,有序修改有序读取,因为不同线程对数据的需求顺序是不同的,按需访问更合理(从而导致同步cacheline的序和写序不同)。如果其中第一个变量扮演了开关的作用,控制对后续变量对应资源的访问。那么当这些变量被一起同步到其他核心时,更新顺序可能变了,第一个变量未必是第一个更新的,其他线程可能还认为它代表着其他变量有效,而去访问了已经被删除的资源,从而导致未定义的行为。比如下面的代码片段:
仅靠原子累加实现不了对资源的访问控制,即使简单如[spinlock](https://en.wikipedia.org/wiki/Spinlock)[引用计数](https://en.wikipedia.org/wiki/Reference_counting),看上去正确的代码也可能会crash。这里的关键在于**重排指令**导致了读写顺序的变化。只要没有依赖,代码中在后面的指令(包括访存)就可能跑到前面去,[编译器](http://preshing.com/20120625/memory-ordering-at-compile-time/)[CPU](https://en.wikipedia.org/wiki/Out-of-order_execution)都会这么做。这么做的动机非常自然,CPU要尽量塞满每个cycle,在单位时间内运行尽量多的指令。一个核心访问自己独有的cache是很快的,所以它能很好地管理好一致性问题。当软件依次写入a,b,c后,它能以a,b,c的顺序依次读到,哪怕在CPU层面是完全并发运行的。当代码只运行于单线程中时,重排对软件是透明的。但在多核环境中,这就不成立了。如上节中提到的,访存在等待cacheline同步时要花费数百纳秒,最高效地自然是同时同步多个cacheline,而不是一个个做。一个线程在代码中对多个变量的依次修改,可能会以不同的次序同步到另一个线程所在的核心上,CPU也许永远无法保证这个顺序如同TCP那样,有序修改有序读取,因为不同线程对数据的需求顺序是不同的,按需访问更合理(从而导致同步cacheline的序和写序不同)。如果其中第一个变量扮演了开关的作用,控制对后续变量对应资源的访问。那么当这些变量被一起同步到其他核心时,更新顺序可能变了,第一个变量未必是第一个更新的,其他线程可能还认为它代表着其他变量有效,而去访问了已经被删除的资源,从而导致未定义的行为。比如下面的代码片段:
```c++
// Thread 1
......@@ -49,9 +49,9 @@ if (ready) {
- 线程1中的ready = true可能会被编译器或cpu重排到p.init()之前,从而使线程2看到ready为true时,p仍然未初始化。
- 即使没有重排,ready和p的值也会独立地同步到线程2所在核心的cache,线程2仍然可能在看到ready为true时看到未初始化的p。这种情况同样也会在线程2中发生,比如p.bar()中的一些代码被重排到检查ready之前。
注:x86的load带acquire语意,store带release语意,上面的代码刨除编译器因素可以正确运行。
注:x86的load带acquire语意,store带release语意,上面的代码刨除编译器和CPU因素可以正确运行。
通过这个简单例子,你可以窥见原子指令编程的复杂性了吧。为了解决这个问题,CPU提供了[memory fence](http://en.wikipedia.org/wiki/Memory_barrier),让用户可以声明访存指令间的可见性(visibility)关系,boost和C++11对memory fencing做了抽象,总结为如下几种[memory order](http://en.cppreference.com/w/cpp/atomic/memory_order).
通过这个简单例子,你可以窥见原子指令编程的复杂性了吧。为了解决这个问题,CPU提供了[memory fence](http://en.wikipedia.org/wiki/Memory_barrier),让用户可以声明访存指令间的可见性(visibility)关系,boost和C++11对memory fence做了抽象,总结为如下几种[memory order](http://en.cppreference.com/w/cpp/atomic/memory_order).
| memory order | 作用 |
| -------------------- | ---------------------------------------- |
......
We all know that locks are needed in multi-thread programming to avoid potential [race condition](http://en.wikipedia.org/wiki/Race_condition) when modifying the same data. But In practice, it is difficult to write correct codes using atomic instructions. It is hard to understand race condition, [ABA problem]((https://en.wikipedia.org/wiki/ABA_problem), [memory fence](https://en.wikipedia.org/wiki/Memory_barrier). This artical is to help you get started by introducing atomic instructions under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). [Atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) are formally introduced in C++11.
We all know that locks are needed in multi-thread programming to avoid potential [race condition](http://en.wikipedia.org/wiki/Race_condition) when modifying the same data. But In practice, it is difficult to write correct codes using atomic instructions. It is hard to understand race condition, [ABA problem](https://en.wikipedia.org/wiki/ABA_problem), [memory fence](https://en.wikipedia.org/wiki/Memory_barrier). This artical is to help you get started by introducing atomic instructions under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). [Atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) are formally introduced in C++11.
As the name suggests, atomic instructions cannot be divided into sub-instructions. For example, `x.fetch(n)` atomically adds n to x, any internal state will not be observed. Common atomic instructions include:
......@@ -18,9 +18,82 @@ You can already use these instructions to do atomic counting, such as multiple t
# Cacheline
An atomic instruction is relatively fast when there is not contention or only one thread accessing it. Contention happens when there are multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPU extensively use cache and divide cache into multi-level to get high performance at a low price. The widely used cpu in Baidu which is Intel E5-2620 has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 cache is owned by each core, while L3 cache is shared by all cores. Althouth it is fast for one core to write data into its own L1 cache(4 cycles, 2ns), the data in L1 cache should be also seen by another core when it needs writing or reading from corresponding address. To application, this process is atomic and no instructions can be interleaved. Application must wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes longer time compared to other operations. It involves a complicated algorithm which takes approximately 700ns in E5-2620 when highly contented. So it is slow to access the memory shared by multiple threads.
An atomic instruction is relatively fast when there is not contention or only one thread accessing it. Contention happens when there are multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPU extensively use cache and divide cache into multi-level to get high performance at a low price. The widely used cpu in Baidu which is Intel E5-2620 has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 cache is owned by each core, while L3 cache is shared by all cores. Although it is fast for one core to write data into its own L1 cache(4 cycles, 2ns), the data in L1 cache should be also seen by another core when it needs writing or reading from corresponding address. To application, this process is atomic and no instructions can be interleaved. Application must wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes longer time compared to other operations. It involves a complicated algorithm which takes approximately 700ns in E5-2620 when highly contented. So it is slow to access the memory shared by multiple threads.
In order to improve performance, we need to avoid synchronizing cacheline in CPU. This is not only related to the performance of the atomic instruction itself, but also affect the overall performance of the program. For example, the effect of using spinlock is still poor in some small critical area scenarios. The problem is that the instruction of exchange, fetch_add and other instructions used to implement spinlock must be executed after the latest cacheline has been synchronized. Although it involves only a few instructions, it is not surprising that these instructions spend a few microseconds. The most effective solution is straightforward: **aviod sharing as possible as you can**. Avoiding contention from the beginning is the best.
In order to improve performance, we need to avoid synchronizing cacheline in CPU. This is not only related to the performance of the atomic instruction itself, but also affect the overall performance of the program. For example, the effect of using spinlock is still poor in some small critical area scenarios. The problem is that the instruction of exchange, fetch_add and other instructions used to implement spinlock must be executed after the latest cacheline has been synchronized. Although it involves only a few instructions, it is not surprising that these instructions spend a few microseconds. The most effective solution is straightforward: **avoid sharing as possible as you can**. Avoiding contention from the beginning is the best.
- A program using a global multiple-producer-multiple-consumer(MPMC) queueis hard to have multi-core scalability, since the limit throughput of this queue depends on the delay of cpu cache synchronization, rather than the number of cores. It is a best practice to use multiple SPMC or multiple MPSC queue, or even multiple SPSC queue instead, avoid contention at the beginning.
- Another example is global counter. If all threads modifies a global variable frequently, the performance would be poor because all cores are busy synchronizing the same cacheline.
- A program using a global multiple-producer-multiple-consumer(MPMC) queue is hard to have multi-core scalability, since the limit throughput of this queue depends on the delay of cpu cache synchronization, rather than the number of cores. It is a best practice to use multiple SPMC or multiple MPSC queue, or even multiple SPSC queue instead, avoid contention at the beginning.
- Another example is global counter. If all threads modify a global variable frequently, the performance would be poor because all cores are busy synchronizing the same cacheline. If the counter is only used to print logs or something like that, we can let each thread modify thread-local variables and combine all the data when need. This may cause performance difference several times.
# Memory fence
Only using atomic addition cannot achieve access control to resources, codes that seem correct may crash as well even for simple [spinlocks](https://en.wikipedia.org/wiki/Spinlock) or [reference count](https://en.wikipedia.org/wiki/Reference_counting). The key point here is that **instruction reorder** change the order of write and read. The instructions(including visiting memory) behind can be reordered to front if there are no dependencies. [Compiler](http://preshing.com/20120625/memory-ordering-at-compile-time/) and [CPU](https://en.wikipedia.org/wiki/Out-of-order_execution) both may do this reordering. The motivation is very natural: cpu should be filled with instructions in every cycle to execute as many as possible instructions in unit time. For example,
```c++
// Thread 1
// ready was initialized to false
p.init();
ready = true;
```
```c++
// Thread2
if (ready) {
p.bar();
}
```
From the view of human, this code seems correct because thread2 will access `p` only when `ready` is true and that happens after the initilization of p according to thread1. But for multi-core machines, this code may not run as expected:
- `ready = true` in thread1 may be reordered to the position before `p.init()` by compiler or cpu, then when thread2 has seen that `ready` is true, `p` is still not initialized.
- Even if there is no reordering, the value of `ready` and `p` may be synchronized to the cache of core that thread2 runs independently, thread2 may still call `p.bar()` but `p` is not initialized when `ready` is true. The same situation may happens for thread2 as well. For example, some instructions may be reordered to the position before checking `ready`.
Note: In x86, `load` has acquire semantic, and `store` release semantic, so the code above can run correctly if not considering the reordering done by compiler and cpu.
With this simple example, you can get a glimpse of the complexity of programming using atomic instructions. In order to solve this problem, you can use [memory fence](http://en.wikipedia.org/wiki/Memory_barrier) to let programmer decide the visibility relationship between instructions. Boost and C++11 makes an abstraction of memory fence, which can be concluded as following several memory order:
| memory order | 作用 |
| -------------------- | ---------------------------------------- |
| memory_order_relaxed | there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed |
| memory_order_consume | no reads or writes in the current thread dependent on the value currently loaded can be reordered before this load. |
| memory_order_acquire | no reads or writes in the current thread can be reordered before this load. |
| memory_order_release | no reads or writes in the current thread can be reordered after this store. |
| memory_order_acq_rel | No memory reads or writes in the current thread can be reordered before or after this store. |
| memory_order_seq_cst | Any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order. |
Using memory order, above example can be modified as such:
```c++
// Thread1
// ready was initialized to false
p.init();
ready.store(true, std::memory_order_release);
```
```c++
// Thread2
if (ready.load(std::memory_order_acquire)) {
p.bar();
}
```
The acquire in thread2 is matched with the release in thread1, making sure that thread2 can see all memory operations before release operation in thread1 when `ready` is equal to true in thread2.
Notice that memory fence is not equal to visibility. Even though thread2 read the value of ready just after thread1 set it to true, it cannot be guaranteed that thread2 read the newest value, which is caused by the delay of cache synchronization. Memory fence only guarantees the order of visibility: if the program can read the newest value of a, then it can also read the newest value of b. Why does cpu not notify the program when the newest data is ready? First, the delay of read will be increased. Second, read is busy synchronizing when there are plenty of write, making read starve. What's more, Even if the program has read the newest value, this value could be changed when modification instruction is issued, making this policy meaningless.
Another problem is that if the program read the old value, then nothing should be done, but how does the program know whether it is the old value or the new value? Generally there are two cases:
- Value is special. In above example, `ready=true` is a special value. Reading true from ready means that it is the new value. In this case, every special value has a meaning.
- Always Accumulate. In some situations, there are no special values, we can use instructions like `fetch_add` to accumulate variables. As long as the range of value is big enough, the new value is different from the old value in a long period of time so that we can distinguish them from each other.
More examples can be found [here]((http://www.boost.org/doc/libs/1_56_0/doc/html/atomic/usage_examples.html) in boost.atomic. The official description of atomic can be found [here](http://en.cppreference.com/w/cpp/atomic/atomic).
# wait-free & lock-free
Atomic instructions can provide us two important properties: [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom)[lock-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom). Wait-free means no matter how os(operating system) schedules threads, every thread is doing useful work; lock-free is weaker than wait-free, which means no matter how os schedules threads, at least one thread is doing useful work. If locks are used in the program, then os may schedule out the thread holding the lock in which case all threads trying to hold the same lock is waiting. So Using locks are not lock-free and wait-free. To make sure one task is done always within a determined time, the critical path in real-time os is at least lock-free. In our extensive and diverse online service, the property of real-time is demanded eagerly. If the most critical path in brpc satisfies wait-free of lock-free, we can provide a more stable quality of service. For example, since [fd](https://en.wikipedia.org/wiki/File_descriptor) is only suitable for being manipulated by a single thread, atomic instructions are used in brpc to maximize the concurrency of read and write of fd which is discussed more deeply in [here](io.md).
please notice that it is common to think that algorithms using wait-free or lock-free could be faster, but the truth may be the opposite, because:
- Race condition and ABA problem must be handled in lock-free and wait-free algorithms, which means the code may be more complex than that using locks when doing the same task.
- Using mutex has an effect of backoff. Backoff means that when contention happens, it will find another way to avoid fierce contention. The thread getting an locked mutex will be put into sleep state to avoid cacheline synchronization, making the thread holding that mutex can complete the task quickly, which may increase the overall throughput.
The low performance caused by mutex is because either the critical area is too big(limits the concurrency), or the critical area is too small(the overhead of context switch becomes prominent, adaptive mutex should be considered to use). The value of lock-free/wait-free is that it guarantees one thread or all threads always doing useful work, not for absolute high performance. But algorithms using lock-free/wait-free may probably have better performance in the situation where algorithm itself can be implemented using a few atomic instructions. Atomic instructions are also used to implement mutex, so if the algorithm can be done just using one or two atomic instructions, it could be faster than that using mutex.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment