Commit af78a11c authored by gejun

Polish atomic_instructions.md and add mutual links

parent 09b16834
[English version](../en/atomic_instructions.md)
We all know that locks are commonly used in multi-core programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when multiple threads modify the same data. When locks become a performance bottleneck, we try to bypass them and inevitably run into atomic instructions. In practice, however, writing correct code with atomic instructions is very difficult: elusive race conditions, the [ABA problem](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are mind-bending. This article tries to get you started by introducing atomic instructions on [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) architectures. C++11 formally introduced [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic), so we describe them in its syntax.
As the name implies, an atomic instruction is indivisible **to software**. For example, x.fetch_add(n) atomically adds n to x: to software the instruction either has not happened yet or has completed, and no intermediate state can be observed. Common atomic instructions include:
| Atomic instructions (the type of x is std::atomic\<int\>) | Description |
| ---------------------------------------- | ---------------------------------------- |
| x.load() | return the value of x. |
| x.store(n) | store n into x, return nothing. |
| x.exchange(n) | set x to n, return the value just before the modification. |
| x.compare_exchange_strong(expected_ref, desired) | if x is equal to expected_ref, set x to desired and return true; otherwise write the current x into expected_ref and return false. |
| x.compare_exchange_weak(expected_ref, desired) | may fail spuriously compared to compare_exchange_strong. |
| x.fetch_add(n), x.fetch_sub(n) | do x += n, x -= n atomically, return the value just before the modification. |

You can already do atomic counting with these instructions (a sketch follows the list below), for example letting multiple threads accumulate an atomic variable that counts operations on shared resources. However, two problems may arise:

- the operation is not as fast as you may think;
- even if multi-threaded accesses to some resources are controlled by a few atomic instructions that look correct, the program still has a good chance to crash.
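To make the table concrete, here is a minimal sketch of atomic counting (the thread count and loop bound are arbitrary): several threads bump one shared std::atomic\<int\> with fetch_add. It is correct, but as the next section explains, a heavily contended fetch_add is much slower than an uncontended one.

```c++
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([&counter] {
            for (int j = 0; j < 100000; ++j) {
                // Atomically increment; no ordering with other data is needed here.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    printf("counter=%d\n", counter.load());  // always 400000
    return 0;
}
```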
# Cacheline
An atomic operation that has no contention or is accessed by only one thread is fairly fast. "Contention" means multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries) at the same time. To get high performance at a low price, modern CPUs make heavy use of caches and split them into several levels. The Intel E5-2620 widely used inside Baidu has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Writing to its own L1 cache is extremely fast for a core (4 cycles, ~2ns), but when another core reads or writes the same memory, it has to make sure it sees the corresponding cachelines of the other cores. For software this process is atomic: no other code can be interleaved in between, and the CPU has to finish [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), a complicated hardware algorithm that makes atomic operations slow under contention. On an E5-2620, a heavily contended fetch_add takes around 700 nanoseconds. Accessing memory that is frequently shared by multiple threads is generally slow. For example, some critical sections look small, yet the spinlock protecting them performs poorly, because the exchange, fetch_add etc. used by the spinlock must wait for the latest cacheline; although they look like just a few instructions, it is not surprising that they take several microseconds.
To improve performance, we must keep the CPU from synchronizing cachelines frequently. This concerns not only the performance of the atomic instruction itself, but also the overall performance of the program. The most effective solution is straightforward: **avoid sharing as much as possible**.
- A program relying on a global multiple-producer multiple-consumer (MPMC) queue can hardly scale well on many cores, because the peak throughput of the queue is limited by the latency of cache synchronization rather than by the number of cores. It is better to use multiple SPMC or MPSC queues, or even multiple SPSC queues instead, to avoid the contention at its source.
- Another example is counters. If all threads frequently modify one counter, performance will be poor, again because different cores keep synchronizing the same cacheline. If the counter is only used for things like logging, we can let each thread modify a thread-local variable and merge the values from all threads only when needed, which may make a [performance difference of dozens of times](bvar.md).
A related programming trap is false sharing: accesses to rarely modified or even read-only variables are significantly slowed down because other variables in the same cacheline are frequently modified, forcing the cacheline to be synchronized again and again. Variables used by multiple threads should be laid out according to their access patterns, and variables frequently modified by other threads should be put into separate cachelines. To align a variable or struct by cacheline, include \<butil/macros.h\> and use the BAIDU_CACHELINE_ALIGNMENT macro; grep the brpc source code for usage.
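For example, here is a minimal sketch assuming brpc's \<butil/macros.h\> is on the include path (the struct and field names are hypothetical): the frequently-written counter gets its own cacheline so that readers of the rarely-changed configuration are not slowed down by it.

```c++
#include <atomic>
#include <butil/macros.h>  // BAIDU_CACHELINE_ALIGNMENT

struct Config {                 // rarely modified, read by many threads
    int max_concurrency;
    int timeout_ms;
};

struct BAIDU_CACHELINE_ALIGNMENT HotCounter {  // starts at a cacheline boundary
    std::atomic<long> count{0};                // frequently modified
};

struct ServerState {
    Config config;
    HotCounter hot_counter;     // aligned away from `config`
};
```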
# Memory fence
Atomic counting alone cannot control access to resources; even code as simple as a [spinlock](https://en.wikipedia.org/wiki/Spinlock) or [reference counting](https://en.wikipedia.org/wiki/Reference_counting) that looks correct may crash. The key here is that **instruction reordering** changes the order of reads and writes: as long as there are no dependencies, instructions later in the code may be moved ahead, and both the [compiler](http://preshing.com/20120625/memory-ordering-at-compile-time/) and the [CPU](https://en.wikipedia.org/wiki/Out-of-order_execution) do this.

The motivation is natural: the CPU wants to fill every cycle and run as many instructions as possible per unit of time. As mentioned in the previous section, a memory access may spend hundreds of nanoseconds waiting for cacheline synchronization, so it is most efficient to synchronize several cachelines at the same time rather than one by one. As a result, a thread's successive modifications of several variables may become visible on the core running another thread in a different order. On the other hand, different threads need different data, and synchronizing on demand may also make the order in which cachelines are read differ from the order in which they were written.

If the first of these variables acts as a switch controlling access to the following variables, then when the variables are synchronized to other cores together, the update order may have changed and the first variable is not necessarily the first one updated; yet another thread may still believe it means the other variables are valid, access variables that have actually been deleted, and trigger undefined behavior. For example, the following code snippet:
```c++
// Thread 1
// bool ready was initialized to false
p.init();
ready = true;

// Thread 2
if (ready) {
    p.bar();
}
```
From a human point of view this is correct: thread 2 accesses p only when ready is true, and by thread 1's logic p should have been initialized by then. But on a multi-core machine this code may not run as expected:
- ready = true in thread 1 may be reordered before p.init() by the compiler or the CPU, so when thread 2 sees ready being true, p may still be uninitialized. The same can happen in thread 2: some code in p.bar() may be reordered before the check of ready.
- Even without any reordering, the values of ready and p are synchronized to the cache of thread 2's core independently, so thread 2 may still see an uninitialized p while seeing ready being true.
Note: on x86/x64, load carries acquire semantics and store carries release semantics, so leaving aside reordering by the compiler and the CPU, the code above can run correctly.
This simple example gives you a glimpse of the complexity of programming with atomic instructions. To solve the problem, CPUs and compilers provide [memory fences](http://en.wikipedia.org/wiki/Memory_barrier), which let users declare visibility relationships between memory accesses. boost and C++11 abstract memory fences into the following [memory orders](http://en.cppreference.com/w/cpp/atomic/memory_order):
| memory order | Description |
| -------------------- | ---------------------------------------- |
| memory_order_relaxed | no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed |
| memory_order_consume | no reads or writes in the current thread dependent on the value currently loaded can be reordered before this load |
| memory_order_acquire | no reads or writes in the current thread can be reordered before this load |
| memory_order_release | no reads or writes in the current thread can be reordered after this store |
| memory_order_acq_rel | both an acquire and a release operation: no reads or writes in the current thread can be reordered before the load, nor after the store |
| memory_order_seq_cst | any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order |
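With these memory orders, the example above can be corrected roughly as follows (assuming `ready` is a `std::atomic<bool>` initialized to false):

```c++
// Thread1
p.init();
ready.store(true, std::memory_order_release);

// Thread2
if (ready.load(std::memory_order_acquire)) {
    p.bar();
}
```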
The acquire in thread 2 pairs with the release in thread 1, ensuring that when thread 2 sees ready == true, it also sees all memory operations performed before the release in thread 1.
Note that a memory fence does not equal visibility: even if thread 2 happens to read ready right after thread 1 sets it to true, it is not guaranteed to see true, because cache synchronization takes time. What a memory fence guarantees is the order of visibility: "if I see the latest value of a, then I must also see the latest value of b".
A related question: how do we know whether the value we see is new or old? Generally there are two cases:
- The value is special. In the example above, ready = true is a special value: as soon as thread 2 sees ready being true, it knows an update happened. With special values designated, reading them or not both carry meaning.
- Always accumulate. In some scenarios there is no special value; then we keep accumulating a variable with instructions like fetch_add. As long as the value range is large enough, a new value differs from all previous values for a long time, so we can tell them apart.
More examples about memory orders can be found in [boost.atomic](http://www.boost.org/doc/libs/1_56_0/doc/html/atomic/usage_examples.html); the official description of atomic is [here](http://en.cppreference.com/w/cpp/atomic/atomic).
# wait-free & lock-free
Atomic instructions can give our services two important properties: [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) and [lock-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom). The former means that no matter how the OS schedules threads, every thread is always doing useful work; the latter is weaker, meaning that no matter how the OS schedules threads, at least one thread is doing useful work. If our service uses a lock, the OS may switch out a thread that has just acquired it, and every thread depending on that lock then waits without doing useful work, so code using locks is not lock-free, let alone wait-free. To make sure something always finishes within a bounded time, critical code in real-time systems is at least lock-free. Baidu's broad and diverse online services also have strict latency requirements; if the most critical parts of the RPC are wait-free or lock-free, we can provide more stable quality of service. In fact, reads and writes in brpc are wait-free; see [IO](io.md).
It is worth noting the common belief that lock-free or wait-free algorithms are faster; the opposite may be true, because:
- lock-free and wait-free algorithms must handle more and more complex race conditions and ABA problems, so code doing the same job is more complicated than its lock-based counterpart. More code, more time to run.
- mutex-based algorithms come with a "backoff" effect. Backoff means trying another path to temporarily avoid contention when contention happens: a contended mutex puts the caller to sleep, letting the thread that got the lock quickly finish a whole series of work exclusively, so the overall throughput may actually be higher.
Poor performance with a mutex is usually caused by a critical section that is too large (limiting concurrency) or by contention that is too fierce (making context-switch overhead prominent). The value of lock-free/wait-free algorithms is that they guarantee that one or all threads are always doing useful work, not absolute high performance. However, there is one situation where lock-free and wait-free algorithms are usually faster: when the algorithm itself can be implemented with a few atomic instructions. Since implementing a lock requires atomic instructions as well, an algorithm that can be done in one or two instructions is certainly faster than additionally taking a lock (see the sketch below).
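To make the last point concrete, below is a minimal test-and-set spinlock sketch (not brpc's implementation): the lock itself is nothing but an atomic exchange plus memory ordering, which is why an algorithm that needs only one or two atomic instructions may as well skip the lock.

```c++
#include <atomic>

class SpinLock {
public:
    void lock() {
        // exchange returns the previous value; spin until we flip false -> true.
        while (_locked.exchange(true, std::memory_order_acquire)) {
            // Busy-wait. A real lock would back off or sleep on contention.
        }
    }
    void unlock() {
        _locked.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> _locked{false};
};
```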
[中文版](../cn/atomic_instructions.md)

We know that locks are extensively used in multi-threaded programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying shared data. When a lock becomes a bottleneck, we try to walk around it by using atomic instructions. But it is generally difficult to write correct code with atomic instructions, and race conditions, [ABA problems](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are hard to understand. This article tries to introduce the basics of atomic instructions (under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)). Since [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) are formally introduced in C++11, we use their APIs directly.

As the name implies, an atomic instruction cannot be divided into sub-instructions. For example, `x.fetch_add(n)` atomically adds n to x; no intermediate state is observable **to software**. Common atomic instructions are listed below:
| Atomic instructions (the type of x is std::atomic\<int\>) | Description |
| ---------------------------------------- | ---------------------------------------- |
| x.load() | return the value of x. |
| x.store(n) | store n to x and return nothing. |
| x.exchange(n) | set x to n and return the value just before the modification |
| x.compare_exchange_strong(expected_ref, desired) | If x is equal to expected_ref, set x to desired and return true. Otherwise write current x to expected_ref and return false. |
| x.compare_exchange_weak(expected_ref, desired) | may fail [spuriously](http://en.wikipedia.org/wiki/Spurious_wakeup) compared to compare_exchange_strong |
| x.fetch_add(n), x.fetch_sub(n) | do x += n, x -= n atomically. Return the value just before the modification. |
You can already use these instructions to count things atomically, such as counting the number of operations from multiple threads. However, two problems may arise:
- The operation is not as fast as expected.
- Even if multi-threaded accesses to some resources are controlled by a few atomic instructions that seem correct, the program still has a great chance to crash.
# Cacheline
An atomic instruction is fast when there is no contention or it is accessed by only one thread. "Contention" happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs extensively use caches and divide them into multiple levels to get high performance at a low price. The Intel E5-2620 widely used in Baidu has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is very fast for a core to write data into its own L1 cache (4 cycles, ~2ns), making that data visible to other cores is not, because the cachelines touched by the data have to be synchronized to those cores. This process is atomic and transparent to software: no instructions can be interleaved in between, and the application has to wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes much longer than writing the local cache. The complicated hardware algorithm involved makes atomic instructions slow under high contention: a single fetch_add may take more than 700ns on an E5-2620 when a few threads contend heavily on it. Accesses to memory frequently shared and modified by multiple threads are generally not fast. For example, even if a critical section looks small, the spinlock protecting it may still perform poorly, because instructions used in the spinlock such as exchange and fetch_add need to wait for the latest cachelines; although they look like just a few instructions, it is not surprising that they take several microseconds.
In order to improve performance, we need to avoid synchronizing cachelines frequently, which affects not only the performance of the atomic instruction itself, but also the overall performance of the program. The most effective solution is straightforward: **avoid sharing as much as possible**. Avoiding contention from the very beginning is the best strategy:
- A program relying on a global multiple-producer multiple-consumer (MPMC) queue is hard to scale well on many CPU cores, since the throughput of the queue is limited by delays of cache coherence rather than the number of cores. It would be better to use multiple SPMC or MPSC queues, or even SPSC queues instead, to avoid contention from the beginning.
- Another example is a global counter. If all threads modify a global variable frequently, performance will be poor because all cores are busy synchronizing the same cacheline. If the counter is only used for printing logs periodically or something like that, we can let each thread modify its own thread-local variable and combine all thread-local data for a read, resulting in [much better performance](bvar.md) (see the sketch below).
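Below is a minimal sketch of the thread-local combination idea (this is not bvar's implementation; the registration scheme and names are made up): each thread bumps its own counter, and a reader sums all per-thread counters on demand, so writes never contend on a shared cacheline.

```c++
#include <atomic>
#include <mutex>
#include <vector>

static std::mutex g_mutex;
static std::vector<const std::atomic<long>*> g_all_counters;

static std::atomic<long>& local_counter() {
    // One counter per thread, registered once on first use (intentionally
    // leaked in this sketch to keep it short).
    thread_local std::atomic<long>* counter = [] {
        auto* c = new std::atomic<long>(0);
        std::lock_guard<std::mutex> guard(g_mutex);
        g_all_counters.push_back(c);
        return c;
    }();
    return *counter;
}

void add_count() {
    local_counter().fetch_add(1, std::memory_order_relaxed);  // no contention
}

long read_count() {
    std::lock_guard<std::mutex> guard(g_mutex);
    long sum = 0;
    for (const auto* c : g_all_counters) {
        sum += c->load(std::memory_order_relaxed);
    }
    return sum;  // possibly slightly stale, but cheap for the writers
}
```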
A related programming trap is false sharing: accesses to infrequently updated or even read-only variables are significantly slowed down because other variables in the same cacheline are frequently updated. Variables used in multi-threaded code should be grouped by access frequency and pattern, and variables frequently modified by other threads should be put into separate cachelines. To align a variable or struct by cacheline, `include <butil/macros.h>` and tag it with the macro `BAIDU_CACHELINE_ALIGNMENT`; grep the source code of brpc for examples.
# Memory fence
Atomic counting alone cannot synchronize accesses to resources; even simple structures like [spinlock](https://en.wikipedia.org/wiki/Spinlock) or [reference counting](https://en.wikipedia.org/wiki/Reference_counting) that seem correct may crash. The key is **instruction reordering**: as long as there are no dependencies, instructions later in the code may be moved ahead of earlier ones, and both the [compiler](http://preshing.com/20120625/memory-ordering-at-compile-time/) and the [CPU](https://en.wikipedia.org/wiki/Out-of-order_execution) may reorder.
The motivation is natural: the CPU wants to fill every cycle and execute as many instructions as possible within a given time. As the section above says, a memory access may spend hundreds of nanoseconds waiting for cacheline synchronization, so it is more efficient to synchronize multiple cachelines simultaneously rather than one by one. Thus modifications to multiple variables by one thread may become visible to another thread in a different order. On the other hand, different threads need different data, so synchronizing on demand is reasonable and may also change the order between cachelines.
If the first variable plays the role of a switch that controls accesses to the following variables, then when these variables are synchronized to other CPU cores, the new values may become visible in a different order: the first variable may not be the first one updated, so other threads may still believe the following variables are valid and access data that has actually been deleted, causing undefined behavior. For example:
```c++
// Thread 1
// bool ready was initialized to false
p.init();
ready = true;

// Thread 2
if (ready) {
    p.bar();
}
```
From a human perspective, the code is correct because thread2 accesses `p` only when `ready` is true, which means p has been initialized according to the logic in thread1. But the code may not run as expected on multi-core machines:
- `ready = true` in thread1 may be reordered before `p.init()` by the compiler or CPU, making thread2 see an uninitialized `p` when `ready` is true. The same reordering may happen in thread2 as well: some instructions in `p.bar()` may be reordered before the check of `ready`.
- Even if the above reordering does not happen, the cachelines of `ready` and `p` may be synchronized independently to the CPU core running thread2, which may still see an uninitialized `p` when `ready` is true.
Note: on x86/x64, `load` has acquire semantics and `store` has release semantics by default, so the code above can run correctly provided that reordering by the compiler is ruled out.
With this simple example, you may get a glimpse of the complexity of programming with atomic instructions. To solve the reordering issue, CPUs and compilers offer [memory fences](http://en.wikipedia.org/wiki/Memory_barrier), which let programmers decide the order of visibility between instructions. boost and C++11 abstract memory fences into the following [memory orders](http://en.cppreference.com/w/cpp/atomic/memory_order):
| memory order | Description |
| -------------------- | ---------------------------------------- |
| memory_order_relaxed | there are no synchronization or ordering constraints imposed on other reads or writes, only this operation's atomicity is guaranteed |
| memory_order_consume | no reads or writes in the current thread dependent on the value currently loaded can be reordered before this load. |
| memory_order_acquire | No reads or writes in the current thread can be reordered before this load. |
| memory_order_release | No reads or writes in the current thread can be reordered after this store. |
| memory_order_acq_rel | Both an acquire and a release operation. No memory reads or writes in the current thread can be reordered before the load, nor after the store. |
| memory_order_seq_cst | Any operation with this memory order is both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order. |
With these memory orders, the example above can be modified as follows:
```c++
// Thread1
// std::atomic<bool> ready, initialized to false
p.init();
ready.store(true, std::memory_order_release);

// Thread2
if (ready.load(std::memory_order_acquire)) {
    p.bar();
}
```
The acquire in thread2 matches the release in thread1, making sure that thread2 sees all memory operations that happened before the release in thread1 once it sees `ready` being true.
Note that a memory fence does not guarantee visibility: even if thread2 reads `ready` just after thread1 sets it to true, thread2 is not guaranteed to see the new value, because cache synchronization takes time. A memory fence guarantees the order of visibility: "if I see the new value of a, I must see the new value of b as well".
A related problem: how do we know whether a value is new or old? Generally there are two cases:
- The value is special. In the example above, `ready=true` is a special value: once `ready` is true, `p` is ready. Reading the special value or not both carry meaning.
- Increasing-only values. Some situations do not have special values; then we can use instructions like `fetch_add` to keep increasing a variable. As long as the value range is large enough, new values stay different from old ones for a long period of time, so we can distinguish them from each other (see the sketch below).
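A sketch of the second approach (the names are made up; with a 64-bit counter, wrap-around is not a practical concern):

```c++
#include <atomic>
#include <cstdint>

static std::atomic<uint64_t> g_version(0);

// Called by the writer after it has published a new snapshot of the data.
void bump_version() {
    g_version.fetch_add(1, std::memory_order_release);
}

// Called by a reader that remembers the last version it observed.
bool has_new_data(uint64_t* last_seen) {
    const uint64_t v = g_version.load(std::memory_order_acquire);
    if (v == *last_seen) {
        return false;  // nothing changed since our last look
    }
    *last_seen = v;    // any different value means at least one update happened
    return true;
}
```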
More examples can be found in [boost.atomic](http://www.boost.org/doc/libs/1_56_0/doc/html/atomic/usage_examples.html). Official descriptions of atomic can be found [here](http://en.cppreference.com/w/cpp/atomic/atomic).
# wait-free & lock-free
Atomic instructions provide two important properties: [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom) and [lock-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Lock-freedom). Wait-free means that no matter how the OS schedules threads, all threads are doing useful jobs; lock-free, which is weaker than wait-free, means that no matter how the OS schedules threads, at least one thread is doing useful jobs. If locks are used, the thread holding the lock may be swapped out by the OS, in which case all threads trying to acquire the lock are blocked. So code using locks is neither lock-free nor wait-free. To make sure tasks are done within a bounded time, critical paths in real-time OSes are at least lock-free. The broad and diverse online services inside Baidu also pose serious restrictions on running time: if the critical path in brpc is wait-free or lock-free, many services benefit from better and more stable quality of service. In fact, both read (in the sense of event dispatching) and write in brpc are wait-free; check [IO](io.md) for more.
Note that it is common to think that wait-free or lock-free algorithms are faster, which may not be true, because:
- More and more complex race conditions and ABA problems must be handled in lock-free and wait-free algorithms, so the code doing the same job is often much more complicated than the one using locks. More code, more running time.
- A mutex resolves contention by backoff: when contention happens, another path is taken to avoid it temporarily. Threads that fail to lock a mutex are put to sleep, letting the thread holding the mutex finish the current task, or even several following tasks, exclusively, which may increase the overall throughput.
Low performance caused by a mutex comes either from an overly large critical section (which limits concurrency) or from overly heavy contention (where the overhead of context switches becomes dominant). The real value of lock-free/wait-free algorithms is that they guarantee progress of one thread or all threads, rather than absolutely high performance. Of course lock-free/wait-free algorithms do perform better in one situation: when an algorithm can be implemented with just one or two atomic instructions, it is probably faster than the one using a mutex, which is itself implemented with atomic instructions, as in the sketch below.
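For instance, below is a minimal sketch of a lock-free stack push whose whole body is a single CAS loop (pop is intentionally omitted: that is where the ABA problem and safe memory reclamation make things genuinely hard).

```c++
#include <atomic>

struct Node {
    int value;
    Node* next;
};

class LockFreeStack {
public:
    void push(Node* node) {
        node->next = _head.load(std::memory_order_relaxed);
        // Publish the node: release makes its fields visible to threads that
        // later acquire _head. On failure, compare_exchange_weak reloads the
        // current head into node->next, so we simply retry.
        while (!_head.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }
private:
    std::atomic<Node*> _head{nullptr};
};
```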