Commit 2ae045ba authored by gejun

reviewed part of atomic_instructions.md

parent ca7687c7
We all know that multi-core programming commonly uses locks to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when multiple threads modify the same data. When a lock becomes a performance bottleneck, we often try to bypass it and inevitably end up touching atomic instructions. In practice, however, writing correct code with atomic instructions is very hard: elusive race conditions, the [ABA problem](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are mind-bending. This article tries to get you started by introducing atomic instructions on [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing) architectures. C++11 formally introduced [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic), so we describe them with its syntax.
As the name implies, an atomic instruction is indivisible **to software**. For example, x.fetch_add(n) atomically adds n to x; **to software**, the instruction either has not happened at all or has completed, and no intermediate state can be observed. Common atomic instructions include:
| Atomic Instructions (type of x is std::atomic\<int\>) | Descriptions |
| ---------------------------------------- | ---------------------------------------- |
| x.load() | Return the value of x. |
| x.store(n) | Store n to x and return nothing. |
| x.exchange(n) | Set x to n and return the value just before the set. |
| x.compare_exchange_strong(expected_ref, desired) | If x equals expected_ref, set x to desired and return true; otherwise write the latest value of x into expected_ref and return false. |
| x.compare_exchange_weak(expected_ref, desired) | Compared to compare_exchange_strong, it may fail spuriously even when x equals expected_ref (similar to a [spurious wakeup](http://en.wikipedia.org/wiki/Spurious_wakeup)). |
| x.fetch_add(n), x.fetch_sub(n) | Atomically do x += n or x -= n and return the value just before the modification. |
You can already use these instructions for atomic counting, e.g. multiple threads incrementing one atomic variable at the same time to count how often they operate on some resources; a minimal sketch follows the list below. However, this may cause two problems:
- The operation is not as fast as you expect.
- If you try to control access to some resources with a few seemingly correct atomic operations, your program still has a good chance to crash.
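A minimal sketch of the counting scenario (the thread count and names here are made up for the example):

```c++
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> g_count(0);

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([] {
            for (int j = 0; j < 100000; ++j) {
                // Atomic increment: no update is lost even under contention.
                g_count.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    printf("count=%ld\n", g_count.load());  // always prints count=400000
    return 0;
}
```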
# Cacheline
An atomic operation is fairly fast when there is no contention or it is accessed by only one thread, where "contention" means multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries) at the same time. To get high performance at a low price, modern CPUs make heavy use of caches and split them into multiple levels. The Intel E5-2620 commonly seen inside Baidu has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache; L1 and L2 caches are private to each core, while the L3 cache is shared by all cores. A core writes to its own L1 cache extremely fast (4 cycles, ~2ns), but when another core reads or writes the same memory, it has to make sure it sees the corresponding cachelines of the other cores. To software this process is atomic: no other code can be interleaved in between, and the CPU can only wait for [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence) to complete. This complicated protocol makes atomic operations slow: under heavy contention a fetch_add takes about 700ns on E5-2620. Memory frequently shared and modified by multiple threads is therefore usually slow to access. For example, even when the critical section is tiny, a spinlock may still perform poorly, because the exchange, fetch_add and other instructions implementing the spinlock must wait for the latest cachelines; although they look like just a few instructions, it is not surprising for them to take several microseconds.
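A rough way to observe this cost (a benchmark sketch; the names are made up and absolute numbers depend heavily on the machine) is to time fetch_add with one thread and then with several threads hammering the same atomic variable:

```c++
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> g_value(0);
const long kOpsPerThread = 1000000;

// Roughly measures the average cost of one fetch_add when `nthread` threads
// keep modifying the same atomic variable. The interesting part is the gap
// between nthread == 1 and nthread >= 4, not the absolute numbers.
void bench(int nthread) {
    auto begin = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < nthread; ++i) {
        threads.emplace_back([] {
            for (long j = 0; j < kOpsPerThread; ++j) {
                g_value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::steady_clock::now();
    const double ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
    printf("threads=%d avg=%.1fns per fetch_add\n",
           nthread, ns / (kOpsPerThread * nthread));
}

int main() {
    bench(1);  // uncontended: a few nanoseconds per fetch_add
    bench(4);  // contended: much slower per operation
    return 0;
}
```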
To improve performance, keep the CPU from synchronizing cachelines frequently. This affects not only the performance of the atomic instructions themselves but also the overall performance of the program. The most effective solution is straightforward: **share as little as possible**. Avoiding contention at the source is the best; once there is contention, coordination is required, and coordination is always hard.
- A program relying on a global multiple-producer-multiple-consumer (MPMC) queue can hardly scale well across many cores, because the peak throughput of that queue is bounded by the latency of cache synchronization rather than the number of cores. It is better to use multiple SPMC or MPSC queues, or even multiple SPSC queues instead, avoiding contention at the source.
- Another example is a global counter. If all threads frequently modify a single global variable, performance will be poor, again because different cores keep synchronizing the same cacheline. If the counter is only used for things like logging, each thread can modify a thread-local variable instead and the per-thread values can be combined when needed, which may make a [difference of tens of times](bvar.md) in performance; a simplified sketch follows this list.
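A simplified sketch of the per-thread counter idea (the registration scheme below is made up for illustration; bvar in brpc implements the idea much more carefully, e.g. it also handles exiting threads):

```c++
#include <atomic>
#include <mutex>
#include <vector>

// Each thread owns a counter padded to its own cacheline, so the hot path
// never shares a cacheline with other writers.
struct alignas(64) LocalCounter {
    std::atomic<long> value{0};
};

std::mutex g_counters_mutex;
std::vector<LocalCounter*> g_counters;  // all registered per-thread counters

// Returns the calling thread's private counter, registering it on first use.
// Counters are intentionally leaked in this sketch to keep it short.
LocalCounter& local_counter() {
    thread_local LocalCounter* counter = [] {
        LocalCounter* c = new LocalCounter;
        std::lock_guard<std::mutex> guard(g_counters_mutex);
        g_counters.push_back(c);
        return c;
    }();
    return *counter;
}

// Hot path: as fast as an uncontended atomic add.
inline void add_count(long n) {
    local_counter().value.fetch_add(n, std::memory_order_relaxed);
}

// Cold path (e.g. when printing a log line): combine all per-thread values.
long read_count() {
    std::lock_guard<std::mutex> guard(g_counters_mutex);
    long sum = 0;
    for (LocalCounter* c : g_counters) {
        sum += c->value.load(std::memory_order_relaxed);
    }
    return sum;
}
```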
If sharing cannot be avoided entirely, share as little as possible. In read-mostly scenarios, the write frequency can sometimes be lowered to reduce cacheline synchronizations and speed up the average read. A related programming pitfall is false sharing: variables that are rarely modified become significantly slower because another variable in the same cacheline is modified frequently, forcing them to wait for cacheline synchronization over and over. Arrange variables in multi-threaded code according to their access patterns, and put variables frequently modified by other threads in separate cachelines. To align a variable or struct to a cacheline, include <butil/macros.h> and use the BAIDU_CACHELINE_ALIGNMENT macro; grep the brpc code for its usage.
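A sketch of the cacheline-alignment idea in standard C++ (using alignas(64) directly instead of the brpc macro; 64 bytes is the typical x86 cacheline size):

```c++
#include <atomic>

// Without the alignment, `hot_counter` and `config_version` could share a
// 64-byte cacheline, and readers of the rarely-changed `config_version`
// would keep paying for cacheline synchronization caused by writes to
// `hot_counter` (false sharing).
struct Stats {
    // Frequently modified by many threads: isolated in its own cacheline.
    alignas(64) std::atomic<long> hot_counter{0};
    // Rarely modified, read by many threads.
    alignas(64) std::atomic<int> config_version{0};
};
```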
We know that locks are extensively used in multi-thread programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying the same data. When a lock becomes a bottleneck, we try to work around it with atomic instructions. But it is difficult to write correct code with atomic instructions in practice: race conditions, [ABA problems](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are hard to understand. This article tries to introduce some basics of atomic instructions (under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)). Since [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) were formally introduced in C++11, we use their APIs directly.
As the name implies, atomic instructions cannot be divided into sub-instructions. For example, `x.fetch_add(n)` atomically adds n to x; no intermediate state is observable **to software**. Common atomic instructions are listed below:
| Atomic Instructions (type of x is std::atomic\<int\>) | Descriptions |
| ---------------------------------------- | ---------------------------------------- |
| x.load() | Return the value of x. |
| x.store(n) | Store n to x and return nothing. |
| x.exchange(n) | Set x to n and return the value just before the atomic set. |
| x.compare_exchange_strong(expected_ref, desired) | If x equals expected_ref, set x to desired and return true; otherwise write the current x to expected_ref and return false. |
| x.compare_exchange_weak(expected_ref, desired) | Compared to compare_exchange_strong, it may fail spuriously even when x equals expected_ref (similar to a [spurious wakeup](http://en.wikipedia.org/wiki/Spurious_wakeup)). |
| x.fetch_add(n), x.fetch_sub(n) | Atomically do x += n or x -= n and return the value just before the modification. |
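As an illustration of the compare-exchange loop (the function and variable names are made up for this sketch), the snippet below atomically keeps the maximum of values reported by multiple threads; compare_exchange_weak may fail spuriously or because another thread changed the variable, which is why it sits in a retry loop:

```c++
#include <atomic>

// Atomically raise `maxv` to at least `v`.
void update_max(std::atomic<int>& maxv, int v) {
    int observed = maxv.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak refreshes `observed` with the latest
    // value of maxv, so we simply re-check the condition and retry.
    while (observed < v && !maxv.compare_exchange_weak(observed, v)) {
    }
}
```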
You can already use these instructions to count things atomically, such as counting operations on resources used by multiple threads simultaneously. However, two problems may arise:
- The operation is not as fast as expected.
- If multi-threaded accesses to some resources are controlled by a few atomic operations that seem correct, the program still has a great chance to crash.
# Cacheline
An atomic instruction is fast when there is no contention or it is accessed by only one thread. "Contention" happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). To get high performance at a low price, modern CPUs use caches extensively and divide them into multiple levels. The Intel E5-2620 widely used in Baidu has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is very fast for a core to write data into its own L1 cache (4 cycles, ~2ns), making that data visible to other cores is not, because the cachelines touched by the data have to be synchronized to those cores. This process is atomic and transparent to software, and no instructions can be interleaved in between: the application has to wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes much longer than writing the local cache. This complicated protocol makes atomic instructions slow under high contention: a single fetch_add may take more than 700ns on E5-2620 when a few threads contend heavily on it. Accesses to memory frequently shared and modified by multiple threads are generally not fast. For example, even if the critical section is small, a spinlock may still not work well, because the exchange, fetch_add and other instructions used to implement the spinlock have to wait for the latest cachelines; it is not surprising to see one or two such instructions take several microseconds (a minimal spinlock sketch follows).
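To make this concrete, here is a minimal test-and-set spinlock sketch (not brpc's implementation, just an illustration): every exchange has to own the cacheline of the flag exclusively, so under contention each acquisition pays the coherence latency described above.

```c++
#include <atomic>

// A minimal test-and-set spinlock, for illustration only.
class SpinLock {
public:
    void lock() {
        // exchange() must gain exclusive ownership of the cacheline holding
        // `_locked`, which is where the contention cost comes from.
        while (_locked.exchange(true, std::memory_order_acquire)) {
            // Spin on plain loads to avoid hammering the cacheline with
            // writes while another thread holds the lock.
            while (_locked.load(std::memory_order_relaxed)) {
            }
        }
    }
    void unlock() {
        _locked.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> _locked{false};
};
```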
In order to improve performance, we need to avoid synchronizing cachelines frequently, which affects not only the performance of the atomic instruction itself but also the overall performance of the program. The most effective solution is straightforward: **avoid sharing as much as possible**. Avoiding contention from the very beginning is the best strategy:
- A program relying on a global multiple-producer-multiple-consumer (MPMC) queue can hardly scale well on many CPU cores, since the throughput of the queue is limited by the delays of cache coherence rather than the number of cores. It is better to use multiple SPMC or MPSC queues, or even SPSC queues instead, to avoid contention from the beginning.
- Another example is a global counter. If all threads modify a global variable frequently, performance will be poor because all cores are busy synchronizing the same cacheline. If the counter is only used for printing logs periodically or something like that, we can let each thread modify its own thread-local variable and combine all the thread-local data on a read, resulting in [much better performance](bvar.md).
# Memory fence