Commit fb1bdffa authored by zhujiashun

Translating docs/en/atomic_instructions.md

parent 4e7c7bab
......@@ -18,7 +18,7 @@
# Cacheline
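Atomic operations that face no contention, or that are accessed by only one thread, are relatively fast. "Contention" means multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries) at the same time. To get high performance at a low price, modern CPUs use caches extensively and organize them into multiple levels. The Intel E5-2620, common inside Baidu, has 32K of L1 dcache and icache, 256K of L2 cache and 15M of L3 cache. L1 and L2 caches are private to each core, while L3 is shared by all cores. A core writing to its own L1 cache is extremely fast (4 cycles, 2 ns), but when another core reads or writes the same memory, it must make sure it sees the corresponding cacheline held by the other core. To software this process is atomic: no other code can be interleaved, and it can only wait for the CPU to finish [coherence synchronization](https://en.wikipedia.org/wiki/Cache_coherence). This complicated algorithm takes far longer than other operations, roughly 700ns on the E5-2620 under heavy contention. So accessing memory that is frequently shared by multiple threads is relatively slow.
Atomic operations that face no contention, or that are accessed by only one thread, are relatively fast. "Contention" means multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries) at the same time. To get high performance at a low price, modern CPUs use caches extensively and organize them into multiple levels. The Intel E5-2620, common inside Baidu, has 32K of L1 dcache and icache, 256K of L2 cache and 15M of L3 cache. L1 and L2 caches are private to each core, while L3 is shared by all cores. A core writing to its own L1 cache is extremely fast (4 cycles, 2ns), but when another core reads or writes the same memory, it must make sure it sees the corresponding cacheline held by the other core. To software this process is atomic: no other code can be interleaved, and it can only wait for the CPU to finish [coherence synchronization](https://en.wikipedia.org/wiki/Cache_coherence). This complicated algorithm takes far longer than other operations, roughly 700ns on the E5-2620 under heavy contention. So accessing memory that is frequently shared by multiple threads is relatively slow.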
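To improve performance, avoid making the CPU synchronize cachelines. This concerns not only the performance of the atomic instructions themselves but also the overall performance of the program. For example, in some scenarios with very small critical sections a spinlock still performs poorly, because the exchange, fetch_add and similar instructions used to implement the spinlock cannot complete until the CPU has synchronized the latest cacheline. It looks like only a few instructions, yet it is not surprising for them to take several microseconds. The most effective solution is straightforward: **avoid sharing as much as possible**. Avoiding contention at the source is best: contention requires coordination, and coordination is always hard.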
......
We all know that locks are commonly used in multi-threaded programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when several threads modify the same data. When a lock becomes a performance bottleneck, it is tempting to bypass it with atomic instructions. In practice, however, it is difficult to write correct code with atomic instructions: race conditions, the [ABA problem](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are all hard to reason about. This article helps you get started by introducing atomic instructions under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). [Atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) were formally introduced in C++11.
As the name suggests, an atomic instruction cannot be divided into smaller steps. For example, `x.fetch_add(n)` atomically adds n to x, and no intermediate state can be observed by other threads. Common atomic instructions include:
| Atomic instruction (x is a `std::atomic<int>`) | Effect |
| ---------------------------------------- | ---------------------------------------- |
| x.load() | Return the value of x. |
| x.store(n) | Set x to n; nothing is returned. |
| x.exchange(n) | Set x to n and return the previous value. |
| x.compare_exchange_strong(expected_ref, desired) | If x equals expected_ref, set x to desired and return true. Otherwise write the current value of x into expected_ref and return false. |
| x.compare_exchange_weak(expected_ref, desired) | Like compare_exchange_strong, but it may fail spuriously, i.e. return false even when x equals expected_ref (comparable to a [spurious wakeup](http://en.wikipedia.org/wiki/Spurious_wakeup)), so it is normally called in a loop. |
| x.fetch_add(n), x.fetch_sub(n), x.fetch_xxx(n) | x += n, x -= n (or another read-modify-write operation); the value before modification is returned. |
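The table can be exercised on a single thread to see what each call returns. The following is a minimal sketch; `demo_atomic_ops` is a hypothetical helper name introduced only for illustration, and every call uses the default memory ordering.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not part of any library): walks through the
// operations from the table on one std::atomic<int>.
int demo_atomic_ops() {
    std::atomic<int> x(0);

    x.store(5);                        // x is now 5
    int old = x.exchange(7);           // returns the previous value, x is now 7
    assert(old == 5);

    int expected = 5;                  // deliberately stale guess
    bool ok = x.compare_exchange_strong(expected, 9);
    // x was 7, not 5: the call fails, x is unchanged and
    // expected is overwritten with the current value.
    assert(!ok && expected == 7);

    ok = x.compare_exchange_strong(expected, 9);  // guess is right now, x becomes 9
    assert(ok);

    int before = x.fetch_add(3);       // returns 9, x is now 12
    assert(before == 9);
    before = x.fetch_sub(2);           // returns 12, x is now 10
    assert(before == 12);

    // compare_exchange_weak may fail spuriously, so it is called in a loop.
    // Here the loop atomically doubles x.
    int cur = x.load();
    while (!x.compare_exchange_weak(cur, cur * 2)) {
        // cur has been refreshed with the current value of x; retry.
    }
    return x.load();
}
```

Calling `demo_atomic_ops()` from one thread returns 20, since the value 10 left by `fetch_sub` is doubled by the final loop.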
You can already use these instructions for atomic counting, for example letting multiple threads increment one atomic variable to count how many operations they perform on some resource. But this may cause two problems:
- The operation is not as fast as you expect.
- If you try to control access to resources with seemingly simple atomic operations, your program is very likely to crash.
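The counting scenario described above can be sketched as follows. `count_with_threads` and its parameters are hypothetical names chosen for this example. The result is always correct, but every `fetch_add` targets the same variable, which is exactly the kind of shared access that can turn out slower than expected.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads threads each increment one shared
// atomic counter `iterations` times; returns the final count.
long count_with_threads(int num_threads, int iterations) {
    std::atomic<long> counter(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.fetch_add(1);  // atomic, so no increment is lost
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter.load();
}
```

With 4 threads and 10000 iterations each, the function returns 40000, whereas a plain `long` incremented without synchronization would be a data race.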
# Cacheline
An atomic instruction is relatively fast when there is no contention or when only one thread accesses the variable. Contention happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs use caches extensively and divide them into multiple levels to get high performance at a low price. The Intel E5-2620, a CPU widely used in Baidu, has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. The L1 and L2 caches are private to each core, while the L3 cache is shared by all cores. Although it is fast for one core to write data into its own L1 cache (4 cycles, 2ns), another core that reads or writes the same address must first see the corresponding cacheline held by that core. To the application this process is atomic and no other instructions can be interleaved: it must wait for [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence) to complete. Cache coherence involves a complicated algorithm that takes much longer than other operations, approximately 700ns on the E5-2620 under heavy contention. So accessing memory that is frequently shared by multiple threads is slow.
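One way to keep threads off each other's cachelines is to give every thread its own counter, aligned so that no two counters fall on the same line. The sketch below assumes a 64-byte cacheline; `kCachelineSize`, `PaddedCounter` and `sum_padded_counters` are hypothetical names introduced for illustration, not part of any library.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cachelines, which is common on x86 but not universal.
constexpr std::size_t kCachelineSize = 64;
constexpr int kNumThreads = 4;

// Aligning the struct rounds its size up to a full cacheline, so adjacent
// array elements never share one.
struct alignas(kCachelineSize) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCachelineSize,
              "each counter should occupy exactly one cacheline");

// Hypothetical helper: each thread increments only its own counter;
// the per-thread counts are summed after all threads have joined.
long sum_padded_counters(int iterations) {
    PaddedCounter counters[kNumThreads];
    std::vector<std::thread> threads;
    for (int t = 0; t < kNumThreads; ++t) {
        threads.emplace_back([&counters, t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counters[t].value.fetch_add(1);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    long total = 0;
    for (auto& c : counters) {
        total += c.value.load();
    }
    return total;
}
```

The total is the same as with a single shared counter, but during the loop each core writes only to a line that no other thread touches. How much this helps depends on the hardware and should be measured.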
To improve performance, we need to avoid making the CPU synchronize cachelines. This affects not only the performance of the atomic instruction itself but also the overall performance of the program. For example, a spinlock still performs poorly in some scenarios with very small critical sections. The problem is that exchange, fetch_add and the other instructions used to implement the spinlock cannot complete until the latest cacheline has been synchronized. Although only a few instructions are involved, it is not surprising that they take a few microseconds.
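To make the spinlock case concrete, here is a minimal sketch of a lock built on `exchange`. `SpinLock` and `count_under_spinlock` are hypothetical names for this example; this is one simple way to write such a lock, not the implementation used by any particular library.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock built on exchange().
class SpinLock {
public:
    void lock() {
        // exchange() returns the previous value: true means another thread
        // holds the lock, so keep trying. Every attempt writes the flag,
        // which requires the cacheline holding it to be synchronized first.
        while (locked_.exchange(true)) {
        }
    }
    void unlock() { locked_.store(false); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical helper: protects a plain (non-atomic) counter with the lock.
long count_under_spinlock(int num_threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // tiny critical section
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter;
}
```

The critical section is a single increment, yet each `lock()` and `unlock()` touches the cacheline that holds `locked_`, so under contention most of the time goes to waiting for that line rather than to the increment.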