We know that locks are extensively used in multi-threaded programming to avoid [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying the same data. When a lock becomes a bottleneck, we try to work around it with atomic instructions. But in practice it is difficult to write correct code with atomic instructions, and it is hard to understand race conditions, [ABA problems](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier). This article introduces some basics of atomic instructions (under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing)). Since [atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) were formally introduced in C++11, we use the C++11 APIs directly.
As the name implies, an atomic instruction cannot be divided into sub-instructions. For example, `x.fetch_add(n)` adds n to x atomically: no intermediate state is observable to software. Common atomic instructions are listed below:
| Atomic Instructions (type of x is std::atomic\<int\>) | Descriptions |
| ------------------------------------------------------ | ------------ |
| x.exchange(n) | set x to n and return the value just before the set |
| x.compare_exchange_strong(expected_ref, desired) | if x equals expected_ref, set x to desired and return true; otherwise write the current value of x into expected_ref and return false |
| x.compare_exchange_weak(expected_ref, desired) | like compare_exchange_strong, but may fail [spuriously](http://en.wikipedia.org/wiki/Spurious_wakeup) even when x equals expected_ref, so it is typically used in a loop |
| x.fetch_add(n), x.fetch_sub(n) | do x += n, x -= n atomically and return the value just before the modification |
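
Read-modify-write operations that are not provided directly are commonly built from compare_exchange_weak in a retry loop, which also shows why the weak form is acceptable despite spurious failures. Below is a minimal sketch; `atomic_double` is just an illustrative name:

```c++
#include <atomic>

// Atomically multiply x by 2 using a CAS loop.
// compare_exchange_weak may fail spuriously, so it is always retried in a
// loop; on failure it writes the latest value of x back into `expected`.
void atomic_double(std::atomic<int>& x) {
    int expected = x.load();
    while (!x.compare_exchange_weak(expected, expected * 2)) {
        // `expected` now holds the current value of x; retry with it.
    }
}
```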
You can already use these instructions to count things atomically, for example letting multiple threads increment a shared counter to record the number of operations on some resources (a minimal counting sketch follows the list below). However, two problems may arise:
- The operation is not as fast as expected.
- Even if multi-threaded accesses to some resources are controlled by a few atomic operations that seem correct, the program still has a good chance to crash.
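
The counting usage mentioned above, as a minimal sketch; the thread count and loop bound are arbitrary illustrative values:

```c++
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> g_count(0);   // shared counter, no lock required

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {              // 4 threads, arbitrary choice
        threads.emplace_back([] {
            for (int j = 0; j < 100000; ++j) {
                g_count.fetch_add(1);          // atomic increment
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    printf("count=%d\n", g_count.load());      // always prints 400000
    return 0;
}
```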
# Cacheline
An atomic instruction is fast when there is no contention, i.e. it is accessed by only one thread. "Contention" happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs use caches extensively and divide them into multiple levels to get high performance at a low price. The Intel E5-2620, widely used in Baidu, has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is very fast for a core to write data into its own L1 cache (4 cycles, ~2ns), making that data visible to other cores is not, because the cachelines touched by the data need to be synchronized to those cores. This process is atomic and transparent to software, and no instructions can be interleaved in between, but applications have to wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes much longer than writing the local cache. The coherence protocol is complicated and makes atomic instructions slow under high contention: a single fetch_add may take more than 700ns on an E5-2620 when a few threads contend heavily on it. In general, accesses to memory that is frequently shared and modified by multiple threads are not fast. For example, even if a critical section is small, a spinlock protecting it may still perform badly, because the instructions used to implement the spinlock, such as exchange and fetch_add, must wait for the latest cachelines; it is not surprising to see one or two such instructions take several microseconds.
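
To make this concrete, here is a minimal sketch of a spinlock built on exchange (`SpinLock` is an illustrative name, not a real API): every lock() has to pull the cacheline holding the flag over to the locking core, which is exactly the synchronization cost described above.

```c++
#include <atomic>

// Minimal illustrative spinlock, not a production implementation.
// Default (seq_cst) memory ordering is used for simplicity.
class SpinLock {
public:
    void lock() {
        // exchange returns the previous value: spin until we flip 0 -> 1.
        // Each retry may force the cacheline to bounce between cores.
        while (_locked.exchange(1) == 1) {
        }
    }
    void unlock() {
        _locked.store(0);
    }
private:
    std::atomic<int> _locked{0};   // 0 = unlocked, 1 = locked
};
```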
To improve performance, we need to avoid synchronizing cachelines frequently, which affects not only the performance of the atomic instruction itself but also the overall performance of the program. The most effective solution is straightforward: **avoid sharing as much as possible**. Avoiding contention from the very beginning is the best strategy:
- A program relying on a global multiple-producer-multiple-consumer (MPMC) queue is hard to scale well on many cores, since the throughput of the queue is limited by delays of cache coherence rather than the number of cores. It is better to use multiple SPMC or MPSC queues, or even SPSC queues, to avoid contention from the beginning.
- Another example is a global counter. If all threads modify a global variable frequently, performance will be poor because all cores are busy synchronizing the same cacheline. If the counter is only used for printing logs periodically or something like that, we can let each thread modify its own thread-local variable and combine all the thread-local data when a read is needed, resulting in a [much better performance](bvar.md). A sketch of this idea follows this list.
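
A minimal sketch of the combining idea, not how bvar implements it: each thread writes its own slot, padded to a cacheline so slots never share one, and a reader sums the slots. The fixed per-thread indices and `kMaxThreads` are simplifying assumptions.

```c++
#include <atomic>

constexpr int kMaxThreads = 64;    // assumed upper bound on threads

struct alignas(64) PaddedCounter { // 64 = typical cacheline size;
    std::atomic<long> value{0};    // padding keeps slots on separate lines
};

PaddedCounter g_counters[kMaxThreads];

// Hot path: each thread only touches its own cacheline, so no coherence
// traffic is triggered by other threads' increments.
void add(int thread_index, long n) {
    // relaxed is enough: we only need atomicity of the addition itself.
    g_counters[thread_index].value.fetch_add(n, std::memory_order_relaxed);
}

// Cold path: a reader sums all slots when the value is actually needed.
long read_all() {
    long sum = 0;
    for (int i = 0; i < kMaxThreads; ++i) {
        sum += g_counters[i].value.load(std::memory_order_relaxed);
    }
    return sum;
}
```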