@@ -11,9 +11,21 @@ brpc depends on following packages:
## Ubuntu/LinuxMint/WSL
### Prepare deps
Install common deps: `git g++ make libssl-dev`
Install [gflags](https://github.com/gflags/gflags), [protobuf](https://github.com/google/protobuf) and [leveldb](https://github.com/google/leveldb) through these packages: `libgflags-dev libprotobuf-dev libprotoc-dev protobuf-compiler libleveldb-dev`. If you need to statically link leveldb, install `libsnappy-dev` as well.
@@ -88,11 +106,15 @@ Rerun `config_brpc.sh`, `make` in test/, and `sh run_tests.sh`
brpc builds itself to both static and shared libs by default, so it needs static and shared libs of deps to be built as well.
Take [gflags](https://github.com/gflags/gflags) as an example: it does not build a shared lib by default, so you need to pass options to `cmake` to change the behavior, like this: `cmake . -DBUILD_SHARED_LIBS=1 -DBUILD_STATIC_LIBS=1`, then run `make`.
We all know that locks are needed in multi-threaded programming to avoid potential [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying the same data. But in practice, it is difficult to write correct code using atomic instructions instead: race conditions, the [ABA problem](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are all hard to understand. This article helps you get started by introducing atomic instructions under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). [Atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) were formally introduced in C++11.
As the name suggests, atomic instructions cannot be divided into sub-instructions. For example, `x.fetch_add(n)` atomically adds n to x, and no intermediate state can be observed. Common atomic instructions include:
| Atomic instruction (type of x is `std::atomic<int>`) | Effect |
| --- | --- |
| `x.exchange(n)` | Sets x to n and returns the previous value. |
| `x.compare_exchange_strong(expected_ref, desired)` | If x equals expected_ref, sets x to desired and returns true. Otherwise writes the current value of x to expected_ref and returns false. |
| `x.compare_exchange_weak(expected_ref, desired)` | Same as compare_exchange_strong, except that it may fail spuriously, i.e. return false even when x equals expected_ref (compare [spurious wakeup](http://en.wikipedia.org/wiki/Spurious_wakeup)). |
| `x.fetch_add(n)`, `x.fetch_sub(n)`, `x.fetch_xxx(n)` | x += n, x -= n (or another fetch_xxx operation); returns the value before modification. |
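
The semantics in the table can be checked with a small single-threaded sketch. The function name `atomic_ops_demo` is made up for illustration, and `x.load()`, which is not listed in the table, is the standard `std::atomic` member that reads the current value.

```c++
#include <atomic>
#include <cassert>

// Exercises the operations from the table on one thread and returns the
// final value of x so that the effects can be checked.
int atomic_ops_demo() {
    std::atomic<int> x(1);

    int prev = x.exchange(5);  // x: 1 -> 5, returns the previous value
    assert(prev == 1);

    int expected = 5;
    bool ok = x.compare_exchange_strong(expected, 7);  // x == expected, x becomes 7
    assert(ok && x.load() == 7);

    expected = 5;  // stale expectation
    ok = x.compare_exchange_strong(expected, 9);  // fails, current value written back
    assert(!ok && expected == 7 && x.load() == 7);

    // compare_exchange_weak may fail spuriously, so it is normally retried in a loop.
    expected = x.load();
    while (!x.compare_exchange_weak(expected, expected * 2)) {
        // On failure, expected holds the latest value of x; retry with it.
    }
    // x: 7 -> 14

    prev = x.fetch_add(3);  // x: 14 -> 17, returns 14
    assert(prev == 14);
    prev = x.fetch_sub(2);  // x: 17 -> 15, returns 17
    assert(prev == 17);
    return x.load();
}
```

Because the weak variant is allowed to fail even when the comparison would succeed, it is only useful inside a retry loop like the one above.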
You can already use these instructions to do atomic counting, for example by having multiple threads add to one atomic variable to count the operations they perform on some resource. But this may cause two problems:
- The operation is not as fast as you expect.
- If you try to control access to resources through seemingly simple atomic operations, your program is quite likely to crash.
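The atomic counting mentioned above can be sketched as follows. This is a minimal illustration, not code from brpc: the name `count_operations` and its parameters are assumptions chosen for the example.

```c++
#include <atomic>
#include <thread>
#include <vector>

// Each of num_threads threads increments one shared atomic counter
// ops_per_thread times. Returns the final count.
long count_operations(int num_threads, int ops_per_thread) {
    std::atomic<long> counter(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&counter, ops_per_thread] {
            for (int j = 0; j < ops_per_thread; ++j) {
                counter.fetch_add(1);  // atomic, so no increment is lost
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return counter.load();
}
```

The result is always exact, but every `fetch_add` targets the same variable, which is the situation where the first problem in the list (speed) shows up.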
# Cacheline
An atomic instruction is relatively fast when there is no contention or when only one thread accesses it. Contention happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs use caches extensively and split them into multiple levels to get high performance at a low price. The Intel E5-2620, a CPU widely used at Baidu, has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. The L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is fast for one core to write data into its own L1 cache (4 cycles, 2ns), that data must also become visible to other cores when they read from or write to the corresponding address. To the application, this process is atomic and no other instructions can be interleaved with it. The application must wait for [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence) to complete, which takes longer than other operations: it involves a complicated algorithm that takes approximately 700ns on the E5-2620 under high contention. Accessing memory shared by multiple threads is therefore slow.
To improve performance, we need to avoid synchronizing cachelines between cores. This affects not only the performance of the atomic instruction itself but also the overall performance of the program. For example, a spinlock still performs poorly in some scenarios with small critical sections, because `exchange`, `fetch_add` and the other instructions used to implement a spinlock can only execute after the latest cacheline has been synchronized. Although only a few instructions are involved, it is not surprising for them to take a few microseconds.
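One common way to avoid cacheline synchronization is to give each thread its own counter on its own cacheline and to sum the counters only when the total is needed. The sketch below assumes 64-byte cachelines (typical on x86-64, but not guaranteed); `PaddedCounter`, `kCachelineSize`, `kMaxThreads` and `count_with_padded_counters` are hypothetical names, not brpc APIs.

```c++
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: cachelines are 64 bytes.
constexpr std::size_t kCachelineSize = 64;
constexpr int kMaxThreads = 8;

// alignas pads each counter to a full cacheline, so two counters never
// share one and writers do not invalidate each other's cachelines.
struct alignas(kCachelineSize) PaddedCounter {
    std::atomic<long> value{0};
};

long count_with_padded_counters(int num_threads, int ops_per_thread) {
    if (num_threads > kMaxThreads) {
        num_threads = kMaxThreads;
    }
    PaddedCounter counters[kMaxThreads];  // a stack array honors alignas
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        PaddedCounter* mine = &counters[i];
        threads.emplace_back([mine, ops_per_thread] {
            for (int j = 0; j < ops_per_thread; ++j) {
                mine->value.fetch_add(1);  // only this thread writes this cacheline
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    long total = 0;
    for (int i = 0; i < num_threads; ++i) {
        total += counters[i].value.load();
    }
    return total;
}
```

The trade-off is that reading the total becomes more expensive, since it has to visit every per-thread counter.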
[memcached](http://memcached.org/) is a common cache service today. To speed up access to memcached and make full use of bthread concurrency, brpc directly supports the memcached protocol. For examples please refer to: [example/memcache_c++](https://github.com/brpc/brpc/tree/master/example/memcache_c++/)
**NOTE**: brpc only supports the binary protocol of memcached, not the textual protocol used before version 1.3, since there is little benefit in supporting the latter now. If your memcached is older than 1.3, please upgrade to the latest version.
Compared to [libmemcached](http://libmemcached.org/libMemcached.html) (the official client), brpc has the following advantages:
- Thread safety. No need to set up a separate client for each thread.
- Supports synchronous, asynchronous, batch synchronous and batch asynchronous access patterns. Can be used with ParallelChannel to combine accesses.
- Supports various [connection types](client.md#Connection Type). Supports timeout, backup request, cancellation, tracing, built-in services, and other basic benefits of the RPC framework.
- Has the concept of request and response. libmemcached does not, so its users must do extra bookkeeping to match received messages with the messages they sent.
The current implementation takes full advantage of the RPC concurrency mechanism and avoids copying as much as possible. A single client can easily reach the limit of a memcached instance (version 1.4.15) on the same machine: 90,000 QPS for a single connection and 330,000 QPS for multiple connections. In most cases, brpc should be able to make full use of memcached's performance.
# Request to single memcached
Create a `Channel` to access memcached:
```c++
#include <brpc/memcache.h>
#include <brpc/channel.h>
brpc::Channel channel;
brpc::ChannelOptions options;
options.protocol = brpc::PROTOCOL_MEMCACHE;
if (channel.Init("0.0.0.0:11211", &options) != 0) {  // 11211 is the default port for memcached
    LOG(FATAL) << "Fail to init channel to memcached";
    return -1;
}
...
```
Set data to memcached:
```c++
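// Set key="hello" value="world" flags=0xdeadbeef, expire in 10s, and ignore cas
brpc::MemcacheRequest request;
brpc::MemcacheResponse response;
brpc::Controller cntl;
if (!request.Set("hello", "world", 0xdeadbeef/*flags*/, 10/*expiring seconds*/, 0/*ignore cas*/)) {
    LOG(FATAL) << "Fail to SET request";
    return -1;
}
channel.CallMethod(NULL, &cntl, &request, &response, NULL/*done*/);
if (cntl.Failed()) {
    LOG(FATAL) << "Fail to access memcached, " << cntl.ErrorText();
    return -1;
}
if (!response.PopSet(NULL)) {
    LOG(FATAL) << "Fail to SET memcached, " << response.LastError();
    return -1;
}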
...
```
There are some notes on the above code:
- The class of the request must be `MemcacheRequest`, and the class of the response must be `MemcacheResponse`, otherwise `CallMethod` will fail. A `stub` is not necessary: just call `channel.CallMethod` with `method` set to NULL.
- Call `request.XXX()` to add an operation, where `XXX=Set` in this case. Multiple operations in a single request are sent to memcached in a batch (often referred to as pipeline mode).
- Call `response.PopXXX()` to pop the result of an operation, where `XXX=Set` in this case. It returns true on success and false on failure, in which case `response.LastError()` gives the error message. `XXX` must match the operation in the request, otherwise the call fails. In the above example, a `PopGet` would fail with the error message "not a GET response".
- The results of `Pop` are independent of the RPC result. Even if `Set` fails, the RPC may still succeed. RPC failure means things like a broken connection or a timeout. *Cannot put a value into memcached* is still a successful RPC. As a result, to make sure the entire process succeeded, you need to check not only the RPC but also `PopXXX`.
If you want to access a memcached cluster mounted on some naming service, you should create a `Channel` that uses c_md5 as the load balancing algorithm and make sure that each `MemcacheRequest` contains only one operation, or that all its operations fall on the same server. This is because, under the current implementation, multiple operations inside a single request are always sent to the same server. For example, if a request contains several GET operations whose keys are distributed across different servers, the result will be wrong. In that case you have to split the request according to key distribution.
Another choice is to follow the common [twemproxy](https://github.com/twitter/twemproxy) style. This lets the client access the cluster just like a single node, although it requires deploying the proxy and adds latency.
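For the first option, splitting a request according to key distribution could look like the sketch below. It is only an illustration under loud assumptions: `fnv1a` and `group_keys_by_server` are hypothetical helpers, and the modulo placement is a stand-in that will not match the servers brpc's c_md5 load balancer actually picks.

```c++
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in hash (FNV-1a). It is NOT the hash used by c_md5.
inline uint32_t fnv1a(const std::string& key) {
    uint32_t h = 2166136261u;
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Splits keys into one batch per server, so that every batch only holds
// keys mapped to the same server. Each batch can then become one request.
std::vector<std::vector<std::string>> group_keys_by_server(
        const std::vector<std::string>& keys, std::size_t num_servers) {
    std::vector<std::vector<std::string>> batches(num_servers);
    if (num_servers == 0) {
        return batches;
    }
    for (const std::string& key : keys) {
        batches[fnv1a(key) % num_servers].push_back(key);
    }
    return batches;
}
```

Each non-empty batch would then be sent as its own request, so no request mixes keys that live on different servers.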
@@ -144,7 +144,7 @@ Call `Clear()` to reuse the `RedisRespones` object.
For now, please use [twemproxy](https://github.com/twitter/twemproxy) as the common way to wrap a redis cluster so that it can be used just like a single-node proxy, in which case you can simply replace hiredis with brpc. Accessing the cluster directly from the client (using consistent hashing) may reduce latency, but at the cost of other management services. Make sure to double-check this in the redis documentation.
If you maintain a redis cluster yourself, as with memcached, it should be accessible using consistent hashing. In general, you have to make sure that each `RedisRequest` contains only one command, or that the keys of all its commands fall on the same server, because under the current implementation a request containing multiple commands is always sent to a single server. For example, if a request contains several GET commands whose keys are distributed across different servers, the result will be wrong. In that case you have to split the request according to key distribution.