Commit a85155eb authored by Zhangyi Chen

Merge branch 'master' of github.com:brpc/brpc

parents 0d03b1c4 fb1bdffa
if [ -z "$BASH" ]; then
    ECHO=echo
else
    ECHO='echo -e'
fi
SYSTEM=$(uname -s)
if [ "$SYSTEM" = "Darwin" ]; then
    ECHO=echo
    SO=dylib
    LDD="otool -L"
    if [ "$(getopt -V)" = " --" ]; then
        >&2 $ECHO "gnu-getopt must be installed and used"
        exit 1
    fi
else
    if [ -z "$BASH" ]; then
        ECHO=echo
    else
        ECHO='echo -e'
    fi
    SO=so
    LDD=ldd
fi
# NOTE: This requires GNU getopt. On Mac OS X and FreeBSD, you have to install this
# separately; see below.
TEMP=`getopt -o v: --long headers:,libs:,cc:,cxx:,with-glog -n 'config_brpc' -- "$@"`
WITH_GLOG=0
@@ -16,8 +28,8 @@ eval set -- "$TEMP"
# Convert to abspath always so that generated mk is include-able from everywhere
while true; do
case "$1" in
--headers ) HDRS_IN="$(readlink -f $2)"; shift 2 ;;
--libs ) LIBS_IN="$(readlink -f $2)"; shift 2 ;;
--headers ) HDRS_IN="$(realpath $2)"; shift 2 ;;
--libs ) LIBS_IN="$(realpath $2)"; shift 2 ;;
--cc ) CC=$2; shift 2 ;;
--cxx ) CXX=$2; shift 2 ;;
--with-glog ) WITH_GLOG=1; shift 1 ;;
@@ -50,7 +62,7 @@ if [ -z "$HDRS_IN" ] || [ -z "$LIBS_IN" ]; then
fi
find_dir_of_lib() {
local lib=$(find ${LIBS_IN} -name "lib${1}.a" -o -name "lib${1}.so" | head -n1)
local lib=$(find ${LIBS_IN} -name "lib${1}.a" -o -name "lib${1}.$SO" | head -n1)
if [ ! -z "$lib" ]; then
dirname $lib
fi
@@ -128,8 +140,8 @@ append_linking $PROTOBUF_LIB protobuf
LEVELDB_LIB=$(find_dir_of_lib_or_die leveldb)
# required by leveldb
if [ -f $LEVELDB_LIB/libleveldb.a ]; then
if [ -f $LEVELDB_LIB/libleveldb.so ]; then
if ldd $LEVELDB_LIB/libleveldb.so | grep -q libsnappy; then
if [ -f $LEVELDB_LIB/libleveldb.$SO ]; then
if $LDD $LEVELDB_LIB/libleveldb.$SO | grep -q libsnappy; then
SNAPPY_LIB=$(find_dir_of_lib snappy)
REQUIRE_SNAPPY="yes"
fi
@@ -247,8 +259,8 @@ if [ -z "$TCMALLOC_LIB" ]; then
else
append_to_output_libs "$TCMALLOC_LIB" " "
if [ -f $TCMALLOC_LIB/libtcmalloc_and_profiler.a ]; then
if [ -f $TCMALLOC_LIB/libtcmalloc.so ]; then
ldd $TCMALLOC_LIB/libtcmalloc.so > libtcmalloc.deps
if [ -f $TCMALLOC_LIB/libtcmalloc.$SO ]; then
$LDD $TCMALLOC_LIB/libtcmalloc.$SO > libtcmalloc.deps
if grep -q libunwind libtcmalloc.deps; then
TCMALLOC_REQUIRE_UNWIND="yes"
REQUIRE_UNWIND="yes"
@@ -283,8 +295,8 @@ if [ $WITH_GLOG != 0 ]; then
else
append_to_output_libs "$GLOG_LIB" " "
if [ -f $GLOG_LIB/libglog.a ]; then
if [ -f "$GLOG_LIB/libglog.so" ]; then
ldd $GLOG_LIB/libglog.so > libglog.deps
if [ -f "$GLOG_LIB/libglog.$SO" ]; then
$LDD $GLOG_LIB/libglog.$SO > libglog.deps
if grep -q libunwind libglog.deps; then
GLOG_REQUIRE_UNWIND="yes"
REQUIRE_UNWIND="yes"
......
@@ -18,7 +18,7 @@
# Cacheline
Atomic operations that face no contention, or that are accessed by only one thread, are relatively fast. "Contention" means multiple threads accessing the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries) at the same time. To get high performance at a low price, modern CPUs make extensive use of caches, divided into multiple levels. The Intel E5-2620, common inside Baidu, has 32K of L1 dcache and icache, 256K of L2 cache and 15M of L3 cache. The L1 and L2 caches are private to each core, while the L3 cache is shared by all cores. Writing into its own L1 cache is extremely fast for a core (4 cycles, 2ns), but when another core reads or writes the same memory, it has to make sure it sees the corresponding cachelines in the other cores. To software this process is atomic: no other code can be interleaved in between, and it can only wait for the CPU to finish the [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence) protocol, a complex algorithm that takes a long time compared to other operations, about 700ns on the E5-2620 under heavy contention. So accessing memory frequently shared by multiple threads is rather slow.
To improve performance, we must keep the CPU from synchronizing cachelines. This is not only about the performance of the atomic instruction itself; it also affects the overall performance of the program. For example, spinlocks still perform poorly in some scenarios with very small critical sections: the exchange, fetch_add and other instructions used to implement a spinlock can only complete after the CPU has synchronized the latest cacheline, so although only a few instructions are involved, it is not surprising for them to take several microseconds. The most effective solution is blunt: **avoid sharing as much as possible**. Eliminating contention at the source is best; contention requires coordination, and coordination is always hard.
......
@@ -11,9 +11,21 @@ brpc depends on following packages:
## Ubuntu/LinuxMint/WSL
### Prepare deps
install common deps: `git g++ make libssl-dev`
install [gflags](https://github.com/gflags/gflags), [protobuf](https://github.com/google/protobuf), [leveldb](https://github.com/google/leveldb), including: `libgflags-dev libprotobuf-dev libprotoc-dev protobuf-compiler libleveldb-dev`. If you need to statically link leveldb, install `libsnappy-dev` as well.
Install common deps:
```
$ sudo apt-get install git g++ make libssl-dev
```
Install [gflags](https://github.com/gflags/gflags), [protobuf](https://github.com/google/protobuf), [leveldb](https://github.com/google/leveldb):
```
$ sudo apt-get install libgflags-dev libprotobuf-dev libprotoc-dev protobuf-compiler libleveldb-dev
```
If you need to statically link leveldb:
```
$ sudo apt-get install libsnappy-dev
```
### Compile brpc
git clone brpc, cd into the repo and run
@@ -36,9 +48,10 @@ Examples link brpc statically, if you need to link the shared version, `make cle
To run examples with cpu/heap profilers, install `libgoogle-perftools-dev` and re-run `config_brpc.sh` before compiling.
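For example (a sketch; rerun `config_brpc.sh` with the same arguments you used the first time, the paths below are only an illustration):
```
$ sudo apt-get install libgoogle-perftools-dev
$ sh config_brpc.sh --headers=/usr/include --libs=/usr/lib
```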
### Run tests
Install libgtest-dev (which is not compiled yet) and run:
Install and compile libgtest-dev (which is not compiled yet):
```shell
sudo apt-get install libgtest-dev
cd /usr/src/gtest && sudo cmake . && sudo make && sudo mv libgtest* /usr/lib/
```
@@ -50,10 +63,15 @@ Rerun `config_brpc.sh`, `make` in test/, and `sh run_tests.sh`
### Prepare deps
install common deps: `git g++ make openssl-devel`
install [gflags](https://github.com/gflags/gflags), [protobuf](https://github.com/google/protobuf), [leveldb](https://github.com/google/leveldb), including: `gflags-devel protobuf-devel protobuf-compiler leveldb-devel`.
Install common deps:
```
sudo yum install git g++ make openssl-devel
```
Install [gflags](https://github.com/gflags/gflags), [protobuf](https://github.com/google/protobuf), [leveldb](https://github.com/google/leveldb):
```
sudo yum install gflags-devel protobuf-devel protobuf-compiler leveldb-devel
```
### Compile brpc
git clone brpc, cd into the repo and run
@@ -88,11 +106,15 @@ Rerun `config_brpc.sh`, `make` in test/, and `sh run_tests.sh`
brpc builds itself to both static and shared libs by default, so it needs static and shared libs of deps to be built as well.
Take [gflags](https://github.com/gflags/gflags) as an example: it does not build a shared lib by default, so you need to pass options to `cmake` to change the behavior, like this: `cmake . -DBUILD_SHARED_LIBS=1 -DBUILD_STATIC_LIBS=1` then `make`.
Take [gflags](https://github.com/gflags/gflags) as an example: it does not build a shared lib by default, so you need to pass options to `cmake` to change the behavior:
```
cmake . -DBUILD_SHARED_LIBS=1 -DBUILD_STATIC_LIBS=1
make
```
### Compile brpc
Continuing with the gflags example, let `../gflags_dev` be where you cloned gflags.
Continuing with the gflags example, let `../gflags_dev` be where gflags is cloned.
git clone brpc. cd into the repo and run
@@ -113,8 +135,6 @@ $ sh config_brpc.sh --headers=.. --libs=..
$ make
```
Note: don't put ~ (tilde) in paths to --headers/--libs, it's not converted.
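For example (a sketch; the paths are placeholders):
```
# The literal '~' may reach the script unexpanded and is NOT converted:
$ sh config_brpc.sh --headers=~/dev/include --libs=~/dev/lib
# Safe: $HOME expands to an absolute path before the script sees it:
$ sh config_brpc.sh --headers="$HOME/dev/include" --libs="$HOME/dev/lib"
```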
# Supported deps
## GCC: 4.8-7.1
......
We all know that locks are needed in multi-threaded programming to avoid potential [race conditions](http://en.wikipedia.org/wiki/Race_condition) when modifying the same data. But in practice it is difficult to write correct code using atomic instructions: race conditions, the [ABA problem](https://en.wikipedia.org/wiki/ABA_problem) and [memory fences](https://en.wikipedia.org/wiki/Memory_barrier) are hard to understand. This article helps you get started with atomic instructions under [SMP](http://en.wikipedia.org/wiki/Symmetric_multiprocessing). [Atomic instructions](http://en.cppreference.com/w/cpp/atomic/atomic) were formally introduced into C++ in C++11.
As the name suggests, atomic instructions cannot be divided into sub-instructions. For example, `x.fetch_add(n)` atomically adds n to x; no intermediate state can be observed. Common atomic instructions include:
| Atomic instruction (x is std::atomic&lt;int&gt;) | Effect |
| ---------------------------------------- | ---------------------------------------- |
| x.load() | return the value of x. |
| x.store(n) | set x to n, return nothing. |
| x.exchange(n) | set x to n, and return the previous value. |
| x.compare_exchange_strong(expected_ref, desired) | if x is equal to expected_ref, set x to desired and return true; otherwise write the current value into expected_ref and return false. |
| x.compare_exchange_weak(expected_ref, desired) | like compare_exchange_strong, but it may fail spuriously (compare [spurious wakeup](http://en.wikipedia.org/wiki/Spurious_wakeup)). |
| x.fetch_add(n), x.fetch_sub(n), x.fetch_xxx(n) | x += n, x -= n (and so on); return the value before modification. |
You can already use these instructions to do atomic counting, for example multiple threads incrementing an atomic variable at the same time to count the number of operations performed on some resources. However, this may cause two problems:
- The operation is not as fast as you expect.
- If you try to control some of the resources through seemingly simple atomic operations, your program has a good chance of crashing.
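As a minimal sketch of such a counter (our own example, not from brpc):
```c++
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> g_count(0);

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back([] {
            for (int j = 0; j < 100000; ++j) {
                // Atomic and correct, but every thread contends for
                // the same cacheline (see the next section).
                g_count.fetch_add(1);
            }
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    printf("count=%d\n", g_count.load());  // always prints 400000
    return 0;
}
```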
# Cacheline
An atomic instruction is relatively fast when there is no contention or when it is accessed by only one thread. Contention happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs make extensive use of caches, divided into multiple levels, to get high performance at a low price. The Intel E5-2620, widely used in Baidu, has 32K of L1 dcache and icache, 256K of L2 cache and 15M of L3 cache. The L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is fast for a core to write data into its own L1 cache (4 cycles, 2ns), when another core reads or writes the same memory, it has to see the corresponding cacheline from the first core. To the application this process is atomic: no instruction can be interleaved in between, and it has to wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), a complicated algorithm that takes much longer than other operations, approximately 700ns on the E5-2620 under high contention. So it is slow to access memory shared by multiple threads.
In order to improve performance, we need to keep the CPU from synchronizing cachelines. This is not only about the performance of the atomic instruction itself, it also affects the overall performance of the program. For example, spinlocks still perform poorly in some scenarios with very small critical sections: the exchange, fetch_add and other instructions used to implement a spinlock can only complete after the latest cacheline has been synchronized, so although only a few instructions are involved, it is not surprising for them to take several microseconds. The most effective solution is blunt: **avoid sharing as much as possible**. Eliminating contention at the source is best, because contention demands coordination, and coordination is always hard.
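One direct application of *avoiding sharing*: give each thread its own counter on a separate cacheline and aggregate only when reading. A hedged sketch (the names and the 64-byte alignment are our assumptions; this mirrors the approach of brpc's bvar):
```c++
#include <atomic>

// Pad each slot to a full cacheline (64 bytes is typical on x86)
// so that threads writing different slots never share a cacheline.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter g_counters[16];  // e.g. one slot per worker thread

inline void add1(int thread_index) {
    // Each thread touches only its own cacheline: no contention.
    g_counters[thread_index].value.fetch_add(1, std::memory_order_relaxed);
}

inline long read_total() {
    // Reading is rare and may be slightly stale; writing stays fast.
    long total = 0;
    for (const auto& c : g_counters) {
        total += c.value.load(std::memory_order_relaxed);
    }
    return total;
}
```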
[memcached](http://memcached.org/) is a widely used cache service today. In order to speed up access to memcached and make full use of bthread concurrency, brpc directly supports the memcached protocol. For examples please refer to [example/memcache_c++](https://github.com/brpc/brpc/tree/master/example/memcache_c++/).
**NOTE**: brpc supports only the binary protocol of memcached. memcached before version 1.3 only speaks the textual protocol, which brpc does not implement since there is little benefit to do so now. If your memcached is earlier than 1.3, please upgrade to the latest version.
Compared to [libmemcached](http://libmemcached.org/libMemcached.html) (the official client), we have advantages in:
- Thread safety. No need to set up a separate client for each thread.
- Support access patterns of synchronous, asynchronous, batch synchronous, batch asynchronous. Can be used with ParallelChannel to enable access combinations.
- Support various [connection types](client.md#Connection Type). Support timeout, backup request, cancellation, tracing, built-in services, and other basic benefits of the RPC framework.
- Have the concept of request/response, while libmemcached doesn't: its users have to do extra maintenance on their own since a received message is not paired with the sent message.
The current implementation takes full advantage of the RPC concurrency mechanism to avoid copying as much as possible. A single client can easily reach the limit of a memcached instance (version 1.4.15) on the same machine: 90,000 QPS with a single connection, 330,000 QPS with multiple connections. In most cases, brpc should be able to make full use of memcached's performance.
# Request to single memcached
Create a `Channel` to access memcached:
```c++
#include <brpc/memcache.h>
#include <brpc/channel.h>
brpc::Channel channel;  // declared here so the snippet is self-contained
brpc::ChannelOptions options;
options.protocol = brpc::PROTOCOL_MEMCACHE;
if (channel.Init("0.0.0.0:11211", &options) != 0) { // 11211 is the default port for memcached
LOG(FATAL) << "Fail to init channel to memcached";
return -1;
}
...
```
Set data into memcached:
```c++
// Set key="hello" value="world" flags=0xdeadbeef, expire in 10s, and ignore cas
brpc::MemcacheRequest request;
brpc::MemcacheResponse response;
brpc::Controller cntl;
if (!request.Set("hello", "world", 0xdeadbeef/*flags*/, 10/*expiring seconds*/, 0/*ignore cas*/)) {
LOG(FATAL) << "Fail to SET request";
return -1;
}
channel.CallMethod(NULL, &cntl, &request, &response, NULL/*done*/);
if (cntl.Failed()) {
LOG(FATAL) << "Fail to access memcached, " << cntl.ErrorText();
return -1;
}
if (!response.PopSet(NULL)) {
LOG(FATAL) << "Fail to SET memcached, " << response.LastError();
return -1;
}
...
```
There are some notes on the above code:
- The class of the request must be `MemcacheRequest`, and `MemcacheResponse` for the response, otherwise `CallMethod` will fail. `stub` is not necessary. Just call `channel.CallMethod` with `method` set to NULL.
- Call `request.XXX()` to add an operation, where `XXX=Set` in this case. Multiple operations on a single request are sent to memcached in batch (often referred to as pipeline mode); see the sketch after these notes.
- Call `response.PopXXX()` to pop the result of an operation, where `XXX=Set` in this case. It returns true on success and false on failure, in which case use `response.LastError()` to get the error message. Operation `XXX` must match the request, otherwise the pop fails. In the above example, a `PopGet` would fail with the error message "not a GET response".
- Results of `Pop` are independent of the RPC result. Even if `Set` fails, the RPC may still be successful. An RPC failure means things like a broken connection or a timeout; *failing to put a value into memcached* is still a successful RPC. As a result, to make sure the whole interaction succeeded, you need to check not only the RPC result but also the result of `PopXXX`.
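For instance, a minimal sketch of the pipelining described above (the keys and values are made up), reusing the `channel` from earlier:
```c++
brpc::MemcacheRequest request;
brpc::MemcacheResponse response;
brpc::Controller cntl;
// Two operations packed into one request are pipelined to memcached.
if (!request.Set("k1", "v1", 0/*flags*/, 10/*expiring seconds*/, 0/*ignore cas*/) ||
    !request.Set("k2", "v2", 0, 10, 0)) {
    LOG(FATAL) << "Fail to add SET operations";
    return -1;
}
channel.CallMethod(NULL, &cntl, &request, &response, NULL/*done*/);
if (cntl.Failed()) {
    LOG(FATAL) << "Fail to access memcached, " << cntl.ErrorText();
    return -1;
}
// Pop results in the same order the operations were added.
if (!response.PopSet(NULL) || !response.PopSet(NULL)) {
    LOG(FATAL) << "Fail to SET memcached, " << response.LastError();
    return -1;
}
```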
Currently our supported operations are:
```c++
bool Set(const Slice& key, const Slice& value, uint32_t flags, uint32_t exptime, uint64_t cas_value);
bool Add(const Slice& key, const Slice& value, uint32_t flags, uint32_t exptime, uint64_t cas_value);
bool Replace(const Slice& key, const Slice& value, uint32_t flags, uint32_t exptime, uint64_t cas_value);
bool Append(const Slice& key, const Slice& value, uint32_t flags, uint32_t exptime, uint64_t cas_value);
bool Prepend(const Slice& key, const Slice& value, uint32_t flags, uint32_t exptime, uint64_t cas_value);
bool Delete(const Slice& key);
bool Flush(uint32_t timeout);
bool Increment(const Slice& key, uint64_t delta, uint64_t initial_value, uint32_t exptime);
bool Decrement(const Slice& key, uint64_t delta, uint64_t initial_value, uint32_t exptime);
bool Touch(const Slice& key, uint32_t exptime);
bool Version();
```
And the corresponding reply operations:
```c++
// Call LastError() of the response to check the error text when any following operation fails.
bool PopGet(IOBuf* value, uint32_t* flags, uint64_t* cas_value);
bool PopGet(std::string* value, uint32_t* flags, uint64_t* cas_value);
bool PopSet(uint64_t* cas_value);
bool PopAdd(uint64_t* cas_value);
bool PopReplace(uint64_t* cas_value);
bool PopAppend(uint64_t* cas_value);
bool PopPrepend(uint64_t* cas_value);
bool PopDelete();
bool PopFlush();
bool PopIncrement(uint64_t* new_value, uint64_t* cas_value);
bool PopDecrement(uint64_t* new_value, uint64_t* cas_value);
bool PopTouch();
bool PopVersion(std::string* version);
```
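As a usage sketch for one of the pairs above, take `Increment`/`PopIncrement` (the key, delta and expiration are arbitrary examples):
```c++
brpc::MemcacheRequest request;
brpc::MemcacheResponse response;
brpc::Controller cntl;
// Add 1 to "my_counter"; create it as 0 if missing, expiring in 60s.
if (!request.Increment("my_counter", 1/*delta*/, 0/*initial_value*/, 60/*exptime*/)) {
    LOG(FATAL) << "Fail to add INCREMENT operation";
    return -1;
}
channel.CallMethod(NULL, &cntl, &request, &response, NULL/*done*/);
if (cntl.Failed()) {
    LOG(FATAL) << "Fail to access memcached, " << cntl.ErrorText();
    return -1;
}
uint64_t new_value = 0;
if (!response.PopIncrement(&new_value, NULL/*cas_value*/)) {
    LOG(FATAL) << "Fail to INCREMENT memcached, " << response.LastError();
    return -1;
}
```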
# Access to memcached cluster
If you want to access a memcached cluster mounted on some naming service, create a `Channel` that uses c_md5 as the load balancing algorithm, and make sure each `MemcacheRequest` contains only one operation, or that all operations fall on the same server, since under the current implementation multiple operations inside a single request are always sent to the same server. For example, if a request contains a number of Get operations whose keys are distributed on different servers, the result is bound to be wrong, in which case you have to split the request according to the key distribution.
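A hedged sketch of initializing such a channel (the naming service URL is a placeholder for your own):
```c++
brpc::ChannelOptions options;
options.protocol = brpc::PROTOCOL_MEMCACHE;
brpc::Channel channel;
// "c_md5" hashes the key consistently so each operation sticks to one server.
if (channel.Init("bns://your-memcached-cluster", "c_md5", &options) != 0) {
    LOG(FATAL) << "Fail to init channel to memcached cluster";
    return -1;
}
```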
Another choice is to follow the common [twemproxy](https://github.com/twitter/twemproxy) approach, which allows the client to access the cluster as if it were a single node, although it requires deploying the proxy and adds extra latency.
\ No newline at end of file
@@ -144,7 +144,7 @@ Call `Clear()` to reuse the `RedisResponse` object.
For now please use [twemproxy](https://github.com/twitter/twemproxy) as a common way to wrap a redis cluster so that it can be accessed just like a single node, in which case you can simply replace hiredis with brpc. Accessing the cluster directly from the client (using consistent hashing) may reduce latency, but at the cost of extra management services. Double check this in the redis documentation.
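For example, a minimal sketch of pointing a brpc channel at twemproxy (the address and port are assumptions about your deployment):
```c++
brpc::ChannelOptions options;
options.protocol = brpc::PROTOCOL_REDIS;
brpc::Channel channel;
// 22121 is twemproxy's commonly used listen port; adjust to your config.
if (channel.Init("127.0.0.1:22121", &options) != 0) {
    LOG(FATAL) << "Fail to init channel to twemproxy";
    return -1;
}
```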
If you maintain a redis cluster by yourself, like the memcached cluster above, it should be accessible with consistent hashing. In general, make sure each `RedisRequest` contains only one command, or that keys from multiple commands fall on the same server, since under the current implementation a request containing multiple commands is always sent to the same server. For example, if a request contains a number of Get commands whose keys are distributed on multiple servers, the result is bound to be wrong, in which case you have to split the request according to the key distribution.
If you maintain a redis cluster by yourself, like the memcached cluster above, it should be accessible with consistent hashing. In general, make sure each `RedisRequest` contains only one command, or that keys from multiple commands fall on the same server, since under the current implementation a request containing multiple commands is always sent to the same server. For example, if a request contains a number of Get commands whose keys are distributed on different servers, the result is bound to be wrong, in which case you have to split the request according to the key distribution.
# Debug
......