@@ -236,6 +236,15 @@ Do distinguish "key" and "attributes" of the request. Don't compute request_code
Check out [Consistent Hashing](consistent_hashing.md) for more details.
### Client-side throttling for recovery from cluster downtime
Cluster downtime refers to the state in which all servers in the cluster are unavailable. Due to the health check mechanism, when the cluster returns to normal, servers go online one by one. When a server comes online, all traffic is sent to it, which may overload the service again. If circuit breaker is enabled, the server may be taken offline again before the other servers come online, and the cluster may never recover. As a solution, brpc provides a client-side throttling mechanism for recovery after cluster downtime. When no server in the cluster is available, the cluster enters the recovery state. Assuming that the minimum number of servers that can serve all requests is min_working_instances and the current number of available servers in the cluster is q, then in the recovery state the client accepts a request with probability q/min_working_instances and discards it otherwise. If q stays unchanged for a period of time (hold_seconds), traffic is sent to all available servers again and the cluster leaves the recovery state. Whether a request was rejected in the recovery state is indicated by whether controller.ErrorCode() equals brpc::ERJECT; rejected requests are not retried by the framework.
This recovery mechanism requires the downstream servers to have similar capacity, so it is currently only valid for rr and random. It is enabled by appending the values of the min_working_instances and hold_seconds parameters to *load_balancer_name*, for example:
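(A minimal sketch; the server list and parameter values are illustrative, and the snippet assumes the parameters are appended to the load-balancer name passed to `Channel::Init`.)

```c++
#include <brpc/channel.h>

int main() {
    brpc::ChannelOptions options;
    brpc::Channel channel;
    // Round-robin plus client-side throttling for recovery from cluster
    // downtime: the cluster needs at least 6 working instances, and leaves
    // the recovery state after q stays unchanged for 10 seconds.
    if (channel.Init("list://127.0.0.1:8000,127.0.0.1:8001",   // illustrative servers
                     "rr:min_working_instances=6 hold_seconds=10",
                     &options) != 0) {
        return -1;
    }
    // Issue RPCs through the channel as usual; requests rejected in the
    // recovery state fail with controller.ErrorCode() == brpc::ERJECT and
    // are not retried by the framework.
    return 0;
}
```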
Servers whose connections are lost are temporarily isolated to prevent them from being selected by LoadBalancer. brpc re-connects isolated servers periodically to test whether they are healthy again. The interval is controlled by the gflag -health_check_interval:
...
...
@@ -579,6 +588,10 @@ Some tips:
Due to maintenance costs, even very large clusters are deployed with "just enough" instances to survive major failures, namely the loss of one IDC, which is at most 1/2 of all machines. However, aggressive retries can easily double or even triple the pressure from all clients on servers and bring the whole cluster down: more and more requests get stuck in buffers because servers cannot process them in time, and all requests have to wait for a very long time before finally timing out, as if the whole cluster had crashed. The default retrying policy is generally safe: unless the connection is broken, retries are rarely sent. However, users are able to customize the starting conditions for retries by inheriting RetryPolicy, which may turn retries into "a storm". When you customize RetryPolicy, carefully consider how clients and servers interact and design corresponding tests to verify that retries work as expected.
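As an illustration, here is a minimal sketch of a customized policy, assuming brpc's `RetryPolicy` interface with `DoRetry(const brpc::Controller*)` and the `ChannelOptions.retry_policy` field; it adds one extra retry condition and delegates everything else to the default policy:

```c++
#include <brpc/channel.h>
#include <brpc/retry_policy.h>

// Also retry when the server reports being overcrowded (an illustrative
// condition only). Conditions like this can easily turn retries into
// "a storm" under cluster-wide pressure, so verify the behavior with tests.
class MyRetryPolicy : public brpc::RetryPolicy {
public:
    bool DoRetry(const brpc::Controller* cntl) const override {
        if (cntl->ErrorCode() == brpc::EOVERCROWDED) {
            return true;
        }
        // Leave other cases to the default policy.
        return brpc::DefaultRetryPolicy()->DoRetry(cntl);
    }
};

// The policy must outlive the Channel, so a static instance is typical.
static MyRetryPolicy g_my_retry_policy;

void InitChannelWithCustomRetry(brpc::Channel* channel) {
    brpc::ChannelOptions options;
    options.retry_policy = &g_my_retry_policy;
    options.max_retry = 3;
    channel->Init("list://127.0.0.1:8000", "rr", &options);  // illustrative target
}
```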
## Circuit breaker
Check out [circuit_breaker](../cn/circuit_breaker.md) for more details.
## Protocols
The default protocol used by Channel is baidu_std, which is changeable by setting ChannelOptions.protocol. The field accepts both enum and string.
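For instance, a brief sketch of both forms (relying on the enum/string duality stated above; `PROTOCOL_BAIDU_STD` is assumed to be the enum counterpart of "baidu_std"):

```c++
#include <brpc/channel.h>

brpc::ChannelOptions options;
// Select the protocol by enum ...
options.protocol = brpc::PROTOCOL_BAIDU_STD;
// ... or equivalently by name.
options.protocol = "baidu_std";
```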