Commit e7db7387 authored by zhujiashun's avatar zhujiashun

refine docs for cluster recover

parent 2ce30e82
...@@ -236,7 +236,7 @@ locality-aware,优先选择延时低的下游,直到其延时高于其他机 ...@@ -236,7 +236,7 @@ locality-aware,优先选择延时低的下游,直到其延时高于其他机
### 从集群宕机后恢复时的客户端限流 ### 从集群宕机后恢复时的客户端限流
集群宕机指的是集群中所有server都处于不可用的状态。由于健康检查机制,当集群恢复正常后,server会间隔性地上线。当某一个server上线后,所有的流量会发送过去,可能导致服务再次过载。若熔断开启,则可能导致其它server上线前该server再次熔断,集群永远无法恢复。作为解决方案,brpc提供了在集群宕机后恢复时的限流机制:当集群中没有可用server时,集群进入恢复状态,假设根据峰值QPS估算出能服务所有请求的最小server数量为min_working_instances,当前集群可用的server数量为q,则在恢复状态时,client接受请求的概率为q/min_working_instances,否则丢弃;若一段时间hold_seconds内q保持不变,则把流量重新发送全部可用的server上,并离开恢复状态。在恢复阶段时,可以通过判断controller.ErrorCode()是否等于brpc::ERJECT来判断该次请求是否被拒绝,被拒绝的请求不会被框架重试。 集群宕机指的是集群中所有server都处于不可用的状态。由于健康检查机制,当集群恢复正常后,server会间隔性地上线。当某一个server上线后,所有的流量会发送过去,可能导致服务再次过载。若熔断开启,则可能导致其它server上线前该server再次熔断,集群永远无法恢复。作为解决方案,brpc提供了在集群宕机后恢复时的限流机制:当集群中没有可用server时,集群进入恢复状态,假设正好能服务所有请求的server数量为min_working_instances,当前集群可用的server数量为q,则在恢复状态时,client接受请求的概率为q/min_working_instances,否则丢弃;若一段时间hold_seconds内q保持不变,则把流量重新发送全部可用的server上,并离开恢复状态。在恢复阶段时,可以通过判断controller.ErrorCode()是否等于brpc::ERJECT来判断该次请求是否被拒绝,被拒绝的请求不会被框架重试。
此恢复机制要求下游server的能力是类似的,所以目前只针对rr和random有效,开启方式是在*load_balancer_name*后面加上min_working_instances和hold_seconds参数的值,例如: 此恢复机制要求下游server的能力是类似的,所以目前只针对rr和random有效,开启方式是在*load_balancer_name*后面加上min_working_instances和hold_seconds参数的值,例如:
```c++ ```c++
......
...@@ -238,7 +238,7 @@ Check out [Consistent Hashing](consistent_hashing.md) for more details. ...@@ -238,7 +238,7 @@ Check out [Consistent Hashing](consistent_hashing.md) for more details.
### Client-side throttling for recovery from cluster downtime ### Client-side throttling for recovery from cluster downtime
Cluster downtime refers to the state in which all servers in the cluster are unavailable. Due to the health check mechanism, when the cluster returns to normal, server will go online one by one. When a server is online, all traffic will be sent to it, which may cause the service to be overloaded again. If circuit breaker is enabled, server may be offline again before the other servers go online, and the cluster can never be recovered. As a solution, brpc provides a client-side throttling mechanism for recovery after cluster downtime. When no server is available in the cluster, the cluster enters a recovery state. Assuming that the minimum number of servers that can serve all requests based on peak QPS is min_working_instances, current The number of servers available for the cluster is q, then in the recovery state, the probability of client accepting the request is q/min_working_instances, otherwise it is discarded. If q remains unchanged for a period of time(hold_seconds), the traffic is resent to all available servers and leaves recovery state. Whether the request is rejected in the recovery state is indicated by whether controller.ErrorCode() is equal to brpc::ERJECT, and the rejected request will not be retried by the framework. Cluster downtime refers to the state in which all servers in the cluster are unavailable. Due to the health check mechanism, when the cluster returns to normal, server will go online one by one. When a server is online, all traffic will be sent to it, which may cause the service to be overloaded again. If circuit breaker is enabled, server may be offline again before the other servers go online, and the cluster can never be recovered. As a solution, brpc provides a client-side throttling mechanism for recovery after cluster downtime. When no server is available in the cluster, the cluster enters recovery state. Assuming that the minimum number of servers that can serve all requests is min_working_instances, current number of servers available in the cluster is q, then in recovery state, the probability of client accepting the request is q/min_working_instances, otherwise it is discarded. If q remains unchanged for a period of time(hold_seconds), the traffic is resent to all available servers and leaves recovery state. Whether the request is rejected in recovery state is indicated by whether controller.ErrorCode() is equal to brpc::ERJECT, and the rejected request will not be retried by the framework.
This recovery mechanism requires the capabilities of downstream servers to be similar, so it is currently only valid for rr and random. The way to enable it is to add the values of min_working_instances and hold_seconds parameters after *load_balancer_name*, for example: This recovery mechanism requires the capabilities of downstream servers to be similar, so it is currently only valid for rr and random. The way to enable it is to add the values of min_working_instances and hold_seconds parameters after *load_balancer_name*, for example:
```c++ ```c++
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment