Only 3 (major) user headers: [Server](https://github.com/brpc/brpc/blob/master/src/brpc/server.h), [Channel](https://github.com/brpc/brpc/blob/master/src/brpc/channel.h) and [Controller](https://github.com/brpc/brpc/blob/master/src/brpc/controller.h), corresponding to server-side, client-side and parameter-set respectively.
* Tweak parameters? Check out [brpc/controller.h](https://github.com/brpc/brpc/blob/master/src/brpc/controller.h). Note that the class is shared by server and channel. Methods are separated into 3 parts: client-side, server-side and both-side.
We tried to make simple things simple. Take naming service as an example. In older RPC implementations you may need to copy a pile of obscure code to make it work; in brpc, accessing BNS is expressed as `Init("bns://node-name", ...)`, DNS is `Init("http://domain-name", ...)` and a local machine list is `Init("file:///home/work/server.list", ...)`. Without any explanation, you know what it means.
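For instance, a minimal sketch of such an Init() call (the file path, the "rr" load balancer and the option value below are illustrative, not mandated by the text above):

```c++
#include <brpc/channel.h>
#include <butil/logging.h>

brpc::Channel channel;
brpc::ChannelOptions options;
options.timeout_ms = 100;  // illustrative value
// The naming service is just a URL-like string; swap in "bns://..." or "http://..." as needed.
if (channel.Init("file:///home/work/server.list", "rr", &options) != 0) {
    LOG(ERROR) << "Fail to initialize channel";
}
```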
You can already use these instructions to do atomic counting, such as multiple threads incrementing a shared counter at the same time.
An atomic instruction is relatively fast when there is no contention or only one thread accesses it. Contention happens when multiple threads access the same [cacheline](https://en.wikipedia.org/wiki/CPU_cache#Cache_entries). Modern CPUs make extensive use of caches and divide them into multiple levels to get high performance at a low price. The Intel E5-2620, widely used inside Baidu, has 32K L1 dcache and icache, 256K L2 cache and 15M L3 cache. L1 and L2 caches are owned by each core, while the L3 cache is shared by all cores. Although it is fast for one core to write data into its own L1 cache (4 cycles, ~2ns), the data in that L1 cache must also be seen by other cores when they read or write the corresponding address. To applications, this process is atomic and no instructions can be interleaved in between. The application must wait for the completion of [cache coherence](https://en.wikipedia.org/wiki/Cache_coherence), which takes much longer than other operations. It involves a complicated hardware algorithm and takes approximately 700ns on the E5-2620 when highly contended. So accessing memory shared by multiple threads is slow.
In order to improve performance, we need to avoid synchronizing cachelines between CPU cores. This is not only related to the performance of the atomic instruction itself, but also affects the overall performance of the program. For example, spinlocks still perform poorly in some scenarios with small critical sections. The problem is that exchange, fetch_add and other instructions used to implement spinlocks must be executed after the latest cacheline has been synchronized; although only a few instructions are involved, it is not surprising that they take a few microseconds. The most effective solution is straightforward: **avoid sharing as much as possible**. Avoiding contention from the beginning is the best.
- A program using a global multiple-producer-multiple-consumer (MPMC) queue is unlikely to scale well over multiple cores, since the maximum throughput of the queue is limited by the latency of CPU cache synchronization rather than the number of cores. It is a best practice to use multiple SPMC or MPSC queues, or even multiple SPSC queues instead, avoiding contention from the beginning.
- Another example is a global counter. If all threads modify a global variable frequently, the performance would be poor because all cores are busy synchronizing the same cacheline, as shown in the sketch below.
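A minimal sketch of the idea (names and the shard count are illustrative, not from brpc): instead of one shared atomic counter, give each thread its own cacheline-aligned slot and only combine the slots when reading.

```c++
#include <atomic>
#include <cstddef>

// Contended: every thread bounces the same cacheline between cores.
std::atomic<long> g_shared_counter{0};

void add_contended() {
    g_shared_counter.fetch_add(1, std::memory_order_relaxed);
}

// Sharded: each slot sits on its own cacheline, so writers rarely contend.
struct alignas(64) Slot {
    std::atomic<long> value{0};
};
Slot g_slots[64];  // shard count is illustrative

void add_sharded(size_t thread_index) {
    // Relaxed ordering is enough for pure counting.
    g_slots[thread_index % 64].value.fetch_add(1, std::memory_order_relaxed);
}

long read_sharded() {
    // Reads are slower: they must combine all shards.
    long sum = 0;
    for (const Slot& s : g_slots) {
        sum += s.value.load(std::memory_order_relaxed);
    }
    return sum;
}
```

Writes stay core-local at the cost of slower, combined reads, which is the same trade-off bvar (introduced later in this document) makes with thread-local storage.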
A service is not available until it is inserted into [brpc.Server](https://github.com/brpc/brpc/blob/master/src/brpc/server.h).
When a client sends a request, Echo() is called. Meaning of the parameters:
**controller**
Statically convertible to brpc::Controller (provided the code runs inside brpc.Server). It contains parameters that cannot be carried by the request or the response; check out [src/brpc/controller.h](https://github.com/brpc/brpc/blob/master/src/brpc/controller.h) for details.
**request**
The read-only request message from the client.
**response**
Filled by the user. If any **required** field is unset, the RPC fails.
**done**
Created by brpc and passed to the service's CallMethod(). It includes all actions after CallMethod(): validating the response, serialization, packing, sending, etc.
**No matter whether the RPC succeeds or not, done->Run() must be called after processing.**
Why doesn't brpc call done automatically? Because this allows users to store done and call done->Run() later, triggered by some event after CallMethod() returns, which makes the service **asynchronous**.
We strongly recommend using **ClosureGuard** to make sure done->Run() is always called. It's the first statement in the code snippet above:
```c++
brpc::ClosureGuard done_guard(done);
```
No matter whether the callback returns from the middle or the end, done_guard is destructed, and done->Run() is called inside its destructor. This mechanism is called [RAII](https://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization). Without done_guard, you have to add done->Run() before each return, **which is very easy to forget**.
In an asynchronous service, processing of the request is not completed when CallMethod() returns, so done->Run() should not be called there; instead it should be preserved for later use. At first glance, we don't need ClosureGuard here. However, in real applications, an asynchronous service may still fail in the middle and exit CallMethod() for many reasons. Without ClosureGuard, some error branches may forget to call done->Run() before returning. Thus we still recommend using done_guard in asynchronous services. Different from synchronous services, to prevent done->Run() from being called at successful returns, you should call done_guard.release() to release the enclosed done.
How synchronous and asynchronous services generally handle done:
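A minimal sketch of both patterns (FooService, Foo1 and Foo2 are illustrative names, not part of the snippets above):

```c++
class MyFooService : public FooService {
public:
    // Synchronous: done_guard calls done->Run() when Foo1() returns, on any path.
    void Foo1(google::protobuf::RpcController* cntl_base,
              const FooRequest* request,
              FooResponse* response,
              google::protobuf::Closure* done) {
        brpc::ClosureGuard done_guard(done);
        // ... fill response ...
    }

    // Asynchronous: keep done and call done->Run() later from another thread/event.
    void Foo2(google::protobuf::RpcController* cntl_base,
              const FooRequest* request,
              FooResponse* response,
              google::protobuf::Closure* done) {
        brpc::ClosureGuard done_guard(done);
        // ... hand (response, done) over to an event handler ...
        done_guard.release();  // prevent done->Run() from being called here
    }
};
```

The interface of ClosureGuard: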
```c++
class ClosureGuard {
public:
    // Constructed with a closure which will be Run() in the destructor.
    explicit ClosureGuard(google::protobuf::Closure* done);
    // Call Run() of internal closure if it's not NULL.
    ~ClosureGuard();
    // Call Run() of internal closure if it's not NULL and set it to `done'.
    void reset(google::protobuf::Closure* done);
    // Set internal closure to NULL and return the one before set.
    google::protobuf::Closure* release();
};
```
## Set RPC to be failed
Calling Controller.SetFailed() marks the RPC as failed; if an error occurs while sending the response, brpc calls the method as well. Users generally call the method inside the service's CallMethod(). For example, if a processing stage fails, the user may call SetFailed(), make sure done->Run() is called, and then quit CallMethod() (if ClosureGuard is used, done->Run() does not need to be called manually). The code inside the server-side done sends the response back to the client; if SetFailed() was called, the error information is sent instead. When the client receives the response, its controller becomes Failed() (false on success), and Controller::ErrorCode() and Controller::ErrorText() return the error code and error message respectively.
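A minimal sketch inside the Echo() above (the error code and the `message` field are assumptions; adjust to your proto):

```c++
void Echo(google::protobuf::RpcController* cntl_base,
          const EchoRequest* request,
          EchoResponse* response,
          google::protobuf::Closure* done) {
    brpc::ClosureGuard done_guard(done);
    brpc::Controller* cntl = static_cast<brpc::Controller*>(cntl_base);
    if (request->message().empty()) {
        // Mark the RPC as failed; done_guard still calls done->Run() at return.
        cntl->SetFailed(brpc::EREQUEST, "message must not be empty");
        return;
    }
    response->set_message(request->message());
}
```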
Users may set the [status-code](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html) for http calls by `controller.http_response().set_status_code()` at the server side. Standard status codes are defined in [http_status_code.h](https://github.com/brpc/brpc/blob/master/src/brpc/http_status_code.h). If SetFailed() is called but the status code is unset, brpc automatically chooses the status code closest to the error code. brpc::HTTP_STATUS_INTERNAL_SERVER_ERROR(500) is chosen at worst.
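For example (the concrete status code and error code below are illustrative):

```c++
// Inside the service method of an http call:
cntl->http_response().set_status_code(brpc::HTTP_STATUS_FORBIDDEN);
cntl->SetFailed(EPERM, "not allowed");  // EPERM from <cerrno>; any error code works
```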
## Get address of client
controller->remote_side() gets the address of the client sending the request; the return type is butil::EndPoint. If the client is nginx, remote_side() is the address of nginx. To get the address of the "real" client before nginx, set `proxy_header ClientIp $remote_addr;` in nginx and call `controller->http_request().GetHeader("ClientIp")` inside the RPC to get the address.
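A small sketch (the header name must match what you configured in nginx):

```c++
butil::EndPoint peer = cntl->remote_side();  // nginx's address if proxied
const std::string* real_ip = cntl->http_request().GetHeader("ClientIp");
LOG(INFO) << "peer=" << peer
          << " real_client=" << (real_ip ? *real_ip : "unknown");
```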
In an **asynchronous service**, done->Run() is called after CallMethod() returns.
Some servers mainly proxy requests to backend servers and wait for the responses for a long time. To make better use of threads, store done in the corresponding event handlers, which call done->Run() after CallMethod() returns. This kind of service is **asynchronous**.
The last statement of an asynchronous service is generally `done_guard.release()` to prevent done->Run() from being called when CallMethod() exits successfully. Check out [example/session_data_and_thread_local](https://github.com/brpc/brpc/tree/master/example/session_data_and_thread_local/) for an example.
Service and Channel both use done to represent the continuation code after CallMethod, but they're **totally different**:
* done of Service is created by brpc and called by the user after processing the request, to send the response back to the client.
* done of Channel is created by the user and called by brpc to run the post-processing code written by the user after completion of the RPC.
In an asynchronous service that also accesses other services, the user may manipulate both kinds of done in one session; be careful.
# Add Service
A default-constructed Server contains no services and serves no requests; it is merely an object.
If ownership is SERVER_OWNS_SERVICE, Server deletes the service at destruction. To prevent the deletion, set ownership to SERVER_DOESNT_OWN_SERVICE. The code to add MyEchoService:
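A minimal sketch, assuming MyEchoService is the implementation shown earlier:

```c++
#include <brpc/server.h>
#include <butil/logging.h>

brpc::Server server;
MyEchoService my_echo_service;
// The server merely keeps a pointer; it won't delete the service at destruction.
if (server.AddService(&my_echo_service, brpc::SERVER_DOESNT_OWN_SERVICE) != 0) {
    LOG(ERROR) << "Fail to add my_echo_service";
    return -1;
}
```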
"localhost:9000", "cq01-cos-dev00.cq01:8000", "127.0.0.1:7000" are valid `ip_and_port_str`.
All parameters take default values if `options` is NULL. If you need non-default values, code as follows:
```c++
brpc::ServerOptions options;  // with default values
options.xxx = yyy;
...
server.Start(...,&options);
```
## Listen to multiple ports
One server can only listen to one port (not counting ServerOptions.internal_port); you have to start N servers to listen to N ports.
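A sketch of the pattern (the ports are illustrative; add services to each server as needed):

```c++
brpc::Server server_a;
brpc::Server server_b;
// ... AddService() to each server ...
if (server_a.Start(8000, NULL) != 0 || server_b.Start(8001, NULL) != 0) {
    LOG(ERROR) << "Fail to start servers";
    return -1;
}
```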
# Stop server
```c++
server.Stop(closewait_ms);  // closewait_ms is actually useless, not removed due to compatibility
server.Join();
```
Stop() does not block while Join() does. The reason for splitting them into two methods: when multiple servers quit, users can Stop() all servers first and then Join() them together; otherwise servers can only be Stop()+Join() one by one, and the total waiting time may add up to the number of servers times a single wait at worst.
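A small sketch of the parallel-quit pattern (server_a and server_b are the illustrative servers from the earlier sketch):

```c++
// Stop() returns quickly; Join() blocks until processing finishes.
server_a.Stop(0 /*closewait_ms, ignored*/);
server_b.Stop(0);
server_a.Join();  // the waits of both servers now overlap
server_b.Join();
```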
Regardless of the value of closewait_ms, the server waits for all requests being processed to finish when exiting, and immediately rejects new requests with ELOGOFF errors to prevent them from entering the service. The reason for this design is that as long as the server is still processing requests, there's a risk of accessing released memory. If a Join() on your server "sticks", some thread is probably hanging on a request or done->Run() was not called.
When a client sees ELOGOFF, it skips the corresponding server and retries the request on another server. As a result, brpc servers always exit "gracefully" and restarting a server does not lose traffic.
RunUntilAskedToQuit() simplifies the code for running and stopping the server in most cases. The following code runs the server until Ctrl-C is pressed.
```c++
// Wait until Ctrl-C is pressed, then Stop() and Join() the server.
server.RunUntilAskedToQuit();
// server is stopped, write the code for releasing resources.
```
Services can be added or removed after Join() returns and the server can be Start()-ed again.
```
FATAL: 05-10 14:40:05: * 0 src/brpc/input_messenger.cpp:89] A message from 127.0.0.1:35217(protocol=baidu_std) is bigger than 67108864 bytes, the connection will be closed. Set max_body_size to allow bigger messages
FATAL: 05-10 13:35:02: * 0 google/protobuf/io/coded_stream.cc:156] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
```
```
[a27eda84bcdeef529a76f22872b78305] Not allowed to access builtin services, try ServerOptions.internal_port=... instead if you're inside Baidu's network
```
[bvar](https://github.com/brpc/brpc/tree/master/src/bvar/) is a counting utility designed for multi-threaded applications. It stores data in thread-local storage (TLS) to avoid the costly cache bouncing caused by concurrent modification. It is much faster than UbMonitor (a legacy counting utility used inside Baidu) and atomic operations in highly contended scenarios. bvar is built into brpc: through [/vars](http://brpc.baidu.com:8765/vars) you can access all the exposed bvars inside the server, or a single one specified by [/vars/`VARNAME`](http://brpc.baidu.com:8765/vars/rpc_socket_count). Check out [bvar](../cn/bvar.md) if you'd like to add bvars for your own services. bvar is widely used inside brpc to calculate indicators of its internal status. Collecting data is **almost free** in most scenarios, so if you are looking for a utility to collect and show internal status of your application, try bvar first. However, bvar is not suitable as a general-purpose counter: reading a single bvar has to combine the TLS data from every thread that has ever written to it, which is much slower than writing or an atomic operation.
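For your own services, defining a bvar is a one-liner; a minimal sketch (names are illustrative):

```c++
#include <bvar/bvar.h>

// Exposed on /vars as "my_service_count"; each write only touches thread-local data.
bvar::Adder<int> g_my_service_count("my_service_count");

void OnRequest() {
    g_my_service_count << 1;  // count one request, almost free
}
```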
## Check out bvars
...
...
The following animation shows how you can check out bvars with a pattern.
![img](../images/vars_1.gif)
There's also a search box in front of the /vars page. You can check out bvars with parts of their names. Different parts can be separated by `,`, `:` or spaces.
![img](../images/vars_2.gif)
It's OK to access /vars from a terminal with curl as well.
You can click most of the numerical bvars to check out their timing diagrams. Each clickable bvar stores values of the recent `60s/60m/24h/30d`, *174* numbers in total. It takes about 1M of memory when there are 1000 clickable bvars.
![img](../images/vars_3.gif)
## Calculate and check out percentiles
A percentile indicates the value below which a given percentage of the samples in a group fall. E.g. if there are 1000 samples in a time window, the value at the 500th position (1000 * 50%) of the sorted set is the 50%-percentile (a.k.a. the median), the value at the 990th position is the 99%-percentile (1000 * 99%), and the value at the 999th position is the 99.9%-percentile. Percentiles show more information about the latency distribution than the average latency, which is very important when calculating the SLA of a service. Usually the 99.9%-percentile of latency limits the usage of the service rather than the average latency.
Percentiles can be plotted as a CDF curve or a timing diagram.
![img](../images/vars_4.png)
The diagram above is a CDF curve. The vertical axis is the latency and the horizontal axis is the percentage of samples with latency less than the value on the vertical axis. Obviously, this diagram is plotted from percentiles between 10% and 99.99%. For example, the vertical-axis value corresponding to 50% on the horizontal axis is the 50%-percentile (the median). CDF is short for [Cumulative Distribution Function](https://en.wikipedia.org/wiki/Cumulative_distribution_function). When we choose a vertical-axis value `x`, the corresponding horizontal-axis value means "the ratio of samples with value <= `x`". If the samples are randomly drawn, it stands for "*the probability* of value <= `x`", which is exactly the definition of the distribution. The derivative of the CDF is the [PDF (probability density function)](https://en.wikipedia.org/wiki/Probability_density_function). In other words, if we divide the vertical axis of the CDF into a number of small segments, calculate the difference between the corresponding horizontal-axis values at both ends of each segment and use the difference as a new axis, we would draw the PDF curve, just like a (horizontally-laid) normal or Poisson distribution. The density around the median is significantly higher than in the long tail of a PDF curve, but we care more about the long tail. As a result, most system tests show CDF curves rather than PDF curves.
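In formula form (a short restatement of the paragraph above), with $X$ denoting the sampled latency:

$$F(x) = P(X \le x), \qquad f(x) = \frac{dF(x)}{dx}, \qquad F(x_p) = p \ \text{ defines the } p\text{-percentile } x_p.$$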
Some simple rules to check whether it is a *good* CDF curve:
- The flatter the better. It's best if the CDF curve is just a horizontal line, which indicates that there's no waiting, congestion or pausing. Of course that's practically impossible.
- The narrower after 99% the better, which shows the range of the long tail, a very important part of the SLA of most systems. For example, if one of the indicators of a storage system is "*99.9%* of reads should finish in *xx milliseconds*", the maintainer should care about the value at 99.9%; if one of the indicators of a search system is "*99.99%* of requests should finish in *xx milliseconds*", maintainers should care about the value at 99.99%.
It is a good CDF curve if the gradient is small and the tail is narrow.
![img](../images/vars_5.png)
Above is a timing diagram of percentiles, consisting of four curves. The horizontal axis is the time and the vertical axis is the latency. The curves from top to bottom show the latency at the 99.9%/99%/90%/50%-percentiles over time, with colors from dark to light (orange to yellow). You can move the mouse over the curves to read the data at different times. The numbers shown above mean "the `99%`-percentile of latency `39` seconds ago was `330` microseconds". The curve of the 99.99%-percentile is not included in this diagram since it's usually significantly higher than the others, which would make the other four curves hard to tell apart. You can click the bvars whose names end with "*_latency_9999*" to check the 99.99%-percentile alone, and you can also check out the curves of the 50%, 90%, 99% and 99.9% percentiles alone in the same way. The timing diagram shows the trends of percentiles, which is very helpful when you are analyzing the performance of a system.
brpc calculates the latency distribution of services automatically; users don't need to do this by themselves. The result looks like the following picture.