@@ -4,13 +4,13 @@ Generally there are three ways of IO operations:
- non-blocking IO: If there is nothing to read or overcrowded to write, the API will return immediately with an error code. Non-blocking IO is often used with IO multiplexing([poll](http://linux.die.net/man/2/poll), [select](http://linux.die.net/man/2/select), [epoll](http://linux.die.net/man/4/epoll) in Linux or [kqueue](https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2) in BSD).
- asynchronous IO: you call an API to start a read/write operation, and the framework calls you back when it is done, such as [OVERLAPPED](https://msdn.microsoft.com/en-us/library/windows/desktop/ms684342(v=vs.85).aspx) + [IOCP](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx) in Windows. Native AIO in Linux is only supported for files.
IO multiplexing is usually used to increabse IO concurrency in Linux. When the IO concurrency is low, IO multiplexing is not necessarily more efficient than blocking IO, since blocking IO is handled completely by the kernel and system calls like read/write are highly optimized which are apparently more effective. But with the increasement of IO concurrency, the drawbacks of blocking one thread in blocking IO is revealed: the kernel kept switching between threads to do effective works, and a cpu core may only do a little bit of works, immediately replaced by another thread, causing cpu cache not fully utilized. In addition a large number of threads will make performance of code dependent on thread-local variables significantly decreased, such as tcmalloc. Once malloc slows down, the overall performance of the program will often decrease. While IO multiplexing is typically composed of a small number of event dispatching threads and some worker threads that run user code, event dispatching and worker can run simultaneously at the same time and kernel can do the job without frequent switching. There is no need to have many threads, so the use of thread-local variables is also more adequate, in which time IO multiplexing is faster than blocking IO. But IO multiplexing also has its own problems, it needs to call more system calls, such as[epoll_ctl](http://man7.org/linux/man-pages/man2/epoll_ctl.2.html). Since a red-black tree is used inside epoll, epoll_ctl is not a very fast operation, especially in multi-threaded environment. Implementations dependent on epoll_ctl is often confronted with tricky scalability problem. IO multiplexing has to solve a lot of multi-thresaded problems, the code is much more complex than that using blocking IO.
IO multiplexing is usually used to increase IO concurrency in Linux. When the IO concurrency is low, IO multiplexing is not necessarily more efficient than blocking IO, since blocking IO is handled completely by the kernel and system calls like read/write are highly optimized which are apparently more effective. But with the increasement of IO concurrency, the drawbacks of blocking one thread in blocking IO is revealed: the kernel kept switching between threads to do effective works, and a cpu core may only do a little bit of works, immediately replaced by another thread, causing cpu cache not fully utilized. In addition a large number of threads will make performance of code dependent on thread-local variables significantly decreased, such as tcmalloc. Once malloc slows down, the overall performance of the program will often decrease. While IO multiplexing is typically composed of a small number of event dispatching threads and some worker threads that run user code, event dispatching and worker can run simultaneously at the same time and kernel can do the job without frequent switching. There is no need to have many threads, so the use of thread-local variables is also more adequate, in which time IO multiplexing is faster than blocking IO. But IO multiplexing also has its own problems, it needs to call more system calls, such as[epoll_ctl](http://man7.org/linux/man-pages/man2/epoll_ctl.2.html). Since a red-black tree is used inside epoll, epoll_ctl is not a very fast operation, especially in multi-threaded environment. Implementations dependent on epoll_ctl is often confronted with tricky scalability problem. IO multiplexing has to solve a lot of multi-threaded problems, the code is much more complex than that using blocking IO.
# Receiving messages
A message is a fix-length binary data read from a connection, which may be a request from upstream clients or a response from downstream servers. Brpc uses one or serveral [EventDispatcher](https://github.com/brpc/brpc/blob/master/src/brpc/event_dispatcher.cpp)(referred to as EDISP) waiting for events from any fd. Unlike the common IO threads, EDISP is not responsible for reading or writing. The problem of IO threads is that one thread can only read one fd at a given time, so some read requests may starve when many budy fds are assigned to one IO thread. Features like multi-tenant, flow scheduling and [Streaming RPC](streaming_rpc.md) will aggravate the problem. The occasional long delayed read at high load also slows down the reading of all fds in an IO thread, which has a great impact on usability.
A message is a fix-length binary data read from a connection, which may be a request from upstream clients or a response from downstream servers. Brpc uses one or several [EventDispatcher](https://github.com/brpc/brpc/blob/master/src/brpc/event_dispatcher.cpp)(referred to as EDISP) waiting for events from any fd. Unlike the common IO threads, EDISP is not responsible for reading or writing. The problem of IO threads is that one thread can only read one fd at a given time, so some read requests may starve when many busy fds are assigned to one IO thread. Features like multi-tenant, flow scheduling and [Streaming RPC](streaming_rpc.md) will aggravate the problem. The occasional long delayed read at high load also slows down the reading of all fds in an IO thread, which has a great impact on usability.
Because of a [bug](https://patchwork.kernel.org/patch/1970231/) of epoll and greate overhead of epoll_ctl, Edge triggered mode is used in EDISP. When receiving an event, an atomic variable related to the current fd is added by one. Only when the variable is zero before addition, a bthread is started to handle the data from the fd. The pthread in which EDISP runs is used to run this new created bthread, making it have better cache locality and read the data as fast as possible. While the bthread in which EDISP runs will be stolen to another pthread and keep running, this process is called work stealing scheduing in bthread. To understand exactly how that atomic variable works, you can read first[atomic instructions](atomic_instructions.md), then [Socket::StartInputEvent](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions happened when reading the same fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
Because of a [bug](https://patchwork.kernel.org/patch/1970231/) of epoll and great overhead of epoll_ctl, Edge triggered mode is used in EDISP. When receiving an event, an atomic variable related to the current fd is added by one. Only when the variable is zero before addition, a bthread is started to handle the data from the fd. The pthread in which EDISP runs is used to run this new created bthread, making it have better cache locality and read the data as fast as possible. While the bthread in which EDISP runs will be stolen to another pthread and keep running, this process is called work stealing scheduling in bthread. To understand exactly how that atomic variable works, you can read first[atomic instructions](atomic_instructions.md), then [Socket::StartInputEvent](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions happened when reading the same fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
[InputMessenger](https://github.com/brpc/brpc/blob/master/src/brpc/input_messenger.h) is responsible for cutting and handling messages and uses callbacks from user to handle different format of data. Parse is used to cut messages from binary data with nearly fixed running time; Process is used to parse messages further(such as deserialization using protobuf) and call users' callbacks with unfixed running time. InputMessenger will try to users' callbacks one by one. When a Parse successfully cut the next message, call the corresponding Process. Since there are often only one message format in one connection, InputMessenger will record the last choice to avoid try every time. If n(n > 1) messages are read from the fd, InputMessenger will launch n-1 bthreads to handle first n-1 messages respectively, and the last message will be processed in the current bthread.
...
...
@@ -18,4 +18,26 @@ It can be seen that the fd and messages from fd are processed concurrently in br
# Sending Messages
A message is a fix-length binary data write to a connection, which may be a response to upstream clients or a request to downstream servers.
A message is a fix-length binary data write to a connection, which may be a response to upstream clients or a request to downstream servers. Multiple threads may send messages to a fd at the same time, and writing to a fd is non-atomic, so how to efficiently queue writes of different thread is a key point here. Brpc uses a kind of wait-free MPSC list to implement this feature. All the data ready to write is put into a single list node, whose next pointer is a special value(Socket::WriteRequest::UNCONNECTED). When a thread want to write out some data, it first try to atomic exchange with list head(Socket::_write_head) and get the value of head before exchange. If this value is empty, the current thread gets the right to write and writes out the data in situ. Otherwise there is another thread writing out and it points the next pointer to the previous head, making the thread currently writing out see this new data. This method makes the writing process wait-free. Although the thread that gets the right to write is not wait-free nor lock-free in principle and may be locked by a node that is still UNCONNECTED(the thread issuing write is scheduled out by os just after atomic exchange and before setting the next pointer), this rarely occurs in practice. In the current Implementations, if all of the data cannot be written out in one time, a KeepWrite thread will be launched and writes the remaining data out. This mechanism is very complex and the general principle is shown below. More details please read [socket.cpp](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp).
![img](../images/write.png)
Since the write in brpc can always complete in a short time, the calling thread can handle more new tasks quickly and background writing thread can also get a batch of tasks to write out, which forms pipeline effect and increases the efficiency of IO at a high throughput.
# Socket
All the data structures related to fd is in [Socket](https://github.com/brpc/brpc/blob/master/src/brpc/socket.h), which is one of the most complex structure in brpc. The unique feature of this structure is that it uses 64-bit SocketId to refer to Socket object to facilitate the use of fd in a multi-threaded environment. Three commonly used methods are:
- Create: create a Socket, and return its SocketId.
- Address: retrieve Socket from id, and wrap it into a unique_ptr(SocketUniquePtr) that will be automatically freed. When Socket is set failed, the pointer returned is empty. As long as Address returns a non-null pointer, its contents are guaranteed not to change until the pointer is automatically destructed. This function is wait-free.
- SetFailed: Mark a Socket as failed and all Address of the corresponding SocketId will return empty pointer until health checking succeeds. Socket will be recycled after reference count hit zero. This function is lock-free.
We can see that Socket is similar to [shared_ptr](http://en.cppreference.com/w/cpp/memory/shared_ptr), SocketId is similar to [weak_ptr](http://en.cppreference.com/w/cpp/memory/weak_ptr). SetFailed owned uniquely by Socket makes it cannot be addressed so that the reference count finally hits zero. Simply using shared_ptr/weak_ptr cannot guarantee this property. For example, when a server needs quit and requests still arrive frequently, the reference count of Socket cannot hit zero causing server unable to quit. What's more, weak_ptr cannot be directly as epoll data, but SocketId can. All these facts drive us to design Socket. The core part of Socket is rarely changed since October 2014 and is very stable.
Using SocketUniquePtr or SocketId depends on the need for strong reference. Just like Controller runs through all the process of RPC and has a lot of interactions with Socket, it uses SocketUniquePtr. Epoll is used to notify that there are events happened in the fd which becomes to a dispensable event if the Socket is recycled, so SocketId is stored in epoll data. As long as SocketUniquePtr is valid, the corresponding Socket in it will not be changed so that users have no needs to the troubles of the race condition and ABA problem and can safely operate on the shared fd. This method also circumvents the implicit reference count and the ownership of memory is clear, causing that the quality of the program is well guaranteed. Brpc has a lot of SocketUniquePtr and SocketId, they really simplified our development.
In fact, Socket manages not only the native fd but also other resources, such as every SubChannel in SelectiveChannel is placed into a Socket, making SelectiveChannel can choose a SubChannel to send just like a normal channel can choose a downstream server. This fake Socket even implements health checking. Streaming RPC also uses Socket to reuse the process of wait-free write.