Generally there are three ways to perform IO:
- blocking IO: once an IO operation is issued, the current thread is blocked until the IO completes. This is a kind of synchronous IO, such as the default behavior of POSIX [read](http://linux.die.net/man/2/read) and [write](http://linux.die.net/man/2/write).
- non-blocking IO: if there is nothing to read or too much to write, APIs that would otherwise block return immediately with an error code. Non-blocking IO is often used with IO multiplexing ([poll](http://linux.die.net/man/2/poll), [select](http://linux.die.net/man/2/select), [epoll](http://linux.die.net/man/4/epoll) in Linux or [kqueue](https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2) in BSD).
- asynchronous IO: start a read/write operation with a callback, which is invoked when the IO completes, such as [OVERLAPPED](https://msdn.microsoft.com/en-us/library/windows/desktop/ms684342(v=vs.85).aspx) + [IOCP](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx) in Windows. Native AIO in Linux only supports files.
Non-blocking IO is usually used to increase IO concurrency in Linux. When the IO concurrency is low, non-blocking IO is not necessarily more efficient than blocking IO, which is handled completely by the kernel: system calls like read/write are highly optimized and quite efficient. But when IO concurrency increases, the drawback of blocking one thread per IO in blocking IO arises: the kernel keeps switching between threads to get useful work done, and a cpu core may only do a little bit of work before being replaced by another thread, leaving cpu caches poorly utilized. In addition, a large number of threads decreases the performance of code dependent on thread-local variables, such as tcmalloc. Once malloc slows down, the overall performance of the program decreases as well. In contrast, non-blocking IO is typically composed of a relatively small number of event-dispatching threads and worker threads (running user code), which are reused by different tasks (in other words, part of the scheduling work is moved into userland). Event dispatchers and workers can run on different cpu cores simultaneously without frequent switches inside the kernel. There is no need for many threads, so thread-local variables also work better. All these factors make non-blocking IO faster than blocking IO. However, non-blocking IO has its own problems, one of which is more system calls, such as [epoll_ctl](http://man7.org/linux/man-pages/man2/epoll_ctl.2.html). Since epoll is implemented as a red-black tree, epoll_ctl is not a very fast operation, especially in multi-threaded environments. Implementations heavily dependent on epoll_ctl are often confronted with multi-core scalability issues. Non-blocking IO also has to solve a lot of multi-threading problems, producing more complex code than blocking IO.
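The core property that event dispatchers rely on can be seen with plain POSIX calls (this is an illustration, not brpc code): a read on a fd in `O_NONBLOCK` mode fails immediately with `EAGAIN` instead of blocking the thread.

```cpp
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Minimal illustration: with O_NONBLOCK, a read() that has nothing to
// consume returns -1/EAGAIN immediately instead of blocking the thread.
// Returns 0 on success, -1 if any step behaves unexpectedly.
int demo_nonblocking_read() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    // Switch the read end to non-blocking mode.
    int flags = fcntl(fds[0], F_GETFL, 0);
    if (fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != 0) return -1;

    char buf[16];
    // Nothing written yet: read() fails immediately with EAGAIN/EWOULDBLOCK.
    ssize_t n = read(fds[0], buf, sizeof(buf));
    if (!(n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))) return -1;

    // Once data arrives, the same call succeeds without ever blocking.
    if (write(fds[1], "ping", 4) != 4) return -1;
    n = read(fds[0], buf, sizeof(buf));
    if (n != 4 || memcmp(buf, "ping", 4) != 0) return -1;

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```

An event-dispatching design registers such fds with epoll/kqueue and only reads when readiness is reported, so no thread is parked on any single fd.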
# Receiving messages
A message is a piece of bounded binary data read from a connection, which may be a request from upstream clients or a response from downstream servers. brpc uses one or several [EventDispatcher](https://github.com/brpc/brpc/blob/master/src/brpc/event_dispatcher.cpp) (referred to as EDISP) to wait for events on file descriptors. Unlike common "IO threads", EDISP is not responsible for reading or writing. The problem with IO threads is that one thread can only read one fd at a given time, so some reads are delayed when many fds in one IO thread are busy. Multi-tenancy, complicated load balancing and [Streaming RPC](streaming_rpc.md) make the problem worse. Under high workloads, an occasional long read on one fd may slow down reads from all other fds in the same IO thread, greatly impacting usability.
Because of a [bug](https://patchwork.kernel.org/patch/1970231/) in epoll (at the time brpc was developed) and the overhead of epoll_ctl, edge-triggered mode is used in EDISP. After receiving an event, an atomic variable associated with the fd is incremented atomically. Only if the variable was zero before the increment is a bthread started to handle the data from the fd. The pthread worker in which EDISP runs is yielded to the newly created bthread to give it better cache locality and let it start reading as soon as possible. The bthread in which EDISP runs is stolen to another pthread and keeps running; this mechanism is the work stealing scheduling used in bthreads. To understand exactly how that atomic variable works, you can read [atomic instructions](atomic_instructions.md) first, then check [Socket::StartInputEvent](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions on dispatching events of one fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
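The atomic-counter trick above can be sketched as follows. This is a simplified model, not brpc's actual code (the real logic lives in `Socket::StartInputEvent` and friends): only the caller that lifts the counter from 0 starts a handler; later events just add 1 and are drained by the handler that is already running.

```cpp
#include <atomic>

// Sketch of per-fd event dispatch: _nevent counts pending events.
class EventOwner {
public:
    // Called by the event dispatcher for each edge-triggered event.
    // Returns true iff this call must start a new handling bthread.
    bool OnEvent() {
        // fetch_add returns the value *before* the addition.
        return _nevent.fetch_add(1, std::memory_order_acq_rel) == 0;
    }

    // Called by the handler after processing one round; returns true if
    // more events arrived meanwhile and the handler should loop again.
    bool FinishedOneRound() {
        int expected = _nevent.load(std::memory_order_acquire);
        while (true) {
            if (expected == 1) {
                // No new events: step down from 1 to 0 and exit.
                if (_nevent.compare_exchange_weak(
                        expected, 0, std::memory_order_acq_rel)) {
                    return false;
                }
            } else {
                // Events piled up: collapse the count back to 1 and
                // keep handling in the same bthread.
                if (_nevent.compare_exchange_weak(
                        expected, 1, std::memory_order_acq_rel)) {
                    return true;
                }
            }
        }
    }

private:
    std::atomic<int> _nevent{0};
};
```

Both paths are a bounded number of atomic operations on the dispatcher side, which is what makes contention between events of one fd wait-free.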
[InputMessenger](https://github.com/brpc/brpc/blob/master/src/brpc/input_messenger.h) cuts messages and uses customizable callbacks to handle different formats of data. The `Parse` callback cuts messages from binary data and has a relatively stable running time; `Process` parses messages further (such as deserializing by protobuf) and calls users' callbacks, whose running time varies. If n (n > 1) messages are read from the fd, InputMessenger launches n-1 bthreads to handle the first n-1 messages respectively and processes the last message in place. InputMessenger tries protocols one by one. Since one connection often carries only one type of message, InputMessenger remembers the current protocol to avoid trying all protocols next time.
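The "remember the last matching protocol" idea can be sketched like this. All types and names here are hypothetical simplifications, not InputMessenger's real API: each connection stores the index of the protocol that matched last time and tries it first.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class ParseResult { kOk, kTryOther };

struct Protocol {
    const char* name;
    // Parse should cut one message cheaply; here it only checks a prefix.
    ParseResult (*parse)(const std::string& data);
};

struct Connection {
    size_t preferred = 0;  // index of the protocol that matched last time
};

// Try protocols starting from the remembered one; return the index of
// the protocol that accepted the data, or -1 if none did.
int CutMessage(Connection& conn, const std::vector<Protocol>& protocols,
               const std::string& data) {
    for (size_t i = 0; i < protocols.size(); ++i) {
        size_t idx = (conn.preferred + i) % protocols.size();
        if (protocols[idx].parse(data) == ParseResult::kOk) {
            conn.preferred = idx;  // remember for following messages
            return (int)idx;
        }
    }
    return -1;
}
```

After the first message, later messages on the same connection hit the remembered protocol on the first try, so the per-message cost no longer grows with the number of registered protocols.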
It can be seen that messages from different fds, or even the same fd, are processed concurrently in brpc, which makes brpc good at handling large messages and at reducing long tails when processing messages from different sources under high workloads.
# Sending messages
A message is a piece of bounded binary data written to a connection, which may be a response to upstream clients or a request to downstream servers. Multiple threads may send messages to a fd at the same time, but writing to a fd is non-atomic, so how to queue writes from different threads efficiently is a key technique. brpc uses a special wait-free MPSC list to solve the issue. All data ready to write is put into a node of a singly-linked list, whose next pointer initially points to a special value (`Socket::WriteRequest::UNCONNECTED`). When a thread wants to write out some data, it first tries to atomically exchange the node with the list head (`Socket::_write_head`). If the head before the exchange is empty, the caller gets the right to write and writes out the data in place once. Otherwise another thread must be writing; the caller points its node's next pointer to the previous head to keep the linked list connected, and the thread that is writing will see the new node later and write its data.
This method makes the write contentions wait-free. Although the thread that gets the right to write is not wait-free or lock-free in principle and may be blocked by a node that is still UNCONNECTED (the thread issuing the write is swapped out by the OS just after the atomic exchange and before setting the next pointer, a window of just one instruction), such blocking rarely happens in practice. In the current implementation, if the data cannot be fully written in one call, a KeepWrite bthread is created to write the remaining data. This mechanism is fairly complicated and its principle is depicted below. Read [socket.cpp](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp) for more details.
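The push side of this exchange-based list can be sketched as follows. This is a deliberately simplified model (real `Socket::WriteRequest` carries more state and the writer also has to drain and reverse the list): the only synchronization on the hot path is one atomic exchange.

```cpp
#include <atomic>
#include <string>

struct WriteRequest {
    std::string data;
    WriteRequest* next;  // set to a sentinel before publishing
};

// Sentinel standing in for Socket::WriteRequest::UNCONNECTED: the node
// is published but its next pointer is not linked yet.
static WriteRequest* const UNCONNECTED = (WriteRequest*)1;

struct FakeSocket {
    std::atomic<WriteRequest*> write_head{nullptr};

    // Returns true iff the caller obtained the right to write in place.
    bool Push(WriteRequest* req) {
        req->next = UNCONNECTED;
        WriteRequest* prev =
            write_head.exchange(req, std::memory_order_acq_rel);
        if (prev == nullptr) {
            // List was empty: this thread becomes the writer.
            req->next = nullptr;
            return true;
        }
        // Another thread is writing: link our node in front of the
        // previous head; the writer will discover it by walking next.
        req->next = prev;
        return false;
    }
};
```

Every caller performs exactly one exchange and at most two plain stores, regardless of contention, which is what makes the enqueue wait-free; only the single thread holding the write right ever touches the fd.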

Since writes in brpc almost always complete within a short time, the calling thread can handle new tasks more quickly, and the background KeepWrite bthread gets more tasks to write out in one batch, forming a pipeline and increasing IO efficiency at high throughput.
# Socket
[Socket](https://github.com/brpc/brpc/blob/master/src/brpc/socket.h) contains the data structures related to a fd and is one of the most complex structures in brpc. Its unique feature is that it uses a 64-bit SocketId to refer to a Socket object, which facilitates uses of the fd in multi-threaded environments. Three commonly used methods:
- Create: create a Socket and return its SocketId.
- Address: retrieve a Socket from its id and wrap it into a unique_ptr (SocketUniquePtr) that releases the reference automatically. When the Socket has been set failed, the returned pointer is empty. As long as Address returns a non-null pointer, its contents are guaranteed not to change until the pointer is destructed. This function is wait-free.
- SetFailed: mark a Socket as failed, after which any Address on the corresponding SocketId returns an empty pointer (until health checking revives the Socket). The Socket is recycled after its reference count hits zero. This function is lock-free.
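The contract of these three methods can be modeled in a few lines. This toy (hypothetical names, far simpler than brpc's real id-to-Socket mapping, and using shared_ptr where brpc uses intrusive counting) only shows the semantics: an id yields a usable strong reference until SetFailed, and existing references stay valid afterwards.

```cpp
#include <memory>

struct ToySocket {
    bool failed = false;
};

// A single-slot stand-in for the SocketId -> Socket table.
struct ToyTable {
    std::shared_ptr<ToySocket> slot;

    // Create: make a socket; the "id" is implicit (the single slot).
    void Create() { slot = std::make_shared<ToySocket>(); }

    // Address: return a strong reference, or empty after SetFailed.
    std::shared_ptr<ToySocket> Address() const {
        return (slot && !slot->failed) ? slot : nullptr;
    }

    // SetFailed: later Address() calls return empty; strong references
    // already handed out keep the object alive until they are dropped.
    void SetFailed() {
        if (slot) slot->failed = true;
    }
};
```

The real Socket adds the parts shared_ptr cannot express: wait-free Address, lock-free SetFailed, and a 64-bit id that fits into epoll data.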
We can see that, in terms of reference counting, Socket is similar to [shared_ptr](http://en.cppreference.com/w/cpp/memory/shared_ptr) and SocketId is similar to [weak_ptr](http://en.cppreference.com/w/cpp/memory/weak_ptr). SetFailed, which shared_ptr does not have, prevents the Socket from being addressed again so that its reference count can finally hit zero. Simply using shared_ptr/weak_ptr cannot guarantee this property: for example, when a server needs to quit while requests are still arriving frequently, the reference count of a Socket may never hit zero, leaving the server unable to quit. What's more, weak_ptr cannot be put directly into epoll data, but SocketId can. All these facts drove the design of Socket, whose core part has rarely changed since October 2014 and is very stable.
Using SocketUniquePtr or SocketId depends on whether a strong reference is needed. For example, Controller runs through the whole RPC process and interacts with Socket a lot, so it uses SocketUniquePtr. epoll notifies events on fds, and events of a recycled Socket can simply be ignored, so epoll stores SocketId in its data field.
As long as a SocketUniquePtr is valid, the enclosed Socket will not be changed, so users do not need to worry about race conditions or ABA problems and can operate the shared fd safely. This method also avoids implicit reference counting and makes the ownership of memory clear, producing better-quality programs. brpc uses SocketUniquePtr and SocketId a lot; they really simplify development.
In fact, Socket manages not only native fds but also other resources. For example, every SubChannel in a SelectiveChannel is placed into a Socket as well, so a SelectiveChannel can choose a SubChannel to send to just like a normal channel chooses a downstream server. This fake Socket even implements health checking. Streaming RPC also uses Socket to reuse the wait-free write mechanism.