Generally there are three ways to perform IO:
- blocking IO: once an IO operation is issued, the current thread is blocked until the IO completes. This is a kind of synchronous IO, such as the default behavior of POSIX [read](http://linux.die.net/man/2/read) and [write](http://linux.die.net/man/2/write).
- non-blocking IO: if there is nothing to read or too much to write, APIs that would otherwise block return immediately with an error code. Non-blocking IO is often used with IO multiplexing ([poll](http://linux.die.net/man/2/poll), [select](http://linux.die.net/man/2/select), [epoll](http://linux.die.net/man/4/epoll) in Linux or [kqueue](https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2) in BSD).
- asynchronous IO: start a read/write operation with a callback, which is invoked when the IO completes, such as [OVERLAPPED](https://msdn.microsoft.com/en-us/library/windows/desktop/ms684342(v=vs.85).aspx) + [IOCP](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx) in Windows. Native AIO in Linux only supports files.
Non-blocking IO is usually used to increase IO concurrency in Linux. When the IO concurrency is low, non-blocking IO is not necessarily more efficient than blocking IO, which is handled completely by the kernel: system calls like read/write are highly optimized and quite efficient. But when IO concurrency increases, the drawback of blocking one thread per IO in blocking IO arises: the kernel keeps switching between threads to get useful work done, and a cpu core may only do a little bit of work before being replaced by another thread, leaving cpu caches poorly utilized. In addition, a large number of threads decreases the performance of code dependent on thread-local variables, such as tcmalloc. Once malloc slows down, the overall performance of the program decreases as well. In contrast, non-blocking IO is typically composed of a relatively small number of event-dispatching threads and worker threads (running user code), which are reused by different tasks (in other words, part of the scheduling work is moved into userland). Event dispatchers and workers can run on different cpu cores simultaneously without frequent switches inside the kernel. There is no need for many threads, so thread-local variables also work better. All these factors make non-blocking IO faster than blocking IO. However, non-blocking IO has its own problems, one of which is more system calls, such as [epoll_ctl](http://man7.org/linux/man-pages/man2/epoll_ctl.2.html). Since epoll is implemented as a red-black tree, epoll_ctl is not a very fast operation, especially in multi-threaded environments. Implementations heavily dependent on epoll_ctl are often confronted with multi-core scalability issues. Non-blocking IO also has to solve a lot of multi-threading problems, producing more complex code than blocking IO.
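The core property that event dispatchers rely on can be seen with plain POSIX calls (this is an illustration, not brpc code): a read on a fd in `O_NONBLOCK` mode fails immediately with `EAGAIN` instead of blocking the thread.

```cpp
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Minimal illustration: with O_NONBLOCK, a read() that has nothing to
// consume returns -1/EAGAIN immediately instead of blocking the thread.
// Returns 0 on success, -1 if any step behaves unexpectedly.
int demo_nonblocking_read() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    // Switch the read end to non-blocking mode.
    int flags = fcntl(fds[0], F_GETFL, 0);
    if (fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != 0) return -1;

    char buf[16];
    // Nothing written yet: read() fails immediately with EAGAIN/EWOULDBLOCK.
    ssize_t n = read(fds[0], buf, sizeof(buf));
    if (!(n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))) return -1;

    // Once data arrives, the same call succeeds without ever blocking.
    if (write(fds[1], "ping", 4) != 4) return -1;
    n = read(fds[0], buf, sizeof(buf));
    if (n != 4 || memcmp(buf, "ping", 4) != 0) return -1;

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```

An event-dispatching design registers such fds with epoll/kqueue and only reads when readiness is reported, so no thread is parked on any single fd.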
# Receiving messages
A message is a piece of bounded binary data read from a connection, which may be a request from upstream clients or a response from downstream servers. brpc uses one or several [EventDispatcher](https://github.com/brpc/brpc/blob/master/src/brpc/event_dispatcher.cpp) (referred to as EDISP) to wait for events on file descriptors. Unlike common "IO threads", EDISP is not responsible for reading or writing. The problem with IO threads is that one thread can only read one fd at a given time, so some reads are delayed when many fds in one IO thread are busy. Multi-tenancy, complicated load balancing and [Streaming RPC](streaming_rpc.md) make the problem worse. Under high workloads, an occasional long read on one fd may slow down reads from all other fds in the same IO thread, greatly impacting usability.
Because of a [bug](https://patchwork.kernel.org/patch/1970231/) in epoll (at the time brpc was developed) and the overhead of epoll_ctl, edge-triggered mode is used in EDISP. After receiving an event, an atomic variable associated with the fd is incremented atomically. Only if the variable was zero before the increment is a bthread started to handle the data from the fd. The pthread worker in which EDISP runs is yielded to the newly created bthread to give it better cache locality and let it start reading as soon as possible. The bthread in which EDISP runs is stolen to another pthread and keeps running; this mechanism is the work stealing scheduling used in bthreads. To understand exactly how that atomic variable works, you can read [atomic instructions](atomic_instructions.md) first, then check [Socket::StartInputEvent](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp). These methods make contentions on dispatching events of one fd [wait-free](http://en.wikipedia.org/wiki/Non-blocking_algorithm#Wait-freedom).
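The atomic-counter trick above can be sketched as follows. This is a simplified model, not brpc's actual code (the real logic lives in `Socket::StartInputEvent` and friends): only the caller that lifts the counter from 0 starts a handler; later events just add 1 and are drained by the handler that is already running.

```cpp
#include <atomic>

// Sketch of per-fd event dispatch: _nevent counts pending events.
class EventOwner {
public:
    // Called by the event dispatcher for each edge-triggered event.
    // Returns true iff this call must start a new handling bthread.
    bool OnEvent() {
        // fetch_add returns the value *before* the addition.
        return _nevent.fetch_add(1, std::memory_order_acq_rel) == 0;
    }

    // Called by the handler after processing one round; returns true if
    // more events arrived meanwhile and the handler should loop again.
    bool FinishedOneRound() {
        int expected = _nevent.load(std::memory_order_acquire);
        while (true) {
            if (expected == 1) {
                // No new events: step down from 1 to 0 and exit.
                if (_nevent.compare_exchange_weak(
                        expected, 0, std::memory_order_acq_rel)) {
                    return false;
                }
            } else {
                // Events piled up: collapse the count back to 1 and
                // keep handling in the same bthread.
                if (_nevent.compare_exchange_weak(
                        expected, 1, std::memory_order_acq_rel)) {
                    return true;
                }
            }
        }
    }

private:
    std::atomic<int> _nevent{0};
};
```

Both paths are a bounded number of atomic operations on the dispatcher side, which is what makes contention between events of one fd wait-free.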
[InputMessenger](https://github.com/brpc/brpc/blob/master/src/brpc/input_messenger.h) cuts messages and uses customizable callbacks to handle different formats of data. The `Parse` callback cuts messages from binary data and has a relatively stable running time; `Process` parses messages further (such as deserializing by protobuf) and calls users' callbacks, whose running time varies. If n (n > 1) messages are read from the fd, InputMessenger launches n-1 bthreads to handle the first n-1 messages respectively and processes the last message in place. InputMessenger tries protocols one by one. Since one connection often carries only one type of message, InputMessenger remembers the current protocol to avoid trying all protocols next time.
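The "remember the last matching protocol" idea can be sketched like this. All types and names here are hypothetical simplifications, not InputMessenger's real API: each connection stores the index of the protocol that matched last time and tries it first.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class ParseResult { kOk, kTryOther };

struct Protocol {
    const char* name;
    // Parse should cut one message cheaply; here it only checks a prefix.
    ParseResult (*parse)(const std::string& data);
};

struct Connection {
    size_t preferred = 0;  // index of the protocol that matched last time
};

// Try protocols starting from the remembered one; return the index of
// the protocol that accepted the data, or -1 if none did.
int CutMessage(Connection& conn, const std::vector<Protocol>& protocols,
               const std::string& data) {
    for (size_t i = 0; i < protocols.size(); ++i) {
        size_t idx = (conn.preferred + i) % protocols.size();
        if (protocols[idx].parse(data) == ParseResult::kOk) {
            conn.preferred = idx;  // remember for following messages
            return (int)idx;
        }
    }
    return -1;
}
```

After the first message, later messages on the same connection hit the remembered protocol on the first try, so the per-message cost no longer grows with the number of registered protocols.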
It can be seen that messages from different fds, or even the same fd, are processed concurrently in brpc, which makes brpc good at handling large messages and at reducing long tails when processing messages from different sources under high workloads.
# Sending messages
A message is a piece of bounded binary data written to a connection, which may be a response to upstream clients or a request to downstream servers. Multiple threads may send messages to a fd at the same time, but writing to a fd is non-atomic, so how to queue writes from different threads efficiently is a key technique. brpc uses a special wait-free MPSC list to solve the issue. All data ready to write is put into a node of a singly-linked list, whose next pointer initially points to a special value (`Socket::WriteRequest::UNCONNECTED`). When a thread wants to write out some data, it first tries to atomically exchange the node with the list head (`Socket::_write_head`). If the head before the exchange is empty, the caller gets the right to write and writes out the data in place once. Otherwise another thread must be writing; the caller points its node's next pointer to the previous head to keep the linked list connected, and the thread that is writing will see the new node later and write its data.
This method makes the write contentions wait-free. Although the thread that gets the right to write is not wait-free or lock-free in principle and may be blocked by a node that is still UNCONNECTED (the thread issuing the write is swapped out by the OS just after the atomic exchange and before setting the next pointer, a window of just one instruction), such blocking rarely happens in practice. In the current implementation, if the data cannot be fully written in one call, a KeepWrite bthread is created to write the remaining data. This mechanism is fairly complicated and its principle is depicted below. Read [socket.cpp](https://github.com/brpc/brpc/blob/master/src/brpc/socket.cpp) for more details.
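The push side of this exchange-based list can be sketched as follows. This is a deliberately simplified model (real `Socket::WriteRequest` carries more state and the writer also has to drain and reverse the list): the only synchronization on the hot path is one atomic exchange.

```cpp
#include <atomic>
#include <string>

struct WriteRequest {
    std::string data;
    WriteRequest* next;  // set to a sentinel before publishing
};

// Sentinel standing in for Socket::WriteRequest::UNCONNECTED: the node
// is published but its next pointer is not linked yet.
static WriteRequest* const UNCONNECTED = (WriteRequest*)1;

struct FakeSocket {
    std::atomic<WriteRequest*> write_head{nullptr};

    // Returns true iff the caller obtained the right to write in place.
    bool Push(WriteRequest* req) {
        req->next = UNCONNECTED;
        WriteRequest* prev =
            write_head.exchange(req, std::memory_order_acq_rel);
        if (prev == nullptr) {
            // List was empty: this thread becomes the writer.
            req->next = nullptr;
            return true;
        }
        // Another thread is writing: link our node in front of the
        // previous head; the writer will discover it by walking next.
        req->next = prev;
        return false;
    }
};
```

Every caller performs exactly one exchange and at most two plain stores, regardless of contention, which is what makes the enqueue wait-free; only the single thread holding the write right ever touches the fd.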

Since writes in brpc almost always complete within a short time, the calling thread can handle new tasks more quickly, and the background KeepWrite bthread gets more tasks to write out in one batch, forming a pipeline and increasing IO efficiency at high throughput.
# Socket
[Socket](https://github.com/brpc/brpc/blob/master/src/brpc/socket.h) contains the data structures related to a fd and is one of the most complex structures in brpc. Its unique feature is that it uses a 64-bit SocketId to refer to a Socket object, which facilitates uses of the fd in multi-threaded environments. Three commonly used methods:
- Create: create a Socket and return its SocketId.
- Address: retrieve a Socket from its id and wrap it into a unique_ptr (SocketUniquePtr) that releases the reference automatically. When the Socket has been set failed, the returned pointer is empty. As long as Address returns a non-null pointer, its contents are guaranteed not to change until the pointer is destructed. This function is wait-free.
- SetFailed: mark a Socket as failed, after which any Address on the corresponding SocketId returns an empty pointer (until health checking revives the Socket). The Socket is recycled after its reference count hits zero. This function is lock-free.
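The contract of these three methods can be modeled in a few lines. This toy (hypothetical names, far simpler than brpc's real id-to-Socket mapping, and using shared_ptr where brpc uses intrusive counting) only shows the semantics: an id yields a usable strong reference until SetFailed, and existing references stay valid afterwards.

```cpp
#include <memory>

struct ToySocket {
    bool failed = false;
};

// A single-slot stand-in for the SocketId -> Socket table.
struct ToyTable {
    std::shared_ptr<ToySocket> slot;

    // Create: make a socket; the "id" is implicit (the single slot).
    void Create() { slot = std::make_shared<ToySocket>(); }

    // Address: return a strong reference, or empty after SetFailed.
    std::shared_ptr<ToySocket> Address() const {
        return (slot && !slot->failed) ? slot : nullptr;
    }

    // SetFailed: later Address() calls return empty; strong references
    // already handed out keep the object alive until they are dropped.
    void SetFailed() {
        if (slot) slot->failed = true;
    }
};
```

The real Socket adds the parts shared_ptr cannot express: wait-free Address, lock-free SetFailed, and a 64-bit id that fits into epoll data.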
We can see that, in terms of reference counting, Socket is similar to [shared_ptr](http://en.cppreference.com/w/cpp/memory/shared_ptr) and SocketId is similar to [weak_ptr](http://en.cppreference.com/w/cpp/memory/weak_ptr). SetFailed, which shared_ptr does not have, prevents the Socket from being addressed again so that its reference count can finally hit zero. Simply using shared_ptr/weak_ptr cannot guarantee this property: for example, when a server needs to quit while requests are still arriving frequently, the reference count of a Socket may never hit zero, leaving the server unable to quit. What's more, weak_ptr cannot be put directly into epoll data, but SocketId can. All these facts drove the design of Socket, whose core part has rarely changed since October 2014 and is very stable.
Using SocketUniquePtr or SocketId depends on whether a strong reference is needed. For example, Controller runs through the whole RPC process and interacts with Socket a lot, so it uses SocketUniquePtr. epoll notifies events on fds, and events of a recycled Socket can simply be ignored, so epoll stores SocketId in its data field.
As long as a SocketUniquePtr is valid, the enclosed Socket will not be changed, so users do not need to worry about race conditions or ABA problems and can operate the shared fd safely. This method also avoids implicit reference counting and makes the ownership of memory clear, producing better-quality programs. brpc uses SocketUniquePtr and SocketId a lot; they really simplify development.
In fact, Socket manages not only native fds but also other resources. For example, every SubChannel in a SelectiveChannel is placed into a Socket as well, so a SelectiveChannel can choose a SubChannel to send to just like a normal channel chooses a downstream server. This fake Socket even implements health checking. Streaming RPC also uses Socket to reuse the wait-free write mechanism.