# Common threading models

## Connections own threads or processes exclusively

In this model, a thread/process handles all messages from a connection and does not quit or do other jobs until the connection is closed. As the number of connections increases, the resources occupied by threads/processes and the costs of context switches become more and more overwhelming, making servers perform poorly. This situation is summarized as the [C10K](http://en.wikipedia.org/wiki/C10k_problem) problem, which was common in early web servers but is rarely present today.
## Single-threaded reactor
Event-loop libraries such as [libevent](http://libevent.org/) and [libev](http://software.schmorp.de/pkg/libev.html) are typical examples. In this model an event dispatcher is responsible for waiting on different kinds of events and calling the corresponding event handler **in-place** when an event occurs. After all pending handlers are called, the dispatcher waits for more events, which forms a "loop". Essentially this model multiplexes (interleaves) code written in different handlers onto one system thread. An event loop can only utilize one core, so this kind of program is suitable only when it is IO-bound or each handler runs in a short and deterministic time (as in HTTP servers); otherwise one long-running callback blocks the whole program and causes high latencies. In practice this kind of program is not suitable for many developers working together, because a single person adding inappropriate blocking code may significantly slow down the reactivity of all other code. Since event handlers don't run simultaneously, race conditions between callbacks are relatively simple, and in some scenarios locks are not needed. These programs are often scaled by deploying more processes.
How single-threaded reactors work and related problems are demonstrated below (the Chinese characters in red read: "Uncontrollable! unless the service is specialized"):
![img](../images/threading_overview_1.png)
## N:1 threading library
Also known as [fibers](http://en.wikipedia.org/wiki/Fiber_(computer_science)). Typical examples are [GNU Pth](http://www.gnu.org/software/pth/pth-manual.html) and [StateThreads](http://state-threads.sourceforge.net/index.html). This model maps N user threads onto a single system thread (LWP); only one user thread runs at a time, and the running user thread does not switch to another until a blocking primitive is called (cooperative scheduling). N:1 threading libraries are equal to single-threaded reactors in capability, except that callbacks are replaced by contexts (stacks, registers, signals) and running a callback becomes jumping to a context. Like event-loop libraries, an N:1 threading library cannot utilize multiple CPU cores, so it is only suitable for specialized applications. However, a single system thread is friendlier to CPU caches, and with support for signal masks removed, context switches between user threads can be done very fast (100~200ns). In general, N:1 threading libraries perform as well as event-loop libraries and are also scaled by deploying more processes.
## Multi-threaded reactor
Kylin and [boost::asio](http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio.html) are typical examples. One or several threads each run an event dispatcher; when an event occurs, the event handler is queued into a worker thread to run. This model extends the single-threaded reactor intuitively and is able to make use of multiple CPU cores. Since sharing memory addresses makes interactions between threads much cheaper, the worker threads are able to balance loads with each other frequently; by contrast, multiple single-threaded reactors basically depend on the front-end servers to distribute traffic. A well-implemented multi-threaded reactor is likely to utilize CPU cores more evenly than multiple single-threaded reactors on the same machine. However, due to [cache coherence](atomic_instructions.md#cacheline), multi-threaded reactors are unlikely to achieve linear scalability on CPU cores. In particular scenarios, a badly implemented multi-threaded reactor running on 24 cores is even slower than a well-tuned single-threaded reactor. Because a multi-threaded reactor has multiple worker threads, one blocking event handler may not delay other handlers; as a result, event handlers are not required to be non-blocking unless all worker threads are blocked, in which case the overall progress is affected. In fact, most RPC frameworks are implemented in this model with event handlers that may block, for example synchronously waiting for RPCs to downstream servers. The reactor pattern also has a proactor variant, which replaces the event dispatcher with asynchronous IO; boost::asio is a proactor on [Windows](http://msdn.microsoft.com/en-us/library/aa365198(VS.85).aspx).
How multi-threaded reactors work and problems related are demonstrated below:
![img](../images/threading_overview_2.png)
# What else can we improve?
## Extensibility is not good enough
Ideally the capabilities of the reactor model are maximized when all source code is programmed in an event-driven manner, but in reality, because of the difficulty of coding and maintaining callbacks, users are likely to mix usages: synchronous IO is often issued in callbacks, blocking worker threads from processing other requests. A request often goes through dozens of services, making worker threads spend a lot of time waiting for responses from downstream servers. Users have to launch hundreds of threads to maintain enough throughput, which imposes intensive pressure on scheduling and lowers the efficiency of TLS-related code. Tasks are often pushed into a queue protected by a global mutex and condition variable, which performs poorly when many threads are contending for it. A better approach is to deploy more task queues and adjust the scheduling algorithm to reduce global contention: each system thread has its own runqueue, and one or more schedulers dispatch user threads to different runqueues. A system thread runs user threads from its own runqueue before considering other runqueues, which is more complicated but more scalable than the global mutex+condition solution. This model also makes it easier to support NUMA.
## Asynchronous programming is difficult
Flow control in asynchronous programming is difficult even for experts. Any suspending operation, such as sleeping for a while or waiting for something to finish, implies that users have to save states explicitly and restore them in callbacks. Asynchronous code is often written as state machines. A few suspensions are troublesome but still manageable. The problem is that once a suspension occurs inside a condition, a loop or a sub-function, it's almost impossible to write a state machine that can be understood and maintained by many people, although the scenario is quite common in distributed systems, where a node often needs to interact with multiple other nodes simultaneously. In addition, if the wakeup can be triggered by more than one event (e.g. either the fd has data or the timeout is reached), the suspension and resumption are prone to race conditions, which require good multi-threaded programming skills to solve. Syntactic sugar (such as lambdas) makes coding less troublesome but does not reduce the difficulty.
## [RAII](http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization) cannot be used in asynchronous programming

RAII cannot be fully utilized in asynchronous programming. A common workaround is to use shared pointers, which seems convenient but also makes the ownership of memory elusive. If memory is leaked, it's difficult to locate the code that forgot to release it; if a segment fault happens, where the double-free occurred is equally unknown. Code with a lot of reference counting is hard to keep at good quality and may waste a lot of time on debugging memory-related issues. If references are even counted manually, keeping the quality of the code is harder still and maintainers become reluctant to modify it. The absence of RAII also makes synchronization primitives more error-prone: `lock_guard` cannot be used, and a lock acquired outside a callback has to be released inside the callback, which is very error-prone in practice.
## Multi-core scalability
When an event dispatcher passes a task to a worker thread, the user code probably jumps from one CPU core to another and may need to wait for synchronization of the relevant cachelines, which is not very fast and may take several microseconds. It would be better if the worker could run directly on the CPU core where the event dispatcher runs: at this time scale most systems do not have intensive event flows, so running an existing task as soon as possible has higher priority than getting new events from the dispatcher. Similarly, it's better to wake up a user thread blocking on an RPC on the same CPU core where the response is received.
# M:N threading library
This model maps M user threads onto N system threads (LWP). An M:N threading library is able to decide when and where to run a piece of code and when to end its execution, which is more flexible at scheduling than multi-threaded reactors. However, full-featured M:N threading libraries are difficult to implement and remain an active research topic. The M:N threading libraries we talk about here are specialized for building online services, in which case some requirements can be simplified, namely no (complete) preemption and no priorities. M:N threading libraries can be implemented either in userland or in the OS kernel. New programming languages prefer implementations in userland, such as GHC threads and goroutines, since they can add brand-new keywords and intercept threading-related APIs. Implementations for existing languages often have to modify the OS kernel, such as [Windows UMS](https://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx) and Google SwitchTo (which is 1:1; however, M:N effects can be achieved on top of it). Compared to N:1 threading libraries, M:N threading libraries are closer to system threads in usage and need locks or message passing to ensure thread safety. Let's see how the problems above are addressed in this model:

- Although an M:N threading library and a multi-threaded reactor are equivalent in capability, coding in a synchronous manner is significantly easier than in an event-driven manner, and most people can learn synchronous programming quickly.
- There is no need to split a function into several callbacks, and RAII can be used.
- When switching from user thread A to user thread B, B may be run on the core on which A was running while A is moved to another core, so that the latency-sensitive B is less affected by cache misses.