The Nginx solution to the thundering herd problem
Conclusions on the thundering herd effect
The thundering herd effect exists whether you use multiple processes or multiple threads; this article analyses the multi-process case.
Since Linux 2.6, the thundering herd problem of the accept system call has been solved (provided that no event notification mechanism such as select, poll, or epoll is used).
Linux has since partially solved the thundering herd problem for epoll (for an epoll instance created before fork); Linux 2.6 itself did not solve it.
An epoll instance created after fork still suffers from the thundering herd effect, and Nginx uses its own mutex to solve it.
What is the thundering herd effect?
The thundering herd effect occurs when multiple processes (or threads) block waiting for the same event (in the sleep state). When that event occurs, all of the waiting processes (or threads) are woken up, but in the end only one of them can obtain "control" of the event and handle it; the others fail to get control and can only go back to sleep. This phenomenon, and the performance it wastes, is called the thundering herd effect.
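To make this concrete, here is a tiny self-contained experiment (an illustrative sketch, not code from the article; the port number 8080 is an arbitrary assumption): several workers block in accept on the same listening socket, and a single connection shows how many of them wake up. On a post-2.6 kernel only one worker should report a wakeup, as the article describes; on an older kernel all of them would.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(8080);                  /* assumed free port */
    bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
    listen(lfd, 128);

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {                        /* each worker inherits lfd */
            int cfd = accept(lfd, NULL, NULL);    /* all four block here */
            printf("worker %d woke up, accept returned %d\n", (int) getpid(), cfd);
            _exit(0);
        }
    }

    /* connect once, e.g. `nc 127.0.0.1 8080`, and count the "woke up" lines */
    sleep(60);
    return 0;
}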
What does the thundering herd effect cost?
The Linux kernel performs a large amount of useless scheduling and context switching of user processes (threads), which significantly reduces system performance. When the context-switch rate is too high, the CPU behaves like a porter, constantly shuttling between registers and the run queue; more time is spent switching processes (threads) than doing real work in them. The direct cost is that CPU registers (for example the program counter) must be saved and reloaded, and the scheduler code must run. The indirect cost lies in the data shared between the caches of different cores.
To ensure that only one process (thread) obtains the resource, operations on the resource must be protected by a lock, which adds overhead to the system. Some common server software solves the problem with a lock mechanism, for example Nginx (its lock is enabled by default and can be turned off); others consider the thundering herd to have little impact on performance and do not handle it, for example Lighttpd.
The Linux solution for accept
Before Linux 2.6, all processes listening on the same socket hung on the same wait queue, and when a request arrived every waiting process was woken up.
From Linux 2.6 onwards, the accept thundering herd is solved by introducing the flag WQ_FLAG_EXCLUSIVE.
The details are explained in the code comments; the relevant fragment of the accept implementation is as follows:
// When accept is called and no connection is pending, it blocks (unless the socket
// is non-blocking). The blocking happens in inet_csk_accept (the function behind accept).
struct sock *inet_csk_accept(struct sock *sk, int flags, int *err)
{
    ...
    // Wait for a connection
    error = inet_csk_wait_for_connect(sk, timeo);
    ...
}

static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
{
    ...
    for (;;) {
        // Only one process will be woken up.
        // Non-exclusive entries are added at the head of the wait queue;
        // exclusive entries are added after all non-exclusive ones.
        prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
        ...
    }
    ...
}

void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
    unsigned long flags;

    // Mark the wait queue entry as EXCLUSIVE, which means only one process will be
    // woken at a time. Note this flag: the wakeup path checks it.
    wait->flags |= WQ_FLAG_EXCLUSIVE;
    spin_lock_irqsave(&q->lock, flags);
    if (list_empty(&wait->task_list))
        // Add to the tail of the wait queue
        __add_wait_queue_tail(q, wait);
    set_current_state(state);
    spin_unlock_irqrestore(&q->lock, flags);
}
The code path that wakes up the blocked accept is as follows:
// When a TCP handshake completes, the socket is copied from the half-open queue to
// the accept queue; at that point the blocked accept can be woken up.
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    ...
    // Note this function
    if (tcp_child_process(sk, nsk, skb)) {
        rsk = nsk;
        goto reset;
    }
    ...
}

int tcp_child_process(struct sock *parent, struct sock *child, struct sk_buff *skb)
{
    ...
    // Wakeup parent, send SIGIO: wake up the parent process
    if (state == TCP_SYN_RECV && child->sk_state != state)
        // Call sk_data_ready to notify the parent process.
        // For TCP this callback is sock_def_readable, which calls
        // wake_up_interruptible_sync_poll to wake the wait queue.
        parent->sk_data_ready(parent, 0);
    ...
}

void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, void *key)
{
    ...
    // Note this function
    __wake_up_common(q, mode, nr_exclusive, wake_flags, key);
    spin_unlock_irqrestore(&q->lock, flags);
    ...
}

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int wake_flags, void *key)
{
    ...
    // nr_exclusive is passed in as 1, so once an entry with WQ_FLAG_EXCLUSIVE set
    // has been woken, the loop breaks. Recall that accept adds its entry to the
    // wait queue with WQ_FLAG_EXCLUSIVE.
    list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
        unsigned flags = curr->flags;

        if (curr->func(curr, mode, wake_flags, key)
            && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    ...
}
The Linux solution for epoll
When an IO multiplexing mechanism such as select, poll, epoll, or kqueue is used, the multi-process (multi-thread) handling of connections is more complicated.
Therefore, when discussing the thundering herd effect with epoll, two cases must be distinguished:
epoll_create is called before fork
epoll_create is called after fork
epoll_create is called before fork
For the same reason as the accept thundering herd, when an event occurs all processes (threads) waiting on the same file descriptor are woken up, and the solution follows the same idea as for accept.
Why wake them all? Because the kernel does not know whether you are waiting on this file descriptor in order to call accept(), or for something else (signal handling, timer events).
The thundering herd in this case has been solved.
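The pattern looks roughly like the sketch below (an illustration, not code from the article; the port 8080 is an arbitrary assumption): the parent creates the epoll instance and registers the listening socket, then forks, so every worker blocks in epoll_wait on the same shared kernel object, which is the situation the article says the kernel treats like the accept case.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET, .sin_port = htons(8080) };
    bind(lfd, (struct sockaddr *) &a, sizeof(a));
    listen(lfd, 128);
    fcntl(lfd, F_SETFL, O_NONBLOCK);       /* a worker that loses the race gets EAGAIN */

    int epfd = epoll_create1(0);           /* created BEFORE fork, shared by all workers */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            struct epoll_event out;
            for (;;) {
                if (epoll_wait(epfd, &out, 1, -1) > 0) {
                    int cfd = accept(lfd, NULL, NULL);
                    printf("worker %d woke, accept=%d\n", (int) getpid(), cfd);
                    if (cfd >= 0) close(cfd);
                }
            }
        }
    }

    pause();
    return 0;
}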
epoll_create is called after fork
Recall that if epoll_create is called before fork, all processes share a single epoll instance (one red-black tree).
If we only had to handle accept events, the world would look like a wonderful place. But epoll does not handle only accept events: the read and write events that follow the accept have to be handled, as do timer and signal events.
When a connection arrives, one process has to be chosen to accept it, and at that point any process may do the accept. Once the connection is established, however, its subsequent read and write events are tied to that process: after a request has established a connection with process A, the subsequent reads and writes should also be done by process A.
When a read or write event occurs, which process should be notified? epoll does not know, so the event may be delivered to the wrong process, which is incorrect. Therefore an epoll event loop is usually created in each process (thread) separately, and each process registers its read and write events only in its own epoll instance.
Recall that epoll's fix for the thundering herd relies on sharing the same epoll structure. When epoll_create is executed after fork, each process has its own separate epoll red-black tree, wait queue, and ready-event list, so the thundering herd reappears: sometimes all processes are woken, sometimes only some of them, possibly because the event has already been consumed by some processes so there is no need to notify the rest.
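A sketch of this per-process pattern (illustrative only, not Nginx code; the port and buffer size are assumptions): each worker creates its own epoll after fork, watches the shared listening socket, and registers every connection it accepts only in its own instance, so read and write events always reach the right process, while the shared listening fd is exactly what brings the herd back.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>

static void worker(int lfd)
{
    int epfd = epoll_create1(0);                     /* created AFTER fork: private */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event out;
    for (;;) {
        if (epoll_wait(epfd, &out, 1, -1) <= 0)
            continue;

        if (out.data.fd == lfd) {
            int cfd = accept(lfd, NULL, NULL);       /* may lose the race: EAGAIN */
            if (cfd < 0)
                continue;
            struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
            epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);   /* only in OUR epoll */
        } else {
            char buf[4096];
            ssize_t n = read(out.data.fd, buf, sizeof(buf));
            if (n <= 0)
                close(out.data.fd);                  /* peer closed or error */
            /* ... handle the request ... */
        }
    }
}

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(8080) };
    bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
    listen(lfd, 128);
    fcntl(lfd, F_SETFL, O_NONBLOCK);                 /* a lost accept race returns EAGAIN */

    for (int i = 0; i < 4; i++)
        if (fork() == 0)
            worker(lfd);

    pause();
    return 0;
}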
The design of the Nginx lock
First of all, we need to understand how an inter-process lock can be implemented in user space. The basic principle is simple: make something shared by all processes, such as mmap'd memory or a file, and then use that shared thing to enforce mutual exclusion between processes.
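A minimal sketch of that principle (an illustration, not Nginx's code, and it assumes C11 atomics are lock-free on the platform): a word of anonymous shared memory is mapped before fork, and a compare-and-swap on it acts as a cross-process trylock.

#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One machine word shared by every process forked from here on
       (error handling omitted for brevity). */
    atomic_int *lock = mmap(NULL, sizeof(atomic_int), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    atomic_init(lock, 0);                            /* 0 means "unlocked" */

    if (fork() == 0) {
        int expected = 0;
        /* Trylock: atomically change 0 -> our pid; fails if another process holds it. */
        if (atomic_compare_exchange_strong(lock, &expected, (int) getpid())) {
            /* ... critical section ... */
            atomic_store(lock, 0);                   /* unlock */
        }
        _exit(0);
    }

    wait(NULL);
    return 0;
}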
The locks used in Nginx are implemented by Nginx itself, in two variants: one for platforms that support atomic operations, controlled by the macro NGX_HAVE_ATOMIC_OPS, and one for platforms that do not, implemented with file locks.
Lock structure
If atomic operations are supported, we can use mmap directly, and lock holds the address of the mmap'd memory region.
If atomic operations are not supported, file locks are used instead: fd is the file handle shared between processes, and name is the file name.
typedef struct {
#if (NGX_HAVE_ATOMIC_OPS)
ngx_atomic_t *lock;
#else
ngx_fd_t fd;
u_char *name;
#endif
} ngx_shmtx_t;
Atomic lock creation
// If atomic operations are supported, creation is trivial: the address of the
// shared memory is simply assigned to the lock field.
ngx_int_t ngx_shmtx_create(ngx_shmtx_t *mtx, void *addr, u_char *name)
{
    mtx->lock = addr;

    return NGX_OK;
}
Atomic lock acquisition
trylock is non-blocking: it tries to acquire the lock and, if it cannot, returns an error immediately.
The blocking lock also tries to acquire the lock, but when it fails it does not return right away; instead it enters a loop and keeps trying until the lock is acquired. Nginx uses a trick here: between attempts it puts the current process at the back of the CPU's run queue, that is, it voluntarily gives up the CPU.
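The idea can be sketched as follows (a simplified illustration, not Nginx's exact code; the shmtx_* names are made up here, while ngx_shmtx_t, ngx_pid and ngx_atomic_cmp_set are the ones discussed in this article):

// Trylock: succeed only if the lock word is 0, by atomically writing our pid into it.
#define shmtx_trylock(mtx)                                                    \
    (*(mtx)->lock == 0 && ngx_atomic_cmp_set((mtx)->lock, 0, ngx_pid))

// Blocking lock: spin on trylock, yielding the CPU between attempts so the
// process is moved to the back of the run queue instead of burning cycles.
static void shmtx_lock(ngx_shmtx_t *mtx)
{
    for ( ;; ) {
        if (shmtx_trylock(mtx)) {
            return;
        }

        sched_yield();          /* from <sched.h>: voluntarily give up the CPU */
    }
}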
Atomic lock implementation
If the system library supports it, the CAS (compare-and-swap) provided by the library is called directly, for example OSAtomicCompareAndSwap32Barrier on Darwin:
#define ngx_atomic_cmp_set(lock, old, new)                                    \
    OSAtomicCompareAndSwap32Barrier(old, new, (int32_t *) lock)
If the system library does not provide such a primitive, Nginx implements one itself in assembly:
static ngx_inline ngx_atomic_uint_t
ngx_atomic_cmp_set(ngx_atomic_t *lock, ngx_atomic_uint_t old, ngx_atomic_uint_t set)
{
    u_char  res;

    __asm__ volatile (

         NGX_SMP_LOCK
    "    cmpxchgl  %3, %1;   "
    "    sete      %0;       "

    : "=a" (res) : "m" (*lock), "a" (old), "r" (set) : "cc", "memory");

    return res;
}
Atomic lock release
Unlocking is relatively simple: compare the lock value with the current process id, and if they are equal set the lock to 0, which releases it.
#define ngx_shmtx_unlock(mtx)                                                 \
    (void) ngx_atomic_cmp_set((mtx)->lock, ngx_pid, 0)
How Nginx solves the thundering herd effect
Variable analysis
// If the master/worker model is used, the number of workers is greater than 1,
// and accept_mutex is enabled in the configuration file, set ngx_use_accept_mutex.
if (ccf->master && ccf->worker_processes > 1 && ecf->accept_mutex)
{
    ngx_use_accept_mutex = 1;
    // The two variables below are explained later.
    ngx_accept_mutex_held = 0;
    ngx_accept_mutex_delay = ecf->accept_mutex_delay;

} else {
    ngx_use_accept_mutex = 0;
}
ngx_use_accept_mutex: if set, nginx needs to use the accept mutex; it is initialized in ngx_event_process_init.
ngx_accept_mutex_held: whether this process currently holds the lock.
ngx_accept_mutex_delay: how long to wait before requesting the lock again after failing to get it; this interval can be set in the configuration file.
ngx_accept_disabled = ngx_cycle->connection_n / 8
                      - ngx_cycle->free_connection_n;
ngx_accept_disabled: a threshold; if it is greater than 0, the current process is already handling too many connections.
Whether to use the lock
// Only do any of this if the accept mutex is in use.
if (ngx_use_accept_mutex) {
    // If greater than 0, skip the lock handling below and just decrement.
    if (ngx_accept_disabled > 0) {
        ngx_accept_disabled--;

    } else {
        // Try to acquire the lock; return on error.
        if (ngx_trylock_accept_mutex(cycle) == NGX_ERROR) {
            return;
        }

        // If ngx_accept_mutex_held is 1, the lock has been acquired,
        // so set the flag, which is explained below.
        if (ngx_accept_mutex_held) {
            flags |= NGX_POST_EVENTS;

        } else {
            // Otherwise set the timer; this is also explained below.
            if (timer == NGX_TIMER_INFINITE
                || timer > ngx_accept_mutex_delay)
            {
                timer = ngx_accept_mutex_delay;
            }
        }
    }
}

// If ngx_posted_accept_events is not NULL, there are accept events for nginx to process.
if (ngx_posted_accept_events) {
    ngx_event_process_posted(cycle, &ngx_posted_accept_events);
}
The NGX_POST_EVENTS flag. Setting this flag means that when epoll reports a ready socket, we do not accept or read it immediately; the event is queued (accept events go into ngx_posted_accept_events, as shown above) and handled afterwards, with the read/write work deferred until after the lock has been released, so the lock is held as briefly as possible.
If NGX_POST_EVENTS is not set, nginx accepts or reads the handle immediately.
The timer. If nginx fails to obtain the lock, it does not retry immediately; it sets the timer and then sleeps in epoll (assuming nothing else wakes it). If an event arrives during this time, the sleeping process is woken up early and handles it right away; otherwise it sleeps for at most ngx_accept_mutex_delay and then tries the lock again.
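Putting the pieces together, the tail of each iteration of the worker's event loop looks roughly like the sketch below (simplified from ngx_process_events_and_timers as I read the Nginx sources; treat it as an outline rather than exact code):

// epoll_wait via the event module; with NGX_POST_EVENTS set in flags,
// the events are queued rather than handled inside this call.
(void) ngx_process_events(cycle, timer, flags);

// Drain the queued accept events first, while this worker still owns the
// listening sockets, so the new connections are taken here.
ngx_event_process_posted(cycle, &ngx_posted_accept_events);

// Release the accept mutex before the potentially slow read/write work,
// so another worker can start accepting in the meantime.
if (ngx_accept_mutex_held) {
    ngx_shmtx_unlock(&ngx_accept_mutex);
}

// Finally handle the deferred read/write events of existing connections.
ngx_event_process_posted(cycle, &ngx_posted_events);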
Acquiring the lock to solve the thundering herd
ngx_int_t ngx_trylock_accept_mutex(ngx_cycle_t *cycle)
{
    // Try to acquire the lock.
    if (ngx_shmtx_trylock(&ngx_accept_mutex)) {

        // If we already held the lock, just return OK.
        if (ngx_accept_mutex_held
            && ngx_accept_events == 0
            && !(ngx_event_flags & NGX_USE_RTSIG_EVENT))
        {
            return NGX_OK;
        }

        // Reaching here means the lock has just been (re)acquired,
        // so the listening handles that were disabled must be re-enabled.
        if (ngx_enable_accept_events(cycle) == NGX_ERROR) {
            ngx_shmtx_unlock(&ngx_accept_mutex);
            return NGX_ERROR;
        }

        ngx_accept_events = 0;
        // Mark the lock as held.
        ngx_accept_mutex_held = 1;

        return NGX_OK;
    }

    // If we held the lock before but failed to get it this time, the listening
    // handles are now being watched by another process, so remove the registered
    // listening handles from this process's epoll. This also gives good load
    // balancing among the worker processes.
    if (ngx_accept_mutex_held) {
        if (ngx_disable_accept_events(cycle) == NGX_ERROR) {
            return NGX_ERROR;
        }

        // Mark the lock as no longer held.
        ngx_accept_mutex_held = 0;
    }

    return NGX_OK;
}
As the code above shows, when a connection arrives, only the process that has grabbed the lock still has the listening fd in its epoll event list, and it is the one that accepts (deferring the work via NGX_POST_EVENTS as described earlier). The processes that did not grab the lock have removed the fd from their event lists, so they are not woken up for connections they cannot accept and no resources are wasted. At the same time, thanks to the lock (and the timer used when acquiring it fails), every process gets a relatively fair chance to accept, which nicely solves load balancing among the worker processes.