
Linux Scheduler Source Code Analysis 4/4: Running

心不留意外塵 >《task sys》

2016.05.11


http://blog.chinaunix.net/uid-26772321-id-4904090.html

2015

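Introduction

The previous articles analyzed the scheduler's data structures, its initialization, and how processes are added to it. This article explains how the scheduler schedules processes while the system is running steadily.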

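System timers

Our main topic is the scheduler, but some knowledge of system timers is needed, so this section briefly covers how the kernel organizes timers and how a timer drives the scheduler at regular intervals. We start with the kernel's timer framework.

The kernel describes each hardware timer with a struct clock_event_device. Every hardware timer has its own precision and raises a clock interrupt at intervals determined by that precision. Each CPU has a tick_device that describes the hardware timer it currently uses (because each CPU has its own run queue). The clock interrupt of that hardware timer accumulates the tick count (jiffies), although only one CPU is responsible for this, and the same interrupt also calls into the scheduler. The low-resolution timers commonly used in drivers are implemented by checking jiffies. With high-resolution timers (hrtimer) the situation is different: an ordinary hrtimer is created whose callback drives the scheduler, and its interval is the same as the clock tick.

So on every clock tick the scheduler checks once whether a reschedule is needed.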

Clock interrupt

When a clock interrupt fires, the first function called is tick_handle_periodic(), which does most of its work through tick_periodic(). Let's look at tick_handle_periodic() first:

void tick_handle_periodic(struct clock_event_device *dev)
{
    /* Get the current CPU */
    int cpu = smp_processor_id();
    /* Get the expiry time programmed into the device; the next trigger time is computed from it */
    ktime_t next = dev->next_event;
    tick_periodic(cpu);
    /* In periodic mode, return right away */
    if (dev->mode != CLOCK_EVT_MODE_ONESHOT)
        return;
    /* By the time this function is called, more than one tick period may already have elapsed on the clock_event_device. In that case tick_periodic() may be called several times so that jiffies and the time are updated correctly. */
    for (;;) {
        /*
         * Setup the next period for devices, which do not have
         * periodic mode:
         */
        /* Compute the next trigger time */
        next = ktime_add(next, tick_period);
        /* Program the next trigger time; a return value of 0 means success */
        if (!clockevents_program_event(dev, next, false))
            return;
        /*
         * Have to be careful here. If we're in oneshot mode,
         * before we call tick_periodic() in a loop, we need
         * to be sure we're using a real hardware clocksource.
         * Otherwise we could get trapped in an infinite
         * loop, as the tick_periodic() increments jiffies,
         * which then will increment time, possibly causing
         * the loop to trigger again and again.
         */
        if (timekeeping_valid_for_hres())
            tick_periodic(cpu);
    }
}

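This function mainly runs tick_periodic() and then checks whether the clock event device is in one-shot mode or periodic mode. In periodic mode it returns immediately. In one-shot mode it does the following:

1. Compute the next trigger time.
2. Program the next trigger time; if that succeeds, return.
3. If programming fails (the computed time has already passed) and timekeeping_valid_for_hres() confirms a real hardware clocksource, call tick_periodic() again to account for the missed tick.
4. Go back to step 1.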

Inside tick_periodic(), the main call path is tick_periodic() -> update_process_times() -> scheduler_tick(). The last of these, scheduler_tick(), is the main scheduling-related function. Let's first look at tick_periodic() and update_process_times() in detail:

/* The tick_device calls this function periodically.
 * It updates jiffies and the current process.
 * Only one CPU is responsible for updating jiffies; the other CPUs only update their own current process.
 */
static void tick_periodic(int cpu)
{
    if (tick_do_timer_cpu == cpu) {
        /* This CPU is responsible for updating the time */
        write_seqlock(&jiffies_lock);
        /* Keep track of the next tick event */
        tick_next_period = ktime_add(tick_next_period, tick_period);
        /* Update the jiffies counter: jiffies += 1 */
        do_timer(1);
        write_sequnlock(&jiffies_lock);
        /* Update the wall-clock time, i.e. the time of day as we know it */
        update_wall_time();
    }
    /* Update the current process's information; this is the main entry into the scheduler */
    update_process_times(user_mode(get_irq_regs()));
    profile_tick(CPU_PROFILING);
}

void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();
    /* Note: this timer irq context must be accounted for as well. */
    /* Update the current process's kernel-mode and user-mode CPU usage */
    account_process_tick(p, user_tick);
    /* Check whether any timers have expired and, if so, run their handlers */
    run_local_timers();
    rcu_check_callbacks(cpu, user_tick);
#ifdef CONFIG_IRQ_WORK
    if (in_irq())
        irq_work_tick();
#endif
    /* The scheduler's tick */
    scheduler_tick();
    run_posix_cpu_timers(p);
}

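Together these two functions increment jiffies by 1, update the system wall-clock time, update the current process's kernel-mode and user-mode CPU usage, and check for expired timers and run them. After that comes the most important function, scheduler_tick(). What does it do? It updates some data for the CPU and the current process, then calls task_tick() through the current process's scheduling class. For ordinary processes, the scheduling class's task_tick() is task_tick_fair().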

void scheduler_tick(void)
{
    /* Get the ID of the current CPU */
    int cpu = smp_processor_id();
    /* Get the current CPU's rq */
    struct rq *rq = cpu_rq(cpu);
    /* Get the task currently running on this CPU, which is in fact current */
    struct task_struct *curr = rq->curr;
    /* Update this CPU's sched_clock bookkeeping for this tick */
    sched_clock_tick();
    raw_spin_lock(&rq->lock);
    /* Update the clock of this CPU's rq */
    update_rq_clock(rq);
    curr->sched_class->task_tick(rq, curr, 0);
    /* Update the CPU load */
    update_cpu_load_active(rq);
    raw_spin_unlock(&rq->lock);
    perf_event_task_tick();
#ifdef CONFIG_SMP
    rq->idle_balance = idle_cpu(cpu);
    trigger_load_balance(rq);
#endif
    /* rq->last_sched_tick = jiffies; */
    rq_last_tick_reset(rq);
}

/*
 * task_tick() of the CFS scheduling class
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;
    /* Walk up the group hierarchy, updating each level */
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        /* Update the current entity's run time and decide whether it should be scheduled out */
        entity_tick(cfs_rq, se, queued);
    }
    if (numabalancing_enabled)
        task_tick_numa(rq, curr);
    update_rq_runnable_avg(rq, 1);
}

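Clearly the most important function at this point is entity_tick(), because it decides whether the current process should be scheduled out. One thing must be clear first: CFS organizes processes in a red-black tree keyed by vruntime. The smaller a process's vruntime, the further left it sits in the tree, and the next process to be scheduled is always the one at the leftmost node. While a process runs, its vruntime grows with its actual run time, but the rate depends on its weight: the larger the weight (the higher the priority) of the running process, the more slowly its vruntime grows, and so the more CPU time it receives. On every clock interrupt, entity_tick() updates the current process's vruntime. While a process is not running on a CPU, its vruntime stays unchanged.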

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    /* Update the current process's run time, including its virtual run time */
    update_curr(cfs_rq);
    /*
     * Ensure that runnable average is periodically updated.
     */
    update_entity_load_avg(curr, 1);
    update_cfs_rq_blocked_load(cfs_rq, 1);
    update_cfs_shares(cfs_rq);
#ifdef CONFIG_SCHED_HRTICK
    /*
     * queued ticks are scheduled to match the slice, so don't bother
     * validating it and just reschedule.
     */
    /* If queued is 1, the process running on this run queue must be rescheduled */
    if (queued) {
        /* Mark the current process as needing to be scheduled out */
        resched_curr(rq_of(cfs_rq));
        return;
    }
    /*
     * don't let the period tick interfere with the hrtick preemption
     */
    if (!sched_feat(DOUBLE_TICK) && hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif
    /* Check whether a reschedule is needed */
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);
}

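A later article will cover how CFS handles vruntime in detail; for now it is enough to know that it works this way. entity_tick() first updates the current process's actual and virtual run times. This matters, because the updated values are used to decide whether the process should be scheduled out. check_preempt_tick(), called at the end of entity_tick(), makes that decision using two criteria:

First, check whether the actual run time of the current process exceeds the CPU time allotted to it for this period (ideal_runtime). If it does, a reschedule is needed.
Otherwise, once the process has run for at least sysctl_sched_min_granularity, check whether its vruntime exceeds the vruntime of the next (leftmost) process by more than ideal_runtime. If it does, a reschedule is needed.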

With these two criteria in mind, the code of check_preempt_tick() is easy to follow.

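/*
 * Check whether the current process should be preempted.
 * There are two tests. The first checks whether the current process has used more than the actual run time the CPU allotted to it.
 * The second checks whether its virtual run time is ahead of the next process's virtual run time by more than that allotment.
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    /* ideal_runtime is how long the process should run.
     * delta_exec is the actual run time the process has added since it was picked.
     * If delta_exec exceeds ideal_runtime, the process should give the CPU to others.
     */
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;
    /* The period is the actual time needed for every process on the CFS queue to run once */
    /* ideal_runtime holds the actual run time allotted to the current process within one period:
     * run time in one period = period length * weight of the current process / total weight of the queue
     * delta_exec holds the actual run time the current process has added
     */
    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) {
        /* Added actual run time > allotted run time: the process must be scheduled out */
        resched_curr(rq_of(cfs_rq));
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        /* Clear the last, next and skip pointers of the cfs_rq */
        clear_buddies(cfs_rq, curr);
        return;
    }
    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (delta_exec < sysctl_sched_min_granularity)
        return;
    /* Get the se of the next process to be scheduled (the leftmost entity) */
    se = __pick_first_entity(cfs_rq);
    /* Current process's virtual run time minus the next process's virtual run time */
    delta = curr->vruntime - se->vruntime;
    /* The current process's virtual run time is smaller than the next process's, so it may keep running */
    if (delta < 0)
        return;
    if (delta > ideal_runtime)
        /* The current process's virtual run time is ahead of the next process's by more than ideal_runtime, so the next process deserves the CPU more. resched_curr() marks the current process as needing to be scheduled out */
        resched_curr(rq_of(cfs_rq));
}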

/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
/* Mark the current process as needing a reschedule by setting TIF_NEED_RESCHED in its thread_info->flags */
void resched_curr(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    int cpu;
    lockdep_assert_held(&rq->lock);
    /* If the reschedule flag is already set on the current process, there is no need to set it again; return */
    if (test_tsk_need_resched(curr))
        return;
    /* Get the CPU from the rq */
    cpu = cpu_of(rq);
    /* If that CPU is the one we are running on, set the flag on the current process */
    if (cpu == smp_processor_id()) {
        /* Set the flag that says the current process must be scheduled out; it is stored in the process's thread_info */
        set_tsk_need_resched(curr);
        /* Set the CPU's kernel-preemption need-resched state */
        set_preempt_need_resched();
        return;
    }
    /* The rq belongs to another CPU: set the flag on its current process and notify that CPU */
    if (set_nr_and_not_polling(curr))
        smp_send_reschedule(cpu);
    else
        trace_sched_wake_idle_without_ipi(cpu);
}

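At this point a process that needs to be rescheduled has been marked, and a process that does not simply keeps running. You may wonder why the process is only marked and not actually switched out. As explained in my post "Linux Scheduler Source Code Analysis - Overview (1)", one of the moments at which scheduling takes place is the return from an interrupt, and that part is implemented in assembly. Everything described above runs inside the clock interrupt. When it is done and the clock interrupt returns, the assembly routine ret_from_sys_call is reached (the exact entry label depends on the architecture and kernel version). It checks whether the reschedule flag is set and, if so, jumps to schedule(), which ends up calling __schedule() to do the work.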

static void __sched __schedule(void)
{
    /* prev is the process being switched out (the current one), next is the process being switched in */
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;
need_resched:
    /* Disable preemption */
    preempt_disable();
    /* Get the current CPU ID */
    cpu = smp_processor_id();
    /* Get the current CPU's run queue */
    rq = cpu_rq(cpu);
    rcu_note_context_switch(cpu);
    prev = rq->curr;
    schedule_debug(prev);
    if (sched_feat(HRTICK))
        hrtick_clear(rq);
    /*
     * Make sure that signal_pending_state()->signal_pending() below
     * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
     * done by the caller to avoid the race with signal_wake_up().
     */
    smp_mb__before_spinlock();
    /* Lock the run queue */
    raw_spin_lock_irq(&rq->lock);
    /* Default to the involuntary context switch counter of the current process */
    switch_count = &prev->nivcsw;
    /*
     * On kernel preemption the PREEMPT_ACTIVE bit of thread_info's preempt_count is set and cleared again after schedule(); a set PREEMPT_ACTIVE bit means we got here through kernel preemption.
     * preempt_count() returns the whole preempt_count value.
     * A non-zero prev->state means the process is not in the TASK_RUNNING state.
     */
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        /* The current process is not TASK_RUNNING and did not enter the scheduler through kernel preemption */
        if (unlikely(signal_pending_state(prev->state, prev))) {
            /* A signal is waiting to be handled: set the state back to TASK_RUNNING */
            prev->state = TASK_RUNNING;
        } else {
            /* No pending signal: remove this process from the run queue */
            /* Reaching this point means the current process is either about to exit or about to sleep */
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;
            /*
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                /* The current process is a workqueue worker */
                struct task_struct *to_wakeup;
                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;
    }
    /* Update the rq clock */
    if (task_on_rq_queued(prev) || rq->skip_clock_update < 0)
        update_rq_clock(rq);
    /* Pick the next entity to run. next is always a process, never a scheduling group: pick_next_task descends through the groups until it selects a process */
    next = pick_next_task(rq, prev);
    /* Clear TIF_NEED_RESCHED in prev's thread_info flags and the PREEMPT_NEED_RESCHED bit. Both say the task should be scheduled out; since that is happening now, they are no longer needed */
    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();
    rq->skip_clock_update = 0;
    if (likely(prev != next)) {
        /* Increment this CPU's context switch count */
        rq->nr_switches++;
        /* The new process becomes this CPU's running process */
        rq->curr = next;
        ++*switch_count;
        /* The process context switch happens here */
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * The context switch have flipped the stack from under us
         * and restored the local variables which were saved when
         * this task called schedule() in the past. prev == current
         * is still correct, but it can be moved to another cpu/rq.
         */
        /* This task may have been moved to another CPU while it was switched out, so fetch the CPU and rq again */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    }
    else
        raw_spin_unlock_irq(&rq->lock); /* The next process to run is the current one: release the lock and do nothing else */
    /* Post-context-switch processing */
    post_schedule(rq);
    /* Re-enable preemption without rescheduling immediately */
    sched_preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}

The comments in __schedule() describe each step in detail. Choosing the next process is delegated to pick_next_task(), and the switch itself to context_switch(). Let's first see how pick_next_task() chooses the next process:

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
    const struct sched_class *class = &fair_sched_class;
    struct task_struct *p;
    /*
     * Optimization: we know that if all tasks are in
     * the fair class we can call that function directly:
     */
    if (likely(prev->sched_class == class && rq->nr_running == rq->cfs.h_nr_running)) {
        /* Every process is on the CFS run queue, so use the CFS scheduling class directly */
        p = fair_sched_class.pick_next_task(rq, prev);
        if (unlikely(p == RETRY_TASK))
            goto again;
        /* assumes fair_sched_class->next == idle_sched_class */
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev);
        return p;
    }
again:
    /* Other scheduling classes hold processes too: iterate from the highest-priority class to the lowest and pick the best process to run */
    for_each_class(class) {
        p = class->pick_next_task(rq, prev);
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }
    BUG(); /* the idle class will always have a runnable task */
}

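pick_next_task() fully embodies the notion of process priority. It first checks whether all processes are on the cfs queue. If they are not, there are processes with a higher priority than ordinary ones (real-time processes among them). The kernel orders the scheduling classes from highest to lowest priority. Selection starts at the highest-priority class and looks for a process to run; if there is none, it moves on to the class with the next priority. This is the for_each_class(class) loop in the listing, which expands to: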