This article explains how Kubernetes preempts resources for Critical Pods, walking through the relevant kubelet and admission-controller source code.
Resource preemption for Critical Pods during kubelet Predicate Admit
During its Predicate Admit flow, the kubelet runs a series of predicate checks on incoming Pods, including GeneralPredicates, which verifies that the node has enough CPU, memory, and GPU resources. If GeneralPredicates fails, a non-critical Pod is simply rejected; a Critical Pod, however, triggers kubelet preemption, which kills some running Pods according to a set of rules in order to free resources. If the preemption succeeds, the Pod is admitted.
The flow starts with kubelet initialization.
pkg/kubelet/kubelet.go:315

```go
// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(...) (*Kubelet, error) {
	...
	criticalPodAdmissionHandler := preemption.NewCriticalPodAdmissionHandler(klet.GetActivePods, killPodNow(klet.podWorkers, kubeDeps.Recorder), kubeDeps.Recorder)
	klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(klet.getNodeAnyWay, criticalPodAdmissionHandler, klet.containerManager.UpdatePluginResources))

	// apply functional Option's
	for _, opt := range kubeDeps.Options {
		opt(klet)
	}
	...
	return klet, nil
}
```
When NewMainKubelet initializes the kubelet, it registers criticalPodAdmissionHandler via AddPodAdmitHandler; this handler is where the special Admit treatment of Critical Pods lives.
Next, let's step into the kubelet's predicateAdmitHandler and look at what happens after GeneralPredicates fails.
pkg/kubelet/lifecycle/predicate.go:58

```go
func (w *predicateAdmitHandler) Admit(attrs *PodAdmitAttributes) PodAdmitResult {
	...
	fit, reasons, err := predicates.GeneralPredicates(podWithoutMissingExtendedResources, nil, nodeInfo)
	if err != nil {
		message := fmt.Sprintf("GeneralPredicates failed due to %v, which is unexpected.", err)
		glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
		return PodAdmitResult{
			Admit:   fit,
			Reason:  "UnexpectedAdmissionError",
			Message: message,
		}
	}
	if !fit {
		fit, reasons, err = w.admissionFailureHandler.HandleAdmissionFailure(pod, reasons)
		if err != nil {
			message := fmt.Sprintf("Unexpected error while attempting to recover from admission failure: %v", err)
			glog.Warningf("Failed to admit pod %v - %s", format.Pod(pod), message)
			return PodAdmitResult{
				Admit:   fit,
				Reason:  "UnexpectedAdmissionError",
				Message: message,
			}
		}
	}
	...
	return PodAdmitResult{
		Admit: true,
	}
}
```
When GeneralPredicates in the kubelet's predicateAdmitHandler finds that the node lacks enough CPU, memory, or GPU and the Pod therefore fails Admit, HandleAdmissionFailure is called for further handling. As mentioned earlier, the kubelet registered criticalPodAdmissionHandler at initialization, so its HandleAdmissionFailure is what runs here.
The CriticalPodAdmissionHandler struct is defined as follows:
pkg/kubelet/preemption/preemption.go:41

```go
type CriticalPodAdmissionHandler struct {
	getPodsFunc eviction.ActivePodsFunc
	killPodFunc eviction.KillPodFunc
	recorder    record.EventRecorder
}
```
CriticalPodAdmissionHandler's HandleAdmissionFailure method is where the Critical-Pod-specific logic lives.
pkg/kubelet/preemption/preemption.go:66

```go
// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(pod *v1.Pod, failureReasons []algorithm.PredicateFailureReason) (bool, []algorithm.PredicateFailureReason, error) {
	if !kubetypes.IsCriticalPod(pod) || !utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) {
		return false, failureReasons, nil
	}
	// InsufficientResourceError is not a reason to reject a critical pod.
	// Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
	nonResourceReasons := []algorithm.PredicateFailureReason{}
	resourceReasons := []*admissionRequirement{}
	for _, reason := range failureReasons {
		if r, ok := reason.(*predicates.InsufficientResourceError); ok {
			resourceReasons = append(resourceReasons, &admissionRequirement{
				resourceName: r.ResourceName,
				quantity:     r.GetInsufficientAmount(),
			})
		} else {
			nonResourceReasons = append(nonResourceReasons, reason)
		}
	}
	if len(nonResourceReasons) > 0 {
		// Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
		return false, nonResourceReasons, nil
	}
	err := c.evictPodsToFreeRequests(admissionRequirementList(resourceReasons))
	// if no error is returned, preemption succeeded and the pod is safe to admit.
	return err == nil, nil, err
}
```
If the Pod is not a Critical Pod, or the ExperimentalCriticalPodAnnotation feature gate is disabled, the method returns false right away, meaning Admit fails.
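For reference, in the era this walkthrough is based on, kubetypes.IsCriticalPod boils down to a namespace-plus-annotation check (the same two conditions the Priority admission controller re-checks later in this article). The following is a simplified, self-contained sketch of that check under those assumptions, not the verbatim kubelet source:

```go
package main

import "fmt"

// criticalPodAnnotationKey is the alpha annotation used to mark critical pods.
const criticalPodAnnotationKey = "scheduler.alpha.kubernetes.io/critical-pod"

// isCriticalPod is a simplified rendering of the check performed by
// kubetypes.IsCriticalPod here: the pod must live in kube-system and carry
// the critical-pod annotation with an empty value.
func isCriticalPod(namespace string, annotations map[string]string) bool {
	if namespace != "kube-system" {
		return false
	}
	val, ok := annotations[criticalPodAnnotationKey]
	return ok && val == ""
}

func main() {
	fmt.Println(isCriticalPod("kube-system", map[string]string{criticalPodAnnotationKey: ""})) // true
	fmt.Println(isCriticalPod("default", map[string]string{criticalPodAnnotationKey: ""}))     // false
}
```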
It then checks whether the Admit failureReasons contain a predicates.InsufficientResourceError. If they do (and no non-resource reasons are present), evictPodsToFreeRequests is called to trigger kubelet preemption. Note that this preemption is different from scheduler preemption; don't confuse the two.
evictPodsToFreeRequests is the implementation of kubelet preemption; its core is calling getPodsToPreempt to pick a suitable set of Pods to kill (podsToPreempt) and then killing them to free the required resources.
pkg/kubelet/preemption/preemption.go:121

```go
// getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
func getPodsToPreempt(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
	bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pods)

	// make sure that pods exist to reclaim the requirements
	unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
	if len(unableToMeetRequirements) > 0 {
		return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
	}
	// find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
	guarateedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
	if err != nil {
		return nil, err
	}
	// Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
	burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guarateedToEvict...)...))
	if err != nil {
		return nil, err
	}
	// Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
	bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guarateedToEvict...)...))
	if err != nil {
		return nil, err
	}
	return append(append(bestEffortToEvict, burstableToEvict...), guarateedToEvict...), nil
}
```
The logic kubelet preemption uses to pick the Pods to kill is as follows:
If the required quantity of some resource exceeds the total that could be reclaimed by evicting every bestEffort, burstable, and guaranteed Pod combined, getPodsToPreempt returns no podsToPreempt and an error, meaning there is no suitable set of Pods to evict and preemption fails.
If evicting all bestEffortPods and burstablePods would still not free enough resources, guaranteedPods are also selected (guarateedToEvict). The selection rules are:
Rule 1: the fewer Pods evicted, the better;
Rule 2: the fewer resources freed, the better;
Rule 1 takes priority over Rule 2.
If evicting all bestEffortPods plus guarateedToEvict would still not free enough, burstablePods are also selected (burstableToEvict), using the same rules.
If evicting burstableToEvict plus guarateedToEvict would still not free enough, bestEffortPods are also selected (bestEffortToEvict), using the same rules.
In other words, the lower a Pod's resource QoS class, the earlier it is preempted; within the same QoS level, Pods are chosen by the same two rules: evict as few Pods as possible and, secondarily, free as few resources as possible. A simplified sketch of this selection heuristic follows.
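The per-QoS-level selection is done by getPodsToPreemptByDistance, which is not quoted above. Below is a simplified, self-contained sketch of that kind of greedy selection; podRequest, shortfall, and pickVictims are made-up names, and the fractional "shortfall" stands in for the real distance metric. It illustrates the two rules rather than reproducing the actual implementation:

```go
package main

import "fmt"

// podRequest is a simplified stand-in for a pod's resource requests;
// the real code works on *v1.Pod and an admissionRequirementList.
type podRequest struct {
	name   string
	cpu    int64 // requested millicores
	memory int64 // requested bytes
}

// shortfall estimates how much of the outstanding requirement would remain
// unmet after evicting p, as a fraction of each resource (a rough analogue
// of the "distance" used by getPodsToPreemptByDistance).
func shortfall(needCPU, needMem int64, p podRequest) float64 {
	d := 0.0
	if needCPU > 0 && p.cpu < needCPU {
		d += float64(needCPU-p.cpu) / float64(needCPU)
	}
	if needMem > 0 && p.memory < needMem {
		d += float64(needMem-p.memory) / float64(needMem)
	}
	return d
}

// pickVictims greedily picks pods until the requirement is covered: each
// round it takes the pod that leaves the smallest shortfall (rule 1: evict
// as few pods as possible) and breaks ties on the smaller request (rule 2:
// free as little as possible). The real code compares resources
// individually; the tie-break here is simplified to CPU only.
func pickVictims(pods []podRequest, needCPU, needMem int64) []podRequest {
	victims := []podRequest{}
	for needCPU > 0 || needMem > 0 {
		if len(pods) == 0 {
			return nil // not enough reclaimable resources; preemption fails
		}
		best := 0
		for i := 1; i < len(pods); i++ {
			di, db := shortfall(needCPU, needMem, pods[i]), shortfall(needCPU, needMem, pods[best])
			if di < db || (di == db && pods[i].cpu < pods[best].cpu) {
				best = i
			}
		}
		v := pods[best]
		victims = append(victims, v)
		needCPU -= v.cpu
		needMem -= v.memory
		pods = append(pods[:best], pods[best+1:]...)
	}
	return victims
}

func main() {
	// Three running pods of the same QoS class, and a critical pod that
	// still needs 300m CPU and 250Mi of memory.
	pods := []podRequest{
		{name: "a", cpu: 100, memory: 200 << 20},
		{name: "b", cpu: 500, memory: 100 << 20},
		{name: "c", cpu: 250, memory: 300 << 20},
	}
	for _, v := range pickVictims(pods, 300, 250<<20) {
		fmt.Println("evict", v.name)
	}
	// Prints "evict c" then "evict a": c comes closest to covering the need
	// on its own, and a then covers the remaining CPU while freeing the least.
}
```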
Special handling of Critical Pods by the Priority Admission Controller
Let's first look at the special, system-reserved classes of Critical Pods:
ClusterCriticalPod: a Pod whose PriorityClass name is system-cluster-critical.
NodeCriticalPod: a Pod whose PriorityClass name is system-node-critical.
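For context, both classes resolve to priority values far above anything a user-defined PriorityClass may use. The constants below are an assumption about the typical values of this era (nominally defined in the scheduling API group) rather than a quote from the source; verify them against your release:

```go
package main

import "fmt"

// Assumed built-in priority values; check the scheduling API of your release.
const (
	highestUserDefinablePriority = int32(1000000000)                // ceiling for user-defined PriorityClasses
	systemClusterCritical        = 2 * highestUserDefinablePriority // value of system-cluster-critical
	systemNodeCritical           = systemClusterCritical + 1000     // value of system-node-critical
)

func main() {
	fmt.Println(systemClusterCritical, systemNodeCritical) // 2000000000 2000001000
}
```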
If the Priority admission controller is enabled among the admission controllers, the Priority check performed at Pod creation time also contains special handling for Critical Pods.
The Priority admission controller's main job is to resolve the PriorityClassName specified on a Pod into the corresponding Spec.Priority value.
plugin/pkg/admission/priority/admission.go:138

```go
// admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists if it is provided and resolves the pod priority from the PriorityClassName.
func (p *priorityPlugin) admitPod(a admission.Attributes) error {
	operation := a.GetOperation()
	pod, ok := a.GetObject().(*api.Pod)
	if !ok {
		return errors.NewBadRequest("resource was marked with kind Pod but was unable to be converted")
	}
	// Make sure that the client has not set `priority` at the time of pod creation.
	if operation == admission.Create && pod.Spec.Priority != nil {
		return admission.NewForbidden(a, fmt.Errorf("the integer value of priority must not be provided in pod spec. Priority admission controller populates the value from the given PriorityClass name"))
	}
	if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
		var priority int32
		// TODO: @ravig - This is for backwards compatibility to ensure that critical pods with annotations just work fine.
		// Remove when no longer needed.
		if len(pod.Spec.PriorityClassName) == 0 &&
			utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
			kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
			pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
		}
		if len(pod.Spec.PriorityClassName) == 0 {
			var err error
			priority, err = p.getDefaultPriority()
			if err != nil {
				return fmt.Errorf("failed to get default priority class: %v", err)
			}
		} else {
			// Try resolving the priority class name.
			pc, err := p.lister.Get(pod.Spec.PriorityClassName)
			if err != nil {
				if errors.IsNotFound(err) {
					return admission.NewForbidden(a, fmt.Errorf("no PriorityClass with name %v was found", pod.Spec.PriorityClassName))
				}
				return fmt.Errorf("failed to get PriorityClass with name %s: %v", pod.Spec.PriorityClassName, err)
			}
			priority = pc.Value
		}
		pod.Spec.Priority = &priority
	}
	return nil
}
```
When all of the following conditions are met, the Pod's Spec.PriorityClassName is set to system-cluster-critical, i.e. the Pod is treated as a ClusterCriticalPod:
The ExperimentalCriticalPodAnnotation and PodPriority feature gates are both enabled;
The Pod does not specify a PriorityClassName;
The Pod belongs to the kube-system namespace;
The Pod carries the scheduler.alpha.kubernetes.io/critical-pod="" annotation.
That covers how Kubernetes handles resource preemption for Critical Pods. Thanks for reading.