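A Detailed Look at Resource Allocation in Kubernetes

Updated: 2023-04-23 11:35:01  Author: 俯仰之間

This article explains how resource allocation works in Kubernetes. It is meant as a reference for anyone who needs it, and hopefully it proves useful.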

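Overview

In k8s, kube-scheduler is the component that assigns Pods to available nodes. To do so, it needs to know what resources each Pod asks for and what each node can still provide, with CPU and memory being the most common resources. So where do these resource figures come from? Once a Pod has been placed on a node, how does the system constrain its resource usage so that it does not affect other Pods? And what happens when usage reaches the amount that was requested? This article works through these questions. After reading it, you will understand:

Where a node's resource usage figures come from when k8s schedules a Pod
Which parameters on the node enforce the CPU limit and request configured in k8s
What the ephemeral-storage resource is and what it is for
How to set the resource-management parameters in the kubelet configuration

Anyone who has used k8s knows that when writing a Deployment we usually set a limit and a request for CPU and memory. How exactly are these settings enforced on the node?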

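An nginx configuration

The CPU request and limit are 1 core and 4 cores, and the memory request and limit are 1Gi and 4Gi (1Gi = 1024Mi, whereas 1G = 1000M). Resource limits are implemented with cgroups, so how is this done for a Pod? Let's look at the node where the Pod runs.

On this node (which uses cgroup v1), the cgroup directory that k8s uses for CPU limits is /sys/fs/cgroup/cpu/kubepods.

Among its entries are two directories, besteffort and burstable, which relate to the Pod's QoS class.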

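QoS (Quality of Service) is a classification applied to Pods. When Kubernetes creates a Pod, it assigns the Pod one of the following QoS classes:

Guaranteed: every container in the Pod must have both memory and CPU limits and requests, and each request must equal its limit.
Burstable: at least one container in the Pod has a memory or CPU request or limit, and the Pod does not meet the requirements for Guaranteed, for example because a request and its limit differ.
BestEffort: no container in the Pod has any memory or CPU limit or request.

The class matters when a node comes under resource pressure: Pods are evicted according to their QoS class, with BestEffort Pods evicted first, then Burstable, and Guaranteed last.

By these rules, our nginx Pod, whose requests and limits differ, is Burstable.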
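The three rules above can be sketched as a small classifier. This is only an illustration of the rules as stated in this article, not the Kubernetes implementation: the `Resources` type and `qosClass` function are hypothetical names, values are plain integers where 0 means "not set", and API-side defaulting of requests is ignored.

```go
package main

import "fmt"

// Resources holds one container's settings (hypothetical type).
// CPU is in millicores, memory in bytes; 0 means the field is not set.
type Resources struct {
	CPURequest, CPULimit int64
	MemRequest, MemLimit int64
}

// qosClass applies the three rules from the text to a Pod's containers.
func qosClass(containers []Resources) string {
	anySet, allEqual := false, true
	for _, c := range containers {
		if c.CPURequest != 0 || c.CPULimit != 0 || c.MemRequest != 0 || c.MemLimit != 0 {
			anySet = true
		}
		// Guaranteed needs both limits set and every request equal to its limit.
		if c.CPULimit == 0 || c.MemLimit == 0 ||
			c.CPURequest != c.CPULimit || c.MemRequest != c.MemLimit {
			allEqual = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case allEqual:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	// The nginx example: CPU 1 core / 4 cores, memory 1Gi / 4Gi.
	nginx := []Resources{{CPURequest: 1000, CPULimit: 4000, MemRequest: 1 << 30, MemLimit: 4 << 30}}
	fmt.Println(qosClass(nginx))
}
```

Because the nginx request and limit differ, the sketch reports Burstable, matching the conclusion above.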

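Let's enter the burstable directory.

Each subdirectory there is the CPU cgroup of one Burstable Pod. Inside the directory for our Pod there are two further directories, one per container: one for nginx itself and one for the infra (sandbox) container.

Now let's enter the nginx container's directory.

Three files there deserve attention: cpu.shares, cpu.cfs_period_us and cpu.cfs_quota_us.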

cpu.shares

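cpu.shares sets a relative CPU weight that applies across all CPUs (cores). The default value is 1024, which is treated as equivalent to one CPU core. CPU shares can be thought of as dividing each core into 1024 slices and guaranteeing each process its proportional number of slices. If there are 1024 slices (that is, 1 core) and two processes both have cpu.shares set to 1024, each of them gets roughly half of the available CPU time.

Suppose the system has two cgroups, A and B, where A's shares value is 1024 and B's is 512. Then A gets 1024/(1024+512) ≈ 66% of the CPU and B gets about 33%. Shares have two properties:

If A is not busy and does not use its 66% of CPU time, the remaining CPU time is given to B, so B's CPU usage can exceed 33%.
If a new cgroup C is added with a shares value of 1024, A's portion becomes 1024/(1024+512+1024) = 40% and B's becomes 20%.

Two conclusions follow from these properties:

When the CPU is idle, shares have no effect; they only matter when the CPU is busy.

A shares value is just a bare number, so looking at one group's value in isolation is meaningless; it has to be compared with the values of the other cgroups to obtain that group's relative portion. On a machine running many containers the number of cgroups keeps changing, so this portion changes too. You may set a high value, but someone else may set a higher one, which is why this mechanism cannot control CPU usage precisely. The word "share" itself hints at this.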
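The arithmetic in the A/B/C example can be reproduced with a few lines of Go. This is a minimal sketch of the proportional rule only; `cpuFractions` is a hypothetical helper, and it models the fully contended case in which every cgroup wants as much CPU as it can get.

```go
package main

import "fmt"

// cpuFractions returns the fraction of CPU time each cgroup receives
// when all of them are busy: its own shares divided by the sum of all shares.
func cpuFractions(shares map[string]int) map[string]float64 {
	total := 0
	for _, s := range shares {
		total += s
	}
	out := make(map[string]float64, len(shares))
	if total == 0 {
		return out // nothing to divide
	}
	for name, s := range shares {
		out[name] = float64(s) / float64(total)
	}
	return out
}

func main() {
	two := cpuFractions(map[string]int{"A": 1024, "B": 512})
	fmt.Printf("A=%.1f%% B=%.1f%%\n", two["A"]*100, two["B"]*100)

	// Adding cgroup C changes A's and B's portions without touching their values.
	three := cpuFractions(map[string]int{"A": 1024, "B": 512, "C": 1024})
	fmt.Printf("A=%.1f%% B=%.1f%% C=%.1f%%\n", three["A"]*100, three["B"]*100, three["C"]*100)
}
```

The second call shows why a single shares value says nothing on its own: A keeps the value 1024 but its portion drops once C appears.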

cpu.shares corresponds to the resources.requests.cpu field in k8s. The relationship is: resources.requests.cpu (in cores) * 1024 = cpu.shares
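As a worked example of this mapping, the sketch below converts a CPU request expressed in millicores into a cpu.shares value. `sharesFromMilliCPU` is a hypothetical name, and the floor of 2 is an assumption added here so that a zero or tiny request still yields a usable weight; it is not taken from the article.

```go
package main

import "fmt"

// sharesFromMilliCPU converts a CPU request in millicores into a
// cpu.shares value using request * 1024, where 1000 millicores = 1 core.
func sharesFromMilliCPU(milliCPU int64) int64 {
	const (
		sharesPerCPU = 1024
		milliPerCPU  = 1000
		minShares    = 2 // assumed lower bound for this sketch
	)
	shares := milliCPU * sharesPerCPU / milliPerCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	for _, m := range []int64{1000, 500, 250} {
		fmt.Printf("request %dm -> cpu.shares %d\n", m, sharesFromMilliCPU(m))
	}
}
```

For the nginx container, a request of 1 core (1000m) maps to a cpu.shares value of 1024.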

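cpu.cfs_period_us, cpu.cfs_quota_us

cpu.cfs_period_us sets the length of the scheduling period, and cpu.cfs_quota_us sets how much CPU time the cgroup may use within one period. Together the two files set an upper bound on CPU usage. Both are in microseconds (us). cfs_period_us may range from 1 millisecond (ms) to 1 second (s), and cfs_quota_us only needs to be at least 1ms. A cfs_quota_us of -1 (the default) means CPU time is not limited.

cpu.cfs_period_us and cpu.cfs_quota_us correspond to the resources.limits.cpu field in k8s: resources.limits.cpu = cpu.cfs_quota_us / cpu.cfs_period_us

For the nginx container above, this ratio is exactly 4, meaning nginx can use at most 4 CPUs. Even when the system is idle and the shares described above are not in play, the container is never given more than 4 CPUs. That is what the upper limit means.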
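The ratio can also be read the other way round, deriving the quota from the limit. The sketch below does that under one assumption: a period of 100000us (100ms), which is the usual kernel default. The actual values on the node were shown in a screenshot that is not reproduced here, and `quotaFromMilliCPU` is a hypothetical name.

```go
package main

import "fmt"

// quotaFromMilliCPU returns the cpu.cfs_quota_us value for a CPU limit
// given in millicores and a period in microseconds.
// A limit of 0 means "no limit", which is written as -1.
func quotaFromMilliCPU(milliCPU, periodUs int64) int64 {
	if milliCPU == 0 {
		return -1
	}
	return milliCPU * periodUs / 1000
}

func main() {
	const periodUs = 100000 // assumed period: 100ms

	quota := quotaFromMilliCPU(4000, periodUs) // limit of 4 cores
	fmt.Println("cpu.cfs_quota_us:", quota)
	fmt.Println("quota / period:", quota/periodUs)
}
```

With a 4-core limit and the assumed period, the quota is 400000us per 100000us period, which gives the ratio of 4 described above.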

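In day-to-day configuration, it is best not to let limit and request differ too much, otherwise the node's CPU is easily oversold. With limit/request = 4, for example, the node appears to have spare resources at scheduling time, but once the Pod is running it may use far more CPU than it requested, and other Pods on the node can then be starved of CPU, which the business sees as slow responses. CPU is a compressible resource and memory is an incompressible one. When a compressible resource runs short, Pods are starved but do not exit; when an incompressible resource runs short, Pods are killed by the kernel through OOM.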

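Where the resource usage data comes from

To answer this we have to start from the source code. First, let's see what kube-scheduler does about resources when scheduling. kube-scheduler uses an informer to watch for changes to the nodes in the cluster. When something changes (a Node's status, a Node's resources, and so on), an event handler is called to write the change into the local index store (cache). The code is as follows: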

func addAllEventHandlers(/* parameters elided */) {
  // ...
  // Register handlers that keep the scheduler's node cache in sync.
  informerFactory.Core().V1().Nodes().Informer().AddEventHandler(
    cache.ResourceEventHandlerFuncs{
      AddFunc:    sched.addNodeToCache,
      UpdateFunc: sched.updateNodeInCache,
      DeleteFunc: sched.deleteNodeFromCache,
    },
  )
  // ...
}

As shown above, when a new node joins the cluster, the addNodeToCache function is called to add the Node's information to the local cache. Let's look at addNodeToCache:

func (sched *Scheduler) addNodeToCache(obj interface{}) {
  node, ok := obj.(*v1.Node)
  if !ok {
    klog.ErrorS(nil, "Cannot convert to *v1.Node", "obj", obj)
    return
  }
  nodeInfo := sched.Cache.AddNode(node)
  klog.V(3).InfoS("Add event for node", "node", klog.KObj(node))
  sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(queue.NodeAdd, preCheckForNode(nodeInfo))
}

The code above shows that after the node is added to the cache, MoveAllToActiveOrBackoffQueue is also called to run a precheck on the Pods in unschedulablePods (the queue of Pods that could not be scheduled so far). The MoveAllToActiveOrBackoffQueue function is as follows:

func (p *PriorityQueue) MoveAllToActiveOrBackoffQueue(event framework.ClusterEvent, preCheck PreEnqueueCheck) {
  p.lock.Lock()
  defer p.lock.Unlock()
  unschedulablePods := make([]*framework.QueuedPodInfo, 0, len(p.unschedulablePods.podInfoMap))
  for _, pInfo := range p.unschedulablePods.podInfoMap {
    if preCheck == nil || preCheck(pInfo.Pod) {
      unschedulablePods = append(unschedulablePods, pInfo)
    }
  }
  p.movePodsToActiveOrBackoffQueue(unschedulablePods, event)
}

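A Pod that passes this precheck is moved to the appropriate queue to wait for the next scheduling attempt. Now for the key point: this article is about resources, so what exactly does the precheck verify?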

func preCheckForNode(nodeInfo *framework.NodeInfo) queue.PreEnqueueCheck {
  // Note: the following checks doesn't take preemption into considerations, in very rare
  // cases (e.g., node resizing), "pod" may still fail a check but preemption helps. We deliberately
  // chose to ignore those cases as unschedulable pods will be re-queued eventually.
  return func(pod *v1.Pod) bool {
    admissionResults := AdmissionCheck(pod, nodeInfo, false)
    if len(admissionResults) != 0 {
      return false
    }
    _, isUntolerated := corev1helpers.FindMatchingUntoleratedTaint(nodeInfo.Node().Spec.Taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
      return t.Effect == v1.TaintEffectNoSchedule
    })
    return !isUntolerated
  }
}
func AdmissionCheck(pod *v1.Pod, nodeInfo *framework.NodeInfo, includeAllFailures bool) []AdmissionResult {
  var admissionResults []AdmissionResult
  insufficientResources := noderesources.Fits(pod, nodeInfo)
  if len(insufficientResources) != 0 {
    for i := range insufficientResources {
      admissionResults = append(admissionResults, AdmissionResult{InsufficientResource: &insufficientResources[i]})
    }
    if !includeAllFailures {
      return admissionResults
    }
  }
  if matches, _ := corev1nodeaffinity.GetRequiredNodeAffinity(pod).Match(nodeInfo.Node()); !matches {
    admissionResults = append(admissionResults, AdmissionResult{Name: nodeaffinity.Name, Reason: nodeaffinity.ErrReasonPod})
    if !includeAllFailures {
      return admissionResults
    }
  }
  if !nodename.Fits(pod, nodeInfo) {
    admissionResults = append(admissionResults, AdmissionResult{Name: nodename.Name, Reason: nodename.ErrReason})
    if !includeAllFailures {
      return admissionResults
    }
  }
  if !nodeports.Fits(pod, nodeInfo) {
    admissionResults = append(admissionResults, AdmissionResult{Name: nodeports.Name, Reason: nodeports.ErrReason})
    if !includeAllFailures {
      return admissionResults
    }
  }
  return admissionResults
}

preCheckForNode calls AdmissionCheck, which performs a resource check, a node affinity check, a nodeName check and a port check in turn. Here we only care about the resource check:

func Fits(pod *v1.Pod, nodeInfo *framework.NodeInfo) []InsufficientResource {
  return fitsRequest(computePodResourceRequest(pod), nodeInfo, nil, nil)
}
func fitsRequest(podRequest *preFilterState, nodeInfo *framework.NodeInfo, ignoredExtendedResources, ignoredResourceGroups sets.String) []InsufficientResource {
  insufficientResources := make([]InsufficientResource, 0, 4)
  allowedPodNumber := nodeInfo.Allocatable.AllowedPodNumber
  if len(nodeInfo.Pods)+1 > allowedPodNumber {
    insufficientResources = append(insufficientResources, InsufficientResource{
      ResourceName: v1.ResourcePods,
      Reason:       "Too many pods",
      Requested:    1,
      Used:         int64(len(nodeInfo.Pods)),
      Capacity:     int64(allowedPodNumber),
    })
  }
  if podRequest.MilliCPU == 0 &&
    podRequest.Memory == 0 &&
    podRequest.EphemeralStorage == 0 &&
    len(podRequest.ScalarResources) == 0 {
    return insufficientResources
  }
  if podRequest.MilliCPU > (nodeInfo.Allocatable.MilliCPU - nodeInfo.Requested.MilliCPU) {
    insufficientResources = append(insufficientResources, InsufficientResource{
      ResourceName: v1.ResourceCPU,
      Reason:       "Insufficient cpu",
      Requested:    podRequest.MilliCPU,
      Used:         nodeInfo.Requested.MilliCPU,
      Capacity:     nodeInfo.Allocatable.MilliCPU,
    })
  }
  if podRequest.Memory > (nodeInfo.Allocatable.Memory - nodeInfo.Requested.Memory) {
    insufficientResources = append(insufficientResources, InsufficientResource{
      ResourceName: v1.ResourceMemory,
      Reason:       "Insufficient memory",
      Requested:    podRequest.Memory,
      Used:         nodeInfo.Requested.Memory,
      Capacity:     nodeInfo.Allocatable.Memory,
    })
  }
  if podRequest.EphemeralStorage > (nodeInfo.Allocatable.EphemeralStorage - nodeInfo.Requested.EphemeralStorage) {
    insufficientResources = append(insufficientResources, InsufficientResource{
      ResourceName: v1.ResourceEphemeralStorage,
      Reason:       "Insufficient ephemeral-storage",
      Requested:    podRequest.EphemeralStorage,
      Used:         nodeInfo.Requested.EphemeralStorage,
      Capacity:     nodeInfo.Allocatable.EphemeralStorage,
    })
  }
  for rName, rQuant := range podRequest.ScalarResources {
    if v1helper.IsExtendedResourceName(rName) {
      // If this resource is one of the extended resources that should be ignored, we will skip checking it.
      // rName is guaranteed to have a slash due to API validation.
      var rNamePrefix string
      if ignoredResourceGroups.Len() > 0 {
        rNamePrefix = strings.Split(string(rName), "/")[0]
      }
      if ignoredExtendedResources.Has(string(rName)) || ignoredResourceGroups.Has(rNamePrefix) {
        continue
      }
    }
    if rQuant > (nodeInfo.Allocatable.ScalarResources[rName] - nodeInfo.Requested.ScalarResources[rName]) {
      insufficientResources = append(insufficientResources, InsufficientResource{
        ResourceName: rName,
        Reason:       fmt.Sprintf("Insufficient %v", rName),
        Requested:    podRequest.ScalarResources[rName],
        Used:         nodeInfo.Requested.ScalarResources[rName],
        Capacity:     nodeInfo.Allocatable.ScalarResources[rName],
      })
    }
  }
  return insufficientResources
}

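Fits first calls computePodResourceRequest to work out how many resources the Pod needs, and fitsRequest then compares that with what the node can still allocate. If the node can still provide the resources, the resource check passes. If every item in the precheck passes, the Pod is put into the active queue, from which kube-scheduler takes Pods to schedule. Where does the node's resource information come from? The kubelet reports to kube-apiserver periodically (or whenever the node changes), and because kube-scheduler watches Node objects, it sees the node's Allocatable resources. The Requested side of the comparison is the sum of the requests of the Pods already assigned to the node, which the scheduler tracks in its cache.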
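The comparison at the heart of fitsRequest can be reduced to a self-contained sketch. The `Resource` type and `insufficient` function below are hypothetical simplifications covering only CPU and memory; they are not the scheduler's own types, and the node figures in `main` are made up for illustration.

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource structs.
type Resource struct {
	MilliCPU int64 // CPU in millicores
	Memory   int64 // memory in bytes
}

// insufficient reports which resources the node cannot provide.
// The free amount is allocatable minus what is already requested,
// mirroring the comparisons in fitsRequest.
func insufficient(pod, allocatable, requested Resource) []string {
	var reasons []string
	if pod.MilliCPU > allocatable.MilliCPU-requested.MilliCPU {
		reasons = append(reasons, "Insufficient cpu")
	}
	if pod.Memory > allocatable.Memory-requested.Memory {
		reasons = append(reasons, "Insufficient memory")
	}
	return reasons
}

func main() {
	const gi = int64(1) << 30
	node := Resource{MilliCPU: 63000, Memory: 245 * gi} // illustrative node
	used := Resource{MilliCPU: 62500, Memory: 100 * gi} // requests already placed
	pod := Resource{MilliCPU: 1000, Memory: 1 * gi}     // the nginx requests

	fmt.Println(insufficient(pod, node, used))
}
```

Note that the check uses requests, not limits and not live usage: the Pod above is rejected for CPU because only 500m remain unrequested, regardless of how busy the node actually is.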

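After kube-scheduler takes a Pod from the queue, it runs a series of further checks (such as PreFilter), which again include resource checks based on the same node information.

When we describe a node, the output includes its allocatable resources. Is this the amount of resources free right now? The answer is no. Let's take a look: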

kubectl describe node xxxx
Capacity:
 cpu:                64
 ephemeral-storage:  1056889268Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             263042696Ki
 pods:               110
Allocatable:
 cpu:                63
 ephemeral-storage:  1044306356Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             257799816Ki
 pods:               110

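Note: this machine has 64 CPU cores and 256 GB of memory.

Capacity is the total amount of resources the machine's hardware can provide, and Allocatable is the amount that can be allocated to Pods. Why does Allocatable differ from Capacity? This is where reserved resources and eviction come in.

Here is the relevant kubelet configuration: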

systemReserved:
  cpu: "0.5"
  ephemeral-storage: 1Gi
  memory: 2Gi
  pid: "1000"
kubeReserved:
  cpu: "0.5"
  ephemeral-storage: 1Gi
  memory: 2Gi
  pid: "1000"
evictionHard:
  imagefs.available: 10Gi
  memory.available: 1Gi
  nodefs.available: 10Gi
  nodefs.inodesFree: 5%

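systemReserved and kubeReserved are the resources reserved for the operating system and for the Kubernetes components respectively; the kubelet subtracts them when reporting allocatable resources. evictionHard is the threshold at which Pod eviction starts: once only this much of a resource is left, Pods are evicted, so this amount cannot be counted as allocatable either. Adding it all up, Capacity minus the sum of these three is exactly the Allocatable value, that is, the resources this node can actually allocate: Allocatable = Capacity - systemReserved - kubeReserved - evictionHard.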
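The claim can be checked against the numbers in the describe output and the kubelet configuration above. The sketch below only redoes that subtraction; `allocatable` is a hypothetical helper, CPU is expressed in millicores and memory and storage in Ki, and nodefs.available is taken as the eviction threshold that applies to ephemeral-storage.

```go
package main

import "fmt"

const gi = 1024 * 1024 // one Gi expressed in Ki

// allocatable subtracts both reservations and the hard eviction
// threshold from capacity.
func allocatable(capacity, systemReserved, kubeReserved, evictionHard int64) int64 {
	return capacity - systemReserved - kubeReserved - evictionHard
}

func main() {
	// cpu: 64 cores, 0.5 reserved twice, no eviction threshold for CPU.
	fmt.Println("cpu (millicores):", allocatable(64000, 500, 500, 0))
	// memory: 2Gi + 2Gi reserved, memory.available threshold of 1Gi.
	fmt.Println("memory (Ki):", allocatable(263042696, 2*gi, 2*gi, 1*gi))
	// ephemeral-storage: 1Gi + 1Gi reserved, nodefs.available threshold of 10Gi.
	fmt.Println("ephemeral-storage (Ki):", allocatable(1056889268, 1*gi, 1*gi, 10*gi))
}
```

The three results are 63000 millicores, 257799816Ki and 1044306356Ki, which match the Allocatable section of the describe output.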

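There is one point worth noting here. The Allocatable value shown by describe is a static value: it is the total amount of resources the node can allocate, not the amount it can still allocate at this moment. To decide whether a node can take a Pod, kube-scheduler works from dynamically maintained data: the Allocatable value reported by the kubelet minus the requests of the Pods already assigned to the node.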

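Also note that for the systemReserved and kubeReserved reservations to be enforced on the node through cgroups, the following settings are needed as well:

# Lists which reservations the kubelet enforces through cgroups
# The default is pods, which only caps the total resources used by Pods
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
# If you use systemd as the cgroup driver, you also need the settings below,
# otherwise the kubelet fails to start because it cannot find the cgroup directories
# Kubernetes officially recommends using systemd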
kubeReservedCgroup: /kubelet.slice
systemReservedCgroup: /system.slice/kubelet.service

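That concludes our brief look at how the scheduler evaluates resources when scheduling. What we covered is only the step where a Pod is moved into a schedulable queue; the PreFilter, Filter, Score and later steps follow, but they evaluate resources in the same way as above.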
