IoT Edge Clusters: Further Configuration of Kubernetes Events Alert Notifications
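Goals
Following up on the previous article, "IoT Edge Clusters: Implementing Alert Notifications Based on Kubernetes Events", the goals are:
- Alert recovery notifications: assessed as not feasible.
  Reason: an alert and its recovery are separate, completely unrelated events. Alerts are at the Warning level and recoveries are at the Normal level. Enabling recovery notifications would mean sending every Normal event, and the volume would be overwhelming. In addition, without considerable experience and patience it is impossible to tell which Normal event corresponds to the recovery of a given alert.
- Continuous alerting while an issue remains unresolved: available by default, no extra configuration needed.
- Alert content shows resource names, such as the node name and pod name.
- Specific nodes and workloads can be muted, and the muting can be adjusted dynamically. For example, node worker-1 in cluster 001 undergoes planned maintenance; monitoring stops during the maintenance and resumes once it is complete.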
Configuration
Showing resource names in the alert content
Some typical kinds of events:
apiVersion: v1
count: 101557
eventTime: null
firstTimestamp: "2022-04-08T03:50:47Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{prometheus}
  kind: Pod
  name: prometheus-rancher-monitoring-prometheus-0
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:39:19Z"
message: 'Readiness probe failed: Get "http://10.42.0.87:9090/-/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-08T03:51:17Z"
  name: prometheus-rancher-monitoring-prometheus-0.16e3cf53f0793344
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning
---
apiVersion: v1
count: 116
eventTime: null
firstTimestamp: "2022-04-13T02:43:26Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{grafana}
  kind: Pod
  name: rancher-monitoring-grafana-57777cc795-2b2x5
  namespace: cattle-monitoring-system
kind: Event
lastTimestamp: "2022-04-14T11:18:56Z"
message: 'Readiness probe failed: Get "http://10.42.0.90:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
metadata:
  creationTimestamp: "2022-04-14T11:18:57Z"
  name: rancher-monitoring-grafana-57777cc795-2b2x5.16e5548dd2523a13
  namespace: cattle-monitoring-system
reason: Unhealthy
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: master-1
type: Warning
---
apiVersion: v1
count: 20958
eventTime: null
firstTimestamp: "2022-04-11T10:34:51Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-1883}
  kind: Pod
  name: svclb-emqx-dt22t
  namespace: emqx
kind: Event
lastTimestamp: "2022-04-14T11:39:48Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:51Z"
  name: svclb-emqx-dt22t.16e4d11e2b9efd27
  namespace: emqx
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning
---
apiVersion: v1
count: 21069
eventTime: null
firstTimestamp: "2022-04-11T10:34:48Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{lb-port-80}
  kind: Pod
  name: svclb-traefik-r5p8t
  namespace: kube-system
kind: Event
lastTimestamp: "2022-04-14T11:44:59Z"
message: Back-off restarting failed container
metadata:
  creationTimestamp: "2022-04-11T10:34:48Z"
  name: svclb-traefik-r5p8t.16e4d11daf0b79ce
  namespace: kube-system
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: worker-1
type: Warning
{
  "metadata": {
    "name": "event-exporter-79544df9f7-xj4t5.16e5c540dc32614f",
    "namespace": "monitoring",
    "uid": "baf2f642-2383-4e22-87e0-456b6c3eaf4e",
    "resourceVersion": "14043444",
    "creationTimestamp": "2022-04-14T13:08:40Z"
  },
  "reason": "Pulled",
  "message": "Container image \"ghcr.io/opsgenie/kubernetes-event-exporter:v0.11\" already present on machine",
  "source": {
    "component": "kubelet",
    "host": "worker-2"
  },
  "firstTimestamp": "2022-04-14T13:08:40Z",
  "lastTimestamp": "2022-04-14T13:08:40Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": "",
  "involvedObject": {
    "kind": "Pod",
    "namespace": "monitoring",
    "name": "event-exporter-79544df9f7-xj4t5",
    "uid": "b77d3e13-fa9e-484b-8a5a-d1afc9edec75",
    "apiVersion": "v1",
    "resourceVersion": "14043435",
    "fieldPath": "spec.containers{event-exporter}",
    "labels": {
      "app": "event-exporter",
      "pod-template-hash": "79544df9f7",
      "version": "v1"
    }
  }
}
More of these fields can be added to the alert message, including:
- Node: {{ .Source.Host }}
- Pod: {{ .InvolvedObject.Name }}
Putting this together, the modified event-exporter-cfg YAML is as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
  resourceVersion: '5779968'
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx test K3S cluster alert
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:** {{ .UID }}\n**EventNamespace:** {{ .InvolvedObject.Namespace }}\n**EventName:** {{ .InvolvedObject.Name }}\n**EventType:** {{ .Type }}\n**EventKind:** {{ .InvolvedObject.Kind }}\n**EventReason:** {{ .Reason }}\n**EventTime:** {{ .LastTimestamp }}\n**EventMessage:** {{ .Message }}\n**EventComponent:** {{ .Source.Component }}\n**EventHost:** {{ .Source.Host }}\n**EventLabels:** {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{ toJson .InvolvedObject.Annotations}}"
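To preview what the card layout above produces for a given event, the template fields can be mapped onto an event's JSON keys by hand. The following is only a rough Python stand-in, not how the exporter works internally (the exporter renders the templates itself); the function names render_content and build_card are hypothetical, the sample event is trimmed from the BackOff event shown earlier, and nothing is sent to the webhook.

```python
import json


def render_content(event):
    """Build the lark_md text, mirroring a subset of the template fields."""
    obj = event.get("involvedObject", {})
    src = event.get("source", {})
    fields = [
        ("EventNamespace", obj.get("namespace", "")),
        ("EventName", obj.get("name", "")),
        ("EventType", event.get("type", "")),
        ("EventKind", obj.get("kind", "")),
        ("EventReason", event.get("reason", "")),
        ("EventTime", event.get("lastTimestamp", "")),
        ("EventMessage", event.get("message", "")),
        ("EventComponent", src.get("component", "")),
        ("EventHost", src.get("host", "")),
    ]
    return "\n".join(f"**{key}:** {value}" for key, value in fields)


def build_card(event, title):
    """Assemble a dict with the same shape as the webhook layout."""
    return {
        "msg_type": "interactive",
        "card": {
            "config": {"wide_screen_mode": True, "enable_forward": True},
            "header": {
                "title": {"tag": "plain_text", "content": title},
                "template": "red",
            },
            "elements": [
                {
                    "tag": "div",
                    "text": {"tag": "lark_md", "content": render_content(event)},
                }
            ],
        },
    }


# Sample event, trimmed from the svclb-emqx BackOff event above.
sample_event = {
    "type": "Warning",
    "reason": "BackOff",
    "message": "Back-off restarting failed container",
    "lastTimestamp": "2022-04-14T11:39:48Z",
    "source": {"component": "kubelet", "host": "worker-1"},
    "involvedObject": {
        "kind": "Pod",
        "namespace": "emqx",
        "name": "svclb-emqx-dt22t",
    },
}

card = build_card(sample_event, "xxx test K3S cluster alert")
print(json.dumps(card, indent=2))
```

The rendered text contains the node (EventHost) and the pod (EventName), which are the two resource names this section set out to show.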
Muting specific nodes and workloads
For example, node worker-1 in cluster 001 undergoes planned maintenance; monitoring stops during the maintenance and resumes once it is complete.
Continuing to modify event-exporter-cfg, the YAML is as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-exporter-cfg
  namespace: monitoring
data:
  config.yaml: |
    logLevel: error
    logFormat: json
    route:
      routes:
        - match:
            - receiver: "dump"
        - drop:
            - type: "Normal"
            - source:
                host: "worker-1"
            - namespace: "cattle-monitoring-system"
            - name: "*emqx*"
            - kind: "Pod|Deployment|ReplicaSet"
            - labels:
                version: "dev"
          match:
            - receiver: "feishu"
    receivers:
      - name: "dump"
        stdout: {}
      - name: "feishu"
        webhook:
          endpoint: "https://open.feishu.cn/open-apis/bot/v2/hook/..."
          headers:
            Content-Type: application/json
          layout:
            msg_type: interactive
            card:
              config:
                wide_screen_mode: true
                enable_forward: true
              header:
                title:
                  tag: plain_text
                  content: xxx test K3S cluster alert
                template: red
              elements:
                - tag: div
                  text:
                    tag: lark_md
                    content: "**EventID:** {{ .UID }}\n**EventNamespace:** {{ .InvolvedObject.Namespace }}\n**EventName:** {{ .InvolvedObject.Name }}\n**EventType:** {{ .Type }}\n**EventKind:** {{ .InvolvedObject.Kind }}\n**EventReason:** {{ .Reason }}\n**EventTime:** {{ .LastTimestamp }}\n**EventMessage:** {{ .Message }}\n**EventComponent:** {{ .Source.Component }}\n**EventHost:** {{ .Source.Host }}\n**EventLabels:** {{ toJson .InvolvedObject.Labels}}\n**EventAnnotations:** {{ toJson .InvolvedObject.Annotations}}"
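The planned-maintenance scenario comes down to adding a host rule to the drop list before the work starts and removing it afterwards. A minimal sketch of that toggle, assuming the drop list is held as plain Python data in the same shape as the YAML above; the function names mute_host and unmute_host are hypothetical, and writing the result back into the ConfigMap is left out.

```python
def mute_host(drop_rules, host):
    """Return a new drop list that also drops events from `host`."""
    rule = {"source": {"host": host}}
    if rule in drop_rules:
        return list(drop_rules)  # already muted, nothing to add
    return list(drop_rules) + [rule]


def unmute_host(drop_rules, host):
    """Return a new drop list without the rule for `host`."""
    rule = {"source": {"host": host}}
    return [r for r in drop_rules if r != rule]


# Default drop list from the earlier configuration.
default_rules = [{"type": "Normal"}]

during_maintenance = mute_host(default_rules, "worker-1")
after_maintenance = unmute_host(during_maintenance, "worker-1")

print(during_maintenance)
print(after_maintenance)
```

Both functions return new lists rather than editing in place, so the default rules stay available for restoring the configuration once maintenance ends.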
The default drop rule is - type: "Normal", which means no alerts are sent for Normal-level events.
The following rules are now added:
- source:
    host: "worker-1"
- namespace: "cattle-monitoring-system"
- name: "*emqx*"
- kind: "Pod|Deployment|ReplicaSet"
- labels:
    version: "dev"
- host: "worker-1": no alerts for node worker-1;
- namespace: "cattle-monitoring-system": no alerts for the namespace cattle-monitoring-system;
- name: "*emqx*": no alerts for objects whose name (usually a pod name) contains emqx;
- kind: "Pod|Deployment|ReplicaSet": no alerts for Pod, Deployment, or ReplicaSet objects (in other words, alerts related to applications and components are ignored);
- version: "dev": no alerts for objects carrying the label version: "dev" (this can be used to mute alerts for a specific application).
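The rule semantics described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the exporter's own code: rule values are treated as unanchored regular expressions (which is what the alternation in "Pod|Deployment|ReplicaSet" suggests), an event is dropped as soon as any one rule matches, and the names rule_matches, should_drop, and is_valid_regex are hypothetical. One caveat: "*emqx*" is glob-style and does not compile as a regular expression in Python's re (nor in Go's regexp), so the sketch uses the plain pattern emqx instead. Which rule field names the exporter actually accepts should be checked against its documentation.

```python
import re

# Drop rules in the same shape as the YAML above.
DROP_RULES = [
    {"type": "Normal"},
    {"source": {"host": "worker-1"}},
    {"namespace": "cattle-monitoring-system"},
    {"name": "emqx"},  # the article writes "*emqx*", which is not a valid regex
    {"kind": "Pod|Deployment|ReplicaSet"},
    {"labels": {"version": "dev"}},
]


def is_valid_regex(pattern):
    """Report whether `pattern` compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def rule_matches(rule, event):
    """A rule matches only if every field it specifies matches the event."""
    obj = event.get("involvedObject", {})
    candidates = {
        "type": event.get("type", ""),
        "namespace": obj.get("namespace", ""),
        "name": obj.get("name", ""),
        "kind": obj.get("kind", ""),
    }
    for key, pattern in rule.items():
        if key == "source":
            host = event.get("source", {}).get("host", "")
            if not re.search(pattern.get("host", ""), host):
                return False
        elif key == "labels":
            labels = obj.get("labels", {})
            for label, label_pattern in pattern.items():
                if label not in labels or not re.search(label_pattern, labels[label]):
                    return False
        elif not re.search(pattern, candidates[key]):
            return False
    return True


def should_drop(event, rules=DROP_RULES):
    """An event is dropped if any single rule matches it."""
    return any(rule_matches(rule, event) for rule in rules)


emqx_pod = {
    "type": "Warning",
    "source": {"host": "worker-1"},
    "involvedObject": {"kind": "Pod", "namespace": "emqx", "name": "svclb-emqx-dt22t"},
}
node_event = {
    "type": "Warning",
    "source": {"host": "worker-2"},
    "involvedObject": {"kind": "Node", "name": "worker-2"},
}

print(should_drop(emqx_pod))      # True: matches the host, name, and kind rules
print(should_drop(node_event))    # False: no rule matches
print(is_valid_regex("*emqx*"))   # False
```

Because the matching in this sketch is unanchored, the pattern worker-1 would also match a host named worker-10; writing ^worker-1$ is stricter if such names exist.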
Final result
[Figure: the resulting alert notification]