Troubleshooting a K8s container being OOM killed
Background
The Nanhai group of our data service platform runs on Kubernetes with a container memory limit of 2 GB. Its containers were OOM killed repeatedly.
Startup command
java -XX:MaxRAMPercentage=70.0 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/apps/logs/ ***.jar
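With MaxRAMPercentage, the JVM derives its heap cap from the memory it detects as available, which in a container-aware JDK is the cgroup limit. A quick way to check what cap a given image actually gets is the sketch below (the grep pattern is illustrative; a container-aware JDK 8u191+ is assumed):
java -XX:MaxRAMPercentage=70.0 -XX:+PrintFlagsFinal -version | grep -i MaxHeapSize
# Inside a 2 GB container this should print roughly 1.4 GB (70% of 2 GB).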
Investigation
1. When the alert for instance memory above 95% fired, we dumped the JVM heap and analyzed it with VisualVM. No memory leak was found, so we assumed the process simply needed more memory than it had been allocated, raised the limit to 4 GB, and kept observing.
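For reference, the heap dump in this step can be taken with either of the following commands; a sketch only, where PID 1 and the file name are assumptions (PID 1 matches the jcmd call later in this article, and the directory comes from the startup command):
jcmd 1 GC.heap_dump /apps/logs/heap.hprof
# or with the older tooling:
jmap -dump:live,format=b,file=/apps/logs/heap.hprof 1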
2. The Nanhai and Shunde Docker instances were still OOM killed. When instance memory again exceeded 95% we dumped and analyzed the heap; still no leak, and the heap looked normal.
3. We suspected that processes other than the JVM inside the container were consuming container memory. When instance memory exceeded 95%, we compared the JVM process memory shown by top with the container-level memory reported by docker stats; the memory used by the other processes turned out to be negligible.
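The comparison in this step can be reproduced roughly as follows (a sketch; <container_id> is a hypothetical placeholder, and PID 1 for the JVM is an assumption):
docker stats --no-stream <container_id>   # run on the host: memory usage of the whole container vs its limit
top -b -n 1 -p 1                          # run inside the container: RSS of the JVM process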
4. The heap had not reached its cap, yet the JVM process as a whole exceeded the container memory limit. We therefore suspected that off-heap memory (native memory, thread stacks, Metaspace, and so on) was the large consumer, and ran:
/****/jcmd 1 VM.native_memory
The command reported that Native Memory Tracking was not enabled.
5. We also observed that before the container was OOM killed, the Java heap had not gone through a Full GC and had not reached its cap of 2.8 GB (4 GB × 0.7). The container runtime has no way to reclaim the JVM's memory on its behalf; because the heap never hit its limit before the container ran out of memory, no old-generation GC was triggered and heap usage never came down, and whenever the heap needed to grow the JVM simply kept asking the container for more memory.
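The "no Full GC before the kill" observation can be confirmed by watching the GC counters while container memory climbs; a minimal sketch, assuming the JVM runs as PID 1 and jstat is available in the image:
jstat -gcutil 1 5000
# O = old-generation occupancy (%), FGC/FGCT = Full GC count and time.
# If the container is killed while O stays well below 100% and FGC does not increase,
# the heap never reached its cap and no old-generation collection was triggered.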
6. We changed the JVM configuration: MaxRAMPercentage for the Nanhai group was lowered to 60, so its heap cap became 2.4 GB (4 GB × 0.6), while the Shunde group was left unchanged. We also added the -XX:NativeMemoryTracking=summary setting. On Aug 18 we restarted all instances so the new configuration took effect and observed for a while.
We found that the Nanhai group now ran Full GCs more frequently, and continued to observe. A sketch of the updated command and the NMT queries follows below.
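The updated startup command and the standard Native Memory Tracking queries used afterwards look roughly like this (flag values as described above; the jar name is elided as in the original command):
java -XX:MaxRAMPercentage=60.0 -XX:NativeMemoryTracking=summary -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/apps/logs/ ***.jar
jcmd 1 VM.native_memory summary        # breakdown: heap, thread stacks, Metaspace, code cache, GC, etc.
jcmd 1 VM.native_memory baseline       # take a baseline ...
jcmd 1 VM.native_memory summary.diff   # ... and diff later to see which area grows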
7. Later, another of the team's instances was OOM killed. We logged in to the host running the instance, checked /var/log/messages, located the entries from the time of the kill, and found that the container's memory usage had reached the instance limit:
Apr 7 11:09:47 mvxl6778 kernel: VM Thread invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=-998
Apr 7 11:09:47 mvxl6778 kernel: VM Thread cpuset=c822c3d382c5db25d7f025cd46c18a9990cb1dc4e9ea5463cce595d74a4fa97b mems_allowed=0
Apr 7 11:09:47 mvxl6778 kernel: CPU: 0 PID: 67347 Comm: VM Thread Not tainted 4.4.234-1.el7.elrepo.x86_64 #1
Apr 7 11:09:47 mvxl6778 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
Apr 7 11:09:47 mvxl6778 kernel: 0000000000000286 4e798947c7beb468 ffff88000a20bc80 ffffffff8134ee3a
Apr 7 11:09:47 mvxl6778 kernel: ffff88000a20bd58 ffff8800adb68000 ffff88000a20bce8 ffffffff8121192b
Apr 7 11:09:47 mvxl6778 kernel: ffffffff8119997c ffff8801621b2a00 0000000000000000 0000000000000206
Apr 7 11:09:47 mvxl6778 kernel: Call Trace:
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff8134ee3a>] dump_stack+0x6d/0x93
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff8121192b>] dump_header+0x57/0x1bb
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff8119997c>] ? find_lock_task_mm+0x3c/0x80
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81211a9d>] oom_kill_process.cold+0xe/0x30e
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff812068e6>] ? mem_cgroup_iter+0x146/0x320
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81208a18>] mem_cgroup_out_of_memory+0x2c8/0x310
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81209653>] mem_cgroup_oom_synchronize+0x2e3/0x310
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81204900>] ? get_mctgt_type+0x250/0x250
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff8119a31e>] pagefault_out_of_memory+0x3e/0xb0
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81067882>] mm_fault_error+0x62/0x150
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81068108>] __do_page_fault+0x3d8/0x3e0
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81068142>] do_page_fault+0x32/0x70
Apr 7 11:09:47 mvxl6778 kernel: [<ffffffff81736b08>] page_fault+0x28/0x30
Apr 7 11:09:47 mvxl6778 kernel: Task in /kubepods/podb874ea27-0b15-4afc-8500-8a62bd73cf3c/c822c3d382c5db25d7f025cd46c18a9990cb1dc4e9ea5463cce595d74a4fa97b killed as a result of limit of /kubepods/podb874ea27-0b15-4afc-8500-8a62bd73cf3c
Apr 7 11:09:47 mvxl6778 kernel: memory: usage 2048000kB, limit 2048000kB, failcnt 141268
Apr 7 11:09:47 mvxl6778 kernel: memory+swap: usage 2048000kB, limit 9007199254740988kB, failcnt 0
Apr 7 11:09:47 mvxl6778 kernel: kmem: usage 14208kB, limit 9007199254740988kB, failcnt 0
Apr 7 11:09:47 mvxl6778 kernel: Memory cgroup stats for /kubepods/podb874ea27-0b15-4afc-8500-8a62bd73cf3c: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 7 11:09:47 mvxl6778 kernel: Memory cgroup stats for /kubepods/podb874ea27-0b15-4afc-8500-8a62bd73cf3c/48bcfc65def8e5848caeeaf5f50ed6205b41c2da1b7ccddc7d1133ef7859373e: cache:0KB rss:2320KB rss_huge:2048KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:2320KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 7 11:09:47 mvxl6778 kernel: Memory cgroup stats for /kubepods/podb874ea27-0b15-4afc-8500-8a62bd73cf3c/c822c3d382c5db25d7f025cd46c18a9990cb1dc4e9ea5463cce595d74a4fa97b: cache:0KB rss:2031472KB rss_huge:1589248KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:2031444KB inactive_file:0KB active_file:0KB unevictable:0KB
Apr 7 11:09:47 mvxl6778 kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Apr 7 11:09:47 mvxl6778 kernel: [66952] 0 66952 66334 616 6 5 0 -998 pause
Apr 7 11:09:47 mvxl6778 kernel: [67231] 1000 67231 1906848 510149 1157 11 0 -998 java
Apr 7 11:09:47 mvxl6778 kernel: Memory cgroup out of memory: Kill process 67231 (java) score 0 or sacrifice child
Apr 7 11:09:48 mvxl6778 kernel: Killed process 67231 (java) total-vm:7627392kB, anon-rss:2023032kB, file-rss:17564kB
Apr 7 11:09:49 mvxl6778 kubelet: E0407 11:09:49.750842 42238 kubelet_volumes.go:154] orphaned pod "03b9c32c-e356-4547-8db8-35c91fee1666" found, but volume subpaths are still present on disk : There were a total of 5 errors similar to this. Turn up verbosity to see them.
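The relevant entries in /var/log/messages can be located on the host with a filter like the one below (a sketch; the pattern is illustrative):
grep -iE "oom-killer|Killed process" /var/log/messages
# Then read the surrounding lines: "memory: usage ... limit ..." shows the cgroup hitting its limit,
# and "Killed process ... (java)" shows the JVM's anon-rss at the moment of the kill.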
8. Running dmesg on the host showed the same pattern:
[27837536.622655] Task in /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/053c6ef5780bce8c1be542c38470d065d1cfb3630b81bc739d8990a5289fec33 killed as a result of limit of /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801
[27837536.622664] memory: usage 3072000kB, limit 3072000kB, failcnt 49
[27837536.622665] memory+swap: usage 3072000kB, limit 9007199254740988kB, failcnt 0
[27837536.622667] kmem: usage 15968kB, limit 9007199254740988kB, failcnt 0
[27837536.622668] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[27837536.622730] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/3ecc1d2e0e14b13aafdeaf57432657154d9c23c6bba0d14b5458f7e226670bb6: cache:0KB rss:2284KB rss_huge:2048KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:2284KB inactive_file:0KB active_file:0KB unevictable:0KB
[27837536.622784] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/053c6ef5780bce8c1be542c38470d065d1cfb3630b81bc739d8990a5289fec33: cache:36KB rss:3053712KB rss_huge:1443840KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:3053712KB inactive_file:4KB active_file:4KB unevictable:0KB
[27837536.622831] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[27837536.622971] [113432] 0 113347 66334 582 7 5 0 -998 pause
[27837536.622979] [113478] 1000 113478 2231341 761114 1655 12 0 -998 java
[27837536.622982] [114249] 1000 114249 3780 130 13 3 0 -998 sh
[27837536.622984] [114255] 1000 114255 3815 196 14 3 0 -998 bash
[27837536.622987] [114430] 1000 114430 3780 135 12 3 0 -998 sh
[27837536.622990] [114438] 1000 114438 3815 205 13 3 0 -998 bash
[27837536.622995] [78138] 1000 78138 3780 103 13 3 0 -998 sh
[27837536.622998] [78143] 1000 78143 3815 202 13 3 0 -998 bash
[27837536.623000] [27969] 1000 27969 3780 131 12 3 0 -998 sh
[27837536.623002] [27992] 1000 27992 3815 203 13 3 0 -998 bash
[27837536.623025] Memory cgroup out of memory: Kill process 113478 (java) score 0 or sacrifice child
[27837536.625233] Killed process 113478 (java) total-vm:8925364kB, anon-rss:3041072kB, file-rss:3384kB
[27838779.764721] VM Thread invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=-998
[27838779.764724] VM Thread cpuset=ad1ed05ee29bbaba6b3d8b464ce8cd3ee8aeb028fbd3ffd14566d8a25824fe47 mems_allowed=0
[27838779.764731] CPU: 5 PID: 108037 Comm: VM Thread Not tainted 4.4.234-1.el7.elrepo.x86_64 #1
[27838779.764732] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[27838779.764734] 0000000000000286 437aa04ae9c61e69 ffff880267fdfc80 ffffffff8134ee3a
[27838779.764736] ffff880267fdfd58 ffff880136120000 ffff880267fdfce8 ffffffff8121192b
[27838779.764738] ffffffff8119997c ffff8803b8012a00 0000000000000000 0000000000000206
[27838779.764740] Call Trace:
[27838779.764748] [<ffffffff8134ee3a>] dump_stack+0x6d/0x93
[27838779.764752] [<ffffffff8121192b>] dump_header+0x57/0x1bb
[27838779.764757] [<ffffffff8119997c>] ? find_lock_task_mm+0x3c/0x80
[27838779.764759] [<ffffffff81211a9d>] oom_kill_process.cold+0xe/0x30e
[27838779.764763] [<ffffffff812068e6>] ? mem_cgroup_iter+0x146/0x320
[27838779.764765] [<ffffffff81208a18>] mem_cgroup_out_of_memory+0x2c8/0x310
[27838779.764767] [<ffffffff81209653>] mem_cgroup_oom_synchronize+0x2e3/0x310
[27838779.764769] [<ffffffff81204900>] ? get_mctgt_type+0x250/0x250
[27838779.764771] [<ffffffff8119a31e>] pagefault_out_of_memory+0x3e/0xb0
[27838779.764775] [<ffffffff81067882>] mm_fault_error+0x62/0x150
[27838779.764776] [<ffffffff81068108>] __do_page_fault+0x3d8/0x3e0
[27838779.764778] [<ffffffff81068142>] do_page_fault+0x32/0x70
[27838779.764782] [<ffffffff81736b08>] page_fault+0x28/0x30
[27838779.764783] Task in /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/ad1ed05ee29bbaba6b3d8b464ce8cd3ee8aeb028fbd3ffd14566d8a25824fe47 killed as a result of limit of /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801
[27838779.764788] memory: usage 3072000kB, limit 3072000kB, failcnt 32457
[27838779.764789] memory+swap: usage 3072000kB, limit 9007199254740988kB, failcnt 0
[27838779.764790] kmem: usage 15332kB, limit 9007199254740988kB, failcnt 0
[27838779.764791] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[27838779.764839] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/053c6ef5780bce8c1be542c38470d065d1cfb3630b81bc739d8990a5289fec33: cache:4KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:4KB unevictable:0KB
[27838779.764866] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/6f518a65cbfb4f157a760f96b8562da00d1848918432cc622575a022070819f9: cache:0KB rss:292KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:292KB inactive_file:0KB active_file:0KB unevictable:0KB
[27838779.764902] Memory cgroup stats for /kubepods/podc3a7bd5b-6a9a-41f2-b42c-ef1889203801/ad1ed05ee29bbaba6b3d8b464ce8cd3ee8aeb028fbd3ffd14566d8a25824fe47: cache:4KB rss:3056368KB rss_huge:1458176KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:3056368KB inactive_file:0KB active_file:4KB unevictable:0KB
[27838779.764933] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[27838779.765060] [107843] 0 107843 66334 120 6 5 0 -998 pause
[27838779.765063] [107964] 1000 107964 2174971 765281 1667 12 0 -998 java
[27838779.765070] Memory cgroup out of memory: Kill process 107964 (java) score 0 or sacrifice child
[27838779.766467] Killed process 107964 (java) total-vm:8699884kB, anon-rss:3043596kB, file-rss:17528kB
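The same entries can be pulled out of the kernel ring buffer without reading the full output, for example (on most distributions dmesg -T prints human-readable timestamps):
dmesg -T | grep -iE "oom-killer|Memory cgroup out of memory|Killed process"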
Conclusion
If a container is OOM killed while the JVM inside it never ran a Full GC, the most likely cause is that MaxRAMPercentage is set too high: the heap never reaches its cap, so no old-generation GC is triggered, yet the total memory used by the Java process exceeds the container's limit and the process is killed.
Besides the heap, Metaspace and direct memory can also push the JVM process past the container limit, so make sure they are capped appropriately as well.
Can the JVM process itself detect the cgroup limit? With the JDK 1.8 build we are running it does not appear to help here; container awareness in JDK 8 depends on the minor version (it was only backported in 8u191 via -XX:+UseContainerSupport), and even then the detected limit is only used to size the heap and other JVM-managed areas, not to cap the total process footprint.
Solution
In this situation, lower MaxRAMPercentage appropriately so that heap GC is triggered sooner and more headroom is left for off-heap memory, or raise the instance's memory limit.
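As an illustration of the first option, a possible adjusted startup command is sketched below. The heap percentage matches the change made in step 6; the explicit Metaspace, direct-memory, and stack-size caps are illustrative values, not ones taken from this incident, and need to be tuned against the NMT output:
java -XX:MaxRAMPercentage=60.0 -XX:MaxMetaspaceSize=256m -XX:MaxDirectMemorySize=256m -Xss512k -XX:NativeMemoryTracking=summary -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/apps/logs/ ***.jar
# Alternatively, raise the container memory limit so that heap cap plus off-heap areas fit with headroom.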
Final notes
The above is based on my personal experience; I hope it serves as a useful reference, and I hope you will continue to support 腳本之家.