[{"data":1,"prerenderedAt":1407},["ShallowReactive",2],{"blog-en-prometheus-monitoring-kubernetes-guide":3,"blog-en-prometheus-monitoring-kubernetes-guide-alt":522},{"id":4,"title":5,"author":6,"body":7,"date":1393,"description":1394,"extension":1395,"image":138,"locale":1396,"meta":1397,"navigation":522,"path":1398,"seo":1399,"stem":1400,"tags":1401,"__hash__":1406},"blog\u002Fblog\u002Fen\u002Fprometheus-monitoring-kubernetes-guide.md","The Complete Guide to Monitoring Kubernetes with Prometheus","Kubo Team",{"type":8,"value":9,"toc":1370},"minimark",[10,28,33,47,59,62,96,105,114,118,123,132,296,303,356,359,363,376,473,479,483,497,501,539,543,591,595,638,642,651,655,889,893,1116,1125,1132,1136,1151,1155,1194,1197,1201,1209,1229,1233,1248,1252,1307,1311,1314,1344,1354,1366],[11,12,13,14,21,22,27],"p",{},"Running Kubernetes in production without proper monitoring is flying blind. ",[15,16,20],"a",{"href":17,"rel":18},"https:\u002F\u002Fprometheus.io\u002F",[19],"nofollow","Prometheus",", a CNCF Graduated project, has established itself as the de facto standard for cloud-native monitoring. Whether you are running a large-scale cluster or a lightweight K3s-based platform like ",[15,23,26],{"href":24,"rel":25},"https:\u002F\u002Fkubo.hexabase.io\u002F",[19],"Kubo",", Prometheus provides powerful monitoring capabilities with minimal overhead. This guide covers everything from initial setup to production-grade operations.",[29,30,32],"h2",{"id":31},"prometheus-architecture-and-core-concepts","Prometheus Architecture and Core Concepts",[11,34,35,36,41,42,46],{},"Prometheus was originally developed at SoundCloud in 2012 and became the second project to join the CNCF after Kubernetes. 
As described in the ",[15,37,40],{"href":38,"rel":39},"https:\u002F\u002Fprometheus.io\u002Fdocs\u002Fintroduction\u002Foverview\u002F",[19],"official overview",", its distinguishing characteristic is a ",[43,44,45],"strong",{},"pull-based"," metrics collection model: the Prometheus server periodically scrapes HTTP endpoints from monitored targets and stores the data as time series.",[11,48,49,50,53,54,58],{},"The data model is ",[43,51,52],{},"multi-dimensional"," -- each time series is identified by a metric name and a set of key-value pairs called labels. For example, ",[55,56,57],"code",{},"http_requests_total{method=\"GET\", status=\"200\"}"," allows filtering and aggregation across multiple dimensions from a single metric.",[11,60,61],{},"The core components include:",[63,64,65,72,78,84,90],"ul",{},[66,67,68,71],"li",{},[43,69,70],{},"Prometheus Server",": Scrapes and stores time series data",[66,73,74,77],{},[43,75,76],{},"Alertmanager",": Handles alert routing, deduplication, and notifications (Slack, PagerDuty, email)",[66,79,80,83],{},[43,81,82],{},"Pushgateway",": An intermediary for short-lived batch jobs to push metrics",[66,85,86,89],{},[43,87,88],{},"Exporters",": Node Exporter (hardware\u002FOS metrics), kube-state-metrics (Kubernetes object states), and many others",[66,91,92,95],{},[43,93,94],{},"Client Libraries",": Available for Go, Java, Python, Ruby, and more",[11,97,98,99,104],{},"According to the ",[15,100,103],{"href":101,"rel":102},"https:\u002F\u002Fwww.sysdig.com\u002Fblog\u002Fkubernetes-monitoring-prometheus",[19],"Sysdig comprehensive guide",", Prometheus servers are autonomous -- they run as standalone Go binaries with no dependency on distributed storage, making deployment and operations remarkably simple.",[11,106,107,108,113],{},"If you are interested in AI-powered operations automation, see how ",[15,109,112],{"href":110,"rel":111},"https:\u002F\u002Fwww.hexabase.com\u002Fproduct\u002Fcaptain-ai\u002F",[19],"Captain.AI"," 
enhances Kubernetes operational efficiency.",[29,115,117],{"id":116},"deploying-prometheus-on-kubernetes","Deploying Prometheus on Kubernetes",[119,120,122],"h3",{"id":121},"declarative-management-with-prometheus-operator","Declarative Management with Prometheus Operator",[11,124,125,126,131],{},"For production environments, the ",[15,127,130],{"href":128,"rel":129},"https:\u002F\u002Fwww.plural.sh\u002Fblog\u002Fprometheus-operator-kubernetes-guide\u002F",[19],"Prometheus Operator"," is the recommended approach. It uses Kubernetes Custom Resource Definitions (CRDs) to declaratively manage the entire Prometheus configuration through manifests.",[133,134,139],"pre",{"className":135,"code":136,"language":137,"meta":138,"style":138},"language-yaml shiki shiki-themes tokyo-night","apiVersion: monitoring.coreos.com\u002Fv1\nkind: Prometheus\nmetadata:\n  name: k8s-prometheus\n  namespace: monitoring\nspec:\n  replicas: 2\n  serviceAccountName: prometheus\n  serviceMonitorSelector:\n    matchLabels:\n      team: platform\n  retention: 30d\n  resources:\n    requests:\n      memory: 400Mi\n","yaml","",[55,140,141,158,169,178,189,200,208,220,231,239,247,258,269,277,285],{"__ignoreMap":138},[142,143,146,150,154],"span",{"class":144,"line":145},"line",1,[142,147,149],{"class":148},"s0U2E","apiVersion",[142,151,153],{"class":152},"sAklC",":",[142,155,157],{"class":156},"sPY7s"," monitoring.coreos.com\u002Fv1\n",[142,159,161,164,166],{"class":144,"line":160},2,[142,162,163],{"class":148},"kind",[142,165,153],{"class":152},[142,167,168],{"class":156}," Prometheus\n",[142,170,172,175],{"class":144,"line":171},3,[142,173,174],{"class":148},"metadata",[142,176,177],{"class":152},":\n",[142,179,181,184,186],{"class":144,"line":180},4,[142,182,183],{"class":148},"  name",[142,185,153],{"class":152},[142,187,188],{"class":156}," k8s-prometheus\n",[142,190,192,195,197],{"class":144,"line":191},5,[142,193,194],{"class":148},"  
namespace",[142,196,153],{"class":152},[142,198,199],{"class":156}," monitoring\n",[142,201,203,206],{"class":144,"line":202},6,[142,204,205],{"class":148},"spec",[142,207,177],{"class":152},[142,209,211,214,216],{"class":144,"line":210},7,[142,212,213],{"class":148},"  replicas",[142,215,153],{"class":152},[142,217,219],{"class":218},"sOJ5S"," 2\n",[142,221,223,226,228],{"class":144,"line":222},8,[142,224,225],{"class":148},"  serviceAccountName",[142,227,153],{"class":152},[142,229,230],{"class":156}," prometheus\n",[142,232,234,237],{"class":144,"line":233},9,[142,235,236],{"class":148},"  serviceMonitorSelector",[142,238,177],{"class":152},[142,240,242,245],{"class":144,"line":241},10,[142,243,244],{"class":148},"    matchLabels",[142,246,177],{"class":152},[142,248,250,253,255],{"class":144,"line":249},11,[142,251,252],{"class":148},"      team",[142,254,153],{"class":152},[142,256,257],{"class":156}," platform\n",[142,259,261,264,266],{"class":144,"line":260},12,[142,262,263],{"class":148},"  retention",[142,265,153],{"class":152},[142,267,268],{"class":156}," 30d\n",[142,270,272,275],{"class":144,"line":271},13,[142,273,274],{"class":148},"  resources",[142,276,177],{"class":152},[142,278,280,283],{"class":144,"line":279},14,[142,281,282],{"class":148},"    requests",[142,284,177],{"class":152},[142,286,288,291,293],{"class":144,"line":287},15,[142,289,290],{"class":148},"      memory",[142,292,153],{"class":152},[142,294,295],{"class":156}," 400Mi\n",[11,297,298,299,302],{},"The quickest path to a full monitoring stack is the ",[43,300,301],{},"kube-prometheus-stack"," Helm chart:",[133,304,308],{"className":305,"code":306,"language":307,"meta":138,"style":138},"language-bash shiki shiki-themes tokyo-night","helm repo add prometheus-community https:\u002F\u002Fprometheus-community.github.io\u002Fhelm-charts\nhelm install kube-prometheus prometheus-community\u002Fkube-prometheus-stack \\\n  --namespace monitoring 
--create-namespace\n","bash",[55,309,310,328,344],{"__ignoreMap":138},[142,311,312,316,319,322,325],{"class":144,"line":145},[142,313,315],{"class":314},"sE3pS","helm",[142,317,318],{"class":156}," repo",[142,320,321],{"class":156}," add",[142,323,324],{"class":156}," prometheus-community",[142,326,327],{"class":156}," https:\u002F\u002Fprometheus-community.github.io\u002Fhelm-charts\n",[142,329,330,332,335,338,341],{"class":144,"line":160},[142,331,315],{"class":314},[142,333,334],{"class":156}," install",[142,336,337],{"class":156}," kube-prometheus",[142,339,340],{"class":156}," prometheus-community\u002Fkube-prometheus-stack",[142,342,343],{"class":152}," \\\n",[142,345,346,350,353],{"class":144,"line":171},[142,347,349],{"class":348},"sT800","  --namespace",[142,351,352],{"class":156}," monitoring",[142,354,355],{"class":348}," --create-namespace\n",[11,357,358],{},"This chart bundles Prometheus Server, Alertmanager, Grafana, Node Exporter, and kube-state-metrics for an out-of-the-box monitoring stack.",[119,360,362],{"id":361},"kubernetes-service-discovery","Kubernetes Service Discovery",[11,364,365,366,371,372,375],{},"Prometheus integrates with the Kubernetes API to ",[15,367,370],{"href":368,"rel":369},"https:\u002F\u002Fprometheus.io\u002Fdocs\u002Fprometheus\u002Flatest\u002Fconfiguration\u002Fconfiguration\u002F#kubernetes_sd_config",[19],"automatically discover"," Pods, Services, Endpoints, and Nodes. 
Using ",[55,373,374],{},"ServiceMonitor"," resources, you can flexibly add monitoring targets based on labels:",[133,377,379],{"className":135,"code":378,"language":137,"meta":138,"style":138},"apiVersion: monitoring.coreos.com\u002Fv1\nkind: ServiceMonitor\nmetadata:\n  name: my-app-monitor\nspec:\n  selector:\n    matchLabels:\n      app: my-app\n  endpoints:\n  - port: metrics\n    interval: 15s\n",[55,380,381,389,398,404,413,419,426,432,442,449,463],{"__ignoreMap":138},[142,382,383,385,387],{"class":144,"line":145},[142,384,149],{"class":148},[142,386,153],{"class":152},[142,388,157],{"class":156},[142,390,391,393,395],{"class":144,"line":160},[142,392,163],{"class":148},[142,394,153],{"class":152},[142,396,397],{"class":156}," ServiceMonitor\n",[142,399,400,402],{"class":144,"line":171},[142,401,174],{"class":148},[142,403,177],{"class":152},[142,405,406,408,410],{"class":144,"line":180},[142,407,183],{"class":148},[142,409,153],{"class":152},[142,411,412],{"class":156}," my-app-monitor\n",[142,414,415,417],{"class":144,"line":191},[142,416,205],{"class":148},[142,418,177],{"class":152},[142,420,421,424],{"class":144,"line":202},[142,422,423],{"class":148},"  selector",[142,425,177],{"class":152},[142,427,428,430],{"class":144,"line":210},[142,429,244],{"class":148},[142,431,177],{"class":152},[142,433,434,437,439],{"class":144,"line":222},[142,435,436],{"class":148},"      app",[142,438,153],{"class":152},[142,440,441],{"class":156}," my-app\n",[142,443,444,447],{"class":144,"line":233},[142,445,446],{"class":148},"  endpoints",[142,448,177],{"class":152},[142,450,451,455,458,460],{"class":144,"line":241},[142,452,454],{"class":453},"sgJMe","  -",[142,456,457],{"class":148}," port",[142,459,153],{"class":152},[142,461,462],{"class":156}," metrics\n",[142,464,465,468,470],{"class":144,"line":249},[142,466,467],{"class":148},"    interval",[142,469,153],{"class":152},[142,471,472],{"class":156}," 
15s\n",[11,474,475,478],{},[15,476,26],{"href":24,"rel":477},[19]," is built on K3s with strong affinity for the CNCF ecosystem, making Prometheus deployment seamless.",[29,480,482],{"id":481},"mastering-promql-practical-query-examples","Mastering PromQL: Practical Query Examples",[11,484,485,490,491,496],{},[15,486,489],{"href":487,"rel":488},"https:\u002F\u002Fprometheus.io\u002Fdocs\u002Fprometheus\u002Flatest\u002Fquerying\u002Fbasics\u002F",[19],"PromQL"," (Prometheus Query Language) is the powerful query language that unlocks the full potential of Prometheus' multi-dimensional data model. As emphasized by the ",[15,492,495],{"href":493,"rel":494},"https:\u002F\u002Flogz.io\u002Fblog\u002Fkubernetes-monitoring-prometheus-guide\u002F",[19],"Logz.io guide",", well-designed PromQL queries are the key to proactive monitoring.",[119,498,500],{"id":499},"cpu-and-memory-utilization","CPU and Memory Utilization",[133,502,506],{"className":503,"code":504,"language":505,"meta":138,"style":138},"language-promql shiki shiki-themes tokyo-night","# Node CPU usage (%)\n100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)\n\n# Pod memory usage as percentage of limits\ncontainer_memory_working_set_bytes{container!=\"POD\",container!=\"\"}\n  \u002F on(namespace, pod, container) kube_pod_container_resource_limits{resource=\"memory\"} * 100\n","promql",[55,507,508,513,518,524,529,534],{"__ignoreMap":138},[142,509,510],{"class":144,"line":145},[142,511,512],{},"# Node CPU usage (%)\n",[142,514,515],{"class":144,"line":160},[142,516,517],{},"100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)\n",[142,519,520],{"class":144,"line":171},[142,521,523],{"emptyLinePlaceholder":522},true,"\n",[142,525,526],{"class":144,"line":180},[142,527,528],{},"# Pod memory usage as percentage of 
limits\n",[142,530,531],{"class":144,"line":191},[142,532,533],{},"container_memory_working_set_bytes{container!=\"POD\",container!=\"\"}\n",[142,535,536],{"class":144,"line":202},[142,537,538],{},"  \u002F on(namespace, pod, container) kube_pod_container_resource_limits{resource=\"memory\"} * 100\n",[119,540,542],{"id":541},"request-rate-and-error-rate-red-method","Request Rate and Error Rate (RED Method)",[133,544,546],{"className":503,"code":545,"language":505,"meta":138,"style":138},"# Request rate (per second)\nsum(rate(http_requests_total[5m])) by (service)\n\n# Error rate (%)\nsum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)\n  \u002F sum(rate(http_requests_total[5m])) by (service) * 100\n\n# P99 latency\nhistogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))\n",[55,547,548,553,558,562,567,572,577,581,586],{"__ignoreMap":138},[142,549,550],{"class":144,"line":145},[142,551,552],{},"# Request rate (per second)\n",[142,554,555],{"class":144,"line":160},[142,556,557],{},"sum(rate(http_requests_total[5m])) by (service)\n",[142,559,560],{"class":144,"line":171},[142,561,523],{"emptyLinePlaceholder":522},[142,563,564],{"class":144,"line":180},[142,565,566],{},"# Error rate (%)\n",[142,568,569],{"class":144,"line":191},[142,570,571],{},"sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)\n",[142,573,574],{"class":144,"line":202},[142,575,576],{},"  \u002F sum(rate(http_requests_total[5m])) by (service) * 100\n",[142,578,579],{"class":144,"line":210},[142,580,523],{"emptyLinePlaceholder":522},[142,582,583],{"class":144,"line":222},[142,584,585],{},"# P99 latency\n",[142,587,588],{"class":144,"line":233},[142,589,590],{},"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))\n",[119,592,594],{"id":593},"kubernetes-specific-queries","Kubernetes-Specific Queries",[133,596,598],{"className":503,"code":597,"language":505,"meta":138,"style":138},"# Count of Pending 
pods\nsum(kube_pod_status_phase{phase=\"Pending\"})\n\n# Detect CrashLoopBackOff\nincrease(kube_pod_container_status_restarts_total[1h]) > 5\n\n# PVC usage percentage\nkubelet_volume_stats_used_bytes \u002F kubelet_volume_stats_capacity_bytes * 100\n",[55,599,600,605,610,614,619,624,628,633],{"__ignoreMap":138},[142,601,602],{"class":144,"line":145},[142,603,604],{},"# Count of Pending pods\n",[142,606,607],{"class":144,"line":160},[142,608,609],{},"sum(kube_pod_status_phase{phase=\"Pending\"})\n",[142,611,612],{"class":144,"line":171},[142,613,523],{"emptyLinePlaceholder":522},[142,615,616],{"class":144,"line":180},[142,617,618],{},"# Detect CrashLoopBackOff\n",[142,620,621],{"class":144,"line":191},[142,622,623],{},"increase(kube_pod_container_status_restarts_total[1h]) > 5\n",[142,625,626],{"class":144,"line":202},[142,627,523],{"emptyLinePlaceholder":522},[142,629,630],{"class":144,"line":210},[142,631,632],{},"# PVC usage percentage\n",[142,634,635],{"class":144,"line":222},[142,636,637],{},"kubelet_volume_stats_used_bytes \u002F kubelet_volume_stats_capacity_bytes * 100\n",[29,639,641],{"id":640},"alert-design-with-alertmanager","Alert Design with Alertmanager",[11,643,644,645,650],{},"Reliable operations require a well-designed alerting strategy. 
By combining ",[15,646,649],{"href":647,"rel":648},"https:\u002F\u002Fprometheus.io\u002Fdocs\u002Falerting\u002Flatest\u002Falerting_rules\u002F",[19],"Prometheus alerting rules"," with Alertmanager, you can achieve early fault detection and targeted notifications.",[119,652,654],{"id":653},"defining-alert-rules","Defining Alert Rules",[133,656,658],{"className":135,"code":657,"language":137,"meta":138,"style":138},"apiVersion: monitoring.coreos.com\u002Fv1\nkind: PrometheusRule\nmetadata:\n  name: kubernetes-alerts\n  namespace: monitoring\nspec:\n  groups:\n  - name: kubernetes.rules\n    rules:\n    - alert: PodCrashLooping\n      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5\n      for: 10m\n      labels:\n        severity: warning\n      annotations:\n        summary: \"Pod {{ $labels.namespace }}\u002F{{ $labels.pod }} is restarting frequently\"\n    - alert: HighMemoryUsage\n      expr: |\n        container_memory_working_set_bytes{container!=\"POD\",container!=\"\"}\n        \u002F on(namespace,pod) kube_pod_container_resource_limits{resource=\"memory\"} > 0.9\n      for: 5m\n      labels:\n        severity: critical\n      annotations:\n        summary: \"Memory usage exceeds 90%\"\n",[55,659,660,668,677,683,692,700,706,713,725,732,745,755,765,772,782,789,806,818,829,835,841,851,858,868,875],{"__ignoreMap":138},[142,661,662,664,666],{"class":144,"line":145},[142,663,149],{"class":148},[142,665,153],{"class":152},[142,667,157],{"class":156},[142,669,670,672,674],{"class":144,"line":160},[142,671,163],{"class":148},[142,673,153],{"class":152},[142,675,676],{"class":156}," PrometheusRule\n",[142,678,679,681],{"class":144,"line":171},[142,680,174],{"class":148},[142,682,177],{"class":152},[142,684,685,687,689],{"class":144,"line":180},[142,686,183],{"class":148},[142,688,153],{"class":152},[142,690,691],{"class":156}," 
kubernetes-alerts\n",[142,693,694,696,698],{"class":144,"line":191},[142,695,194],{"class":148},[142,697,153],{"class":152},[142,699,199],{"class":156},[142,701,702,704],{"class":144,"line":202},[142,703,205],{"class":148},[142,705,177],{"class":152},[142,707,708,711],{"class":144,"line":210},[142,709,710],{"class":148},"  groups",[142,712,177],{"class":152},[142,714,715,717,720,722],{"class":144,"line":222},[142,716,454],{"class":453},[142,718,719],{"class":148}," name",[142,721,153],{"class":152},[142,723,724],{"class":156}," kubernetes.rules\n",[142,726,727,730],{"class":144,"line":233},[142,728,729],{"class":148},"    rules",[142,731,177],{"class":152},[142,733,734,737,740,742],{"class":144,"line":241},[142,735,736],{"class":453},"    -",[142,738,739],{"class":148}," alert",[142,741,153],{"class":152},[142,743,744],{"class":156}," PodCrashLooping\n",[142,746,747,750,752],{"class":144,"line":249},[142,748,749],{"class":148},"      expr",[142,751,153],{"class":152},[142,753,754],{"class":156}," increase(kube_pod_container_status_restarts_total[1h]) > 5\n",[142,756,757,760,762],{"class":144,"line":260},[142,758,759],{"class":148},"      for",[142,761,153],{"class":152},[142,763,764],{"class":156}," 10m\n",[142,766,767,770],{"class":144,"line":271},[142,768,769],{"class":148},"      labels",[142,771,177],{"class":152},[142,773,774,777,779],{"class":144,"line":279},[142,775,776],{"class":148},"        severity",[142,778,153],{"class":152},[142,780,781],{"class":156}," warning\n",[142,783,784,787],{"class":144,"line":287},[142,785,786],{"class":148},"      annotations",[142,788,177],{"class":152},[142,790,792,795,797,800,803],{"class":144,"line":791},16,[142,793,794],{"class":148},"        summary",[142,796,153],{"class":152},[142,798,799],{"class":152}," \"",[142,801,802],{"class":156},"Pod {{ $labels.namespace }}\u002F{{ $labels.pod }} is restarting 
frequently",[142,804,805],{"class":152},"\"\n",[142,807,809,811,813,815],{"class":144,"line":808},17,[142,810,736],{"class":453},[142,812,739],{"class":148},[142,814,153],{"class":152},[142,816,817],{"class":156}," HighMemoryUsage\n",[142,819,821,823,825],{"class":144,"line":820},18,[142,822,749],{"class":148},[142,824,153],{"class":152},[142,826,828],{"class":827},"sd1Qi"," |\n",[142,830,832],{"class":144,"line":831},19,[142,833,834],{"class":156},"        container_memory_working_set_bytes{container!=\"POD\",container!=\"\"}\n",[142,836,838],{"class":144,"line":837},20,[142,839,840],{"class":156},"        \u002F on(namespace,pod) kube_pod_container_resource_limits{resource=\"memory\"} > 0.9\n",[142,842,844,846,848],{"class":144,"line":843},21,[142,845,759],{"class":148},[142,847,153],{"class":152},[142,849,850],{"class":156}," 5m\n",[142,852,854,856],{"class":144,"line":853},22,[142,855,769],{"class":148},[142,857,177],{"class":152},[142,859,861,863,865],{"class":144,"line":860},23,[142,862,776],{"class":148},[142,864,153],{"class":152},[142,866,867],{"class":156}," critical\n",[142,869,871,873],{"class":144,"line":870},24,[142,872,786],{"class":148},[142,874,177],{"class":152},[142,876,878,880,882,884,887],{"class":144,"line":877},25,[142,879,794],{"class":148},[142,881,153],{"class":152},[142,883,799],{"class":152},[142,885,886],{"class":156},"Memory usage exceeds 90%",[142,888,805],{"class":152},[119,890,892],{"id":891},"alertmanager-notification-configuration","Alertmanager Notification Configuration",[133,894,896],{"className":135,"code":895,"language":137,"meta":138,"style":138},"route:\n  receiver: 'slack-notifications'\n  group_by: ['alertname', 'namespace']\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 4h\n  routes:\n  - match:\n      severity: critical\n    receiver: 'pagerduty-critical'\n\nreceivers:\n- name: 'slack-notifications'\n  slack_configs:\n  - channel: '#alerts'\n    send_resolved: true\n- name: 'pagerduty-critical'\n  
pagerduty_configs:\n  - service_key: '\u003Cservice-key>'\n",[55,897,898,905,921,952,962,971,981,988,997,1006,1020,1024,1031,1046,1053,1069,1079,1093,1100],{"__ignoreMap":138},[142,899,900,903],{"class":144,"line":145},[142,901,902],{"class":148},"route",[142,904,177],{"class":152},[142,906,907,910,912,915,918],{"class":144,"line":160},[142,908,909],{"class":148},"  receiver",[142,911,153],{"class":152},[142,913,914],{"class":152}," '",[142,916,917],{"class":156},"slack-notifications",[142,919,920],{"class":152},"'\n",[142,922,923,926,928,931,934,937,939,942,944,947,949],{"class":144,"line":171},[142,924,925],{"class":148},"  group_by",[142,927,153],{"class":152},[142,929,930],{"class":152}," [",[142,932,933],{"class":152},"'",[142,935,936],{"class":156},"alertname",[142,938,933],{"class":152},[142,940,941],{"class":152},",",[142,943,914],{"class":152},[142,945,946],{"class":156},"namespace",[142,948,933],{"class":152},[142,950,951],{"class":152},"]\n",[142,953,954,957,959],{"class":144,"line":180},[142,955,956],{"class":148},"  group_wait",[142,958,153],{"class":152},[142,960,961],{"class":156}," 30s\n",[142,963,964,967,969],{"class":144,"line":191},[142,965,966],{"class":148},"  group_interval",[142,968,153],{"class":152},[142,970,850],{"class":156},[142,972,973,976,978],{"class":144,"line":202},[142,974,975],{"class":148},"  repeat_interval",[142,977,153],{"class":152},[142,979,980],{"class":156}," 4h\n",[142,982,983,986],{"class":144,"line":210},[142,984,985],{"class":148},"  routes",[142,987,177],{"class":152},[142,989,990,992,995],{"class":144,"line":222},[142,991,454],{"class":453},[142,993,994],{"class":148}," match",[142,996,177],{"class":152},[142,998,999,1002,1004],{"class":144,"line":233},[142,1000,1001],{"class":148},"      severity",[142,1003,153],{"class":152},[142,1005,867],{"class":156},[142,1007,1008,1011,1013,1015,1018],{"class":144,"line":241},[142,1009,1010],{"class":148},"    
receiver",[142,1012,153],{"class":152},[142,1014,914],{"class":152},[142,1016,1017],{"class":156},"pagerduty-critical",[142,1019,920],{"class":152},[142,1021,1022],{"class":144,"line":249},[142,1023,523],{"emptyLinePlaceholder":522},[142,1025,1026,1029],{"class":144,"line":260},[142,1027,1028],{"class":148},"receivers",[142,1030,177],{"class":152},[142,1032,1033,1036,1038,1040,1042,1044],{"class":144,"line":271},[142,1034,1035],{"class":453},"-",[142,1037,719],{"class":148},[142,1039,153],{"class":152},[142,1041,914],{"class":152},[142,1043,917],{"class":156},[142,1045,920],{"class":152},[142,1047,1048,1051],{"class":144,"line":279},[142,1049,1050],{"class":148},"  slack_configs",[142,1052,177],{"class":152},[142,1054,1055,1057,1060,1062,1064,1067],{"class":144,"line":287},[142,1056,454],{"class":453},[142,1058,1059],{"class":148}," channel",[142,1061,153],{"class":152},[142,1063,914],{"class":152},[142,1065,1066],{"class":156},"#alerts",[142,1068,920],{"class":152},[142,1070,1071,1074,1076],{"class":144,"line":791},[142,1072,1073],{"class":148},"    send_resolved",[142,1075,153],{"class":152},[142,1077,1078],{"class":218}," true\n",[142,1080,1081,1083,1085,1087,1089,1091],{"class":144,"line":808},[142,1082,1035],{"class":453},[142,1084,719],{"class":148},[142,1086,153],{"class":152},[142,1088,914],{"class":152},[142,1090,1017],{"class":156},[142,1092,920],{"class":152},[142,1094,1095,1098],{"class":144,"line":820},[142,1096,1097],{"class":148},"  pagerduty_configs",[142,1099,177],{"class":152},[142,1101,1102,1104,1107,1109,1111,1114],{"class":144,"line":831},[142,1103,454],{"class":453},[142,1105,1106],{"class":148}," service_key",[142,1108,153],{"class":152},[142,1110,914],{"class":152},[142,1112,1113],{"class":156},"\u003Cservice-key>",[142,1115,920],{"class":152},[11,1117,1118,1119,1124],{},"As the ",[15,1120,1123],{"href":1121,"rel":1122},"https:\u002F\u002Fwww.apptio.com\u002Ftopics\u002Fkubernetes\u002Fmonitoring\u002Fprometheus\u002F",[19],"Apptio guide"," 
notes, running Alertmanager as a separate process ensures alerting continues to function even when Prometheus itself encounters issues.",[11,1126,1127,1128,1131],{},"By integrating with ",[15,1129,112],{"href":110,"rel":1130},[19],", you can build workflows where AI automatically analyzes root causes when alerts fire and suggests remediation actions.",[29,1133,1135],{"id":1134},"production-best-practices","Production Best Practices",[11,1137,1138,1139,1144,1145,1150],{},"Drawing from the ",[15,1140,1143],{"href":1141,"rel":1142},"https:\u002F\u002Ftrilio.io\u002Fkubernetes-best-practices\u002Fkubernetes-monitoring-best-practices\u002F",[19],"Trilio best practices guide"," and the ",[15,1146,1149],{"href":1147,"rel":1148},"https:\u002F\u002Fwww.plural.sh\u002Fblog\u002Fprometheus-kubernetes-monitoring-guide\u002F",[19],"Plural guide",", here are recommended configurations for production environments.",[119,1152,1154],{"id":1153},"high-availability","High Availability",[133,1156,1158],{"className":135,"code":1157,"language":137,"meta":138,"style":138},"spec:\n  replicas: 2\n  shards: 1\n  replicaExternalLabelName: __replica__\n",[55,1159,1160,1166,1174,1184],{"__ignoreMap":138},[142,1161,1162,1164],{"class":144,"line":145},[142,1163,205],{"class":148},[142,1165,177],{"class":152},[142,1167,1168,1170,1172],{"class":144,"line":160},[142,1169,213],{"class":148},[142,1171,153],{"class":152},[142,1173,219],{"class":218},[142,1175,1176,1179,1181],{"class":144,"line":171},[142,1177,1178],{"class":148},"  shards",[142,1180,153],{"class":152},[142,1182,1183],{"class":218}," 1\n",[142,1185,1186,1189,1191],{"class":144,"line":180},[142,1187,1188],{"class":148},"  replicaExternalLabelName",[142,1190,153],{"class":152},[142,1192,1193],{"class":156}," __replica__\n",[11,1195,1196],{},"Run multiple replicas and use Thanos or Cortex for long-term storage and a global query view.",[119,1198,1200],{"id":1199},"metrics-selection-and-optimization","Metrics Selection and 
Optimization",[11,1202,1118,1203,1208],{},[15,1204,1207],{"href":1205,"rel":1206},"https:\u002F\u002Fwww.tasrieit.com\u002Fblog\u002Fprometheus-monitoring-kubernetes-complete-guide-2026",[19],"Tasrie IT guide"," warns, indiscriminate collection of all available metrics leads to excessive storage costs. Adopt these strategies:",[63,1210,1211,1217,1223],{},[66,1212,1213,1216],{},[43,1214,1215],{},"Control label cardinality",": Avoid high-cardinality labels such as user IDs or request IDs",[66,1218,1219,1222],{},[43,1220,1221],{},"Use Recording Rules",": Pre-compute frequently used queries to reduce query load",[66,1224,1225,1228],{},[43,1226,1227],{},"Set appropriate retention",": Keep 15-30 days locally and offload to remote storage for long-term retention",[119,1230,1232],{"id":1231},"security-hardening","Security Hardening",[63,1234,1235,1238,1241],{},[66,1236,1237],{},"Restrict Prometheus access with NetworkPolicies",[66,1239,1240],{},"Apply the principle of least privilege with RBAC",[66,1242,1243,1244,1247],{},"Authenticate and encrypt ",[55,1245,1246],{},"\u002Fmetrics"," endpoints",[119,1249,1251],{"id":1250},"remote-storage-integration","Remote Storage Integration",[133,1253,1255],{"className":135,"code":1254,"language":137,"meta":138,"style":138},"remoteWrite:\n- url: \"http:\u002F\u002Fthanos-receive:19291\u002Fapi\u002Fv1\u002Freceive\"\n  queueConfig:\n    maxSamplesPerSend: 1000\n    batchSendDeadline: 5s\n",[55,1256,1257,1264,1280,1287,1297],{"__ignoreMap":138},[142,1258,1259,1262],{"class":144,"line":145},[142,1260,1261],{"class":148},"remoteWrite",[142,1263,177],{"class":152},[142,1265,1266,1268,1271,1273,1275,1278],{"class":144,"line":160},[142,1267,1035],{"class":453},[142,1269,1270],{"class":148}," 
url",[142,1272,153],{"class":152},[142,1274,799],{"class":152},[142,1276,1277],{"class":156},"http:\u002F\u002Fthanos-receive:19291\u002Fapi\u002Fv1\u002Freceive",[142,1279,805],{"class":152},[142,1281,1282,1285],{"class":144,"line":171},[142,1283,1284],{"class":148},"  queueConfig",[142,1286,177],{"class":152},[142,1288,1289,1292,1294],{"class":144,"line":180},[142,1290,1291],{"class":148},"    maxSamplesPerSend",[142,1293,153],{"class":152},[142,1295,1296],{"class":218}," 1000\n",[142,1298,1299,1302,1304],{"class":144,"line":191},[142,1300,1301],{"class":148},"    batchSendDeadline",[142,1303,153],{"class":152},[142,1305,1306],{"class":156}," 5s\n",[29,1308,1310],{"id":1309},"conclusion","Conclusion",[11,1312,1313],{},"Prometheus serves as the backbone of Kubernetes monitoring, providing a unified platform for metrics collection, visualization, and alerting. The key takeaways from this guide are:",[1315,1316,1317,1322,1328,1333,1338],"ol",{},[66,1318,1319,1321],{},[43,1320,130],{}," for declarative installation and management",[66,1323,1324,1327],{},[43,1325,1326],{},"Service Discovery"," for dynamic target detection",[66,1329,1330,1332],{},[43,1331,489],{}," for flexible querying and analysis",[66,1334,1335,1337],{},[43,1336,76],{}," for systematic alert design",[66,1339,1340,1343],{},[43,1341,1342],{},"HA configuration and remote storage"," for production-grade reliability",[11,1345,1346,1349,1350,1353],{},[15,1347,26],{"href":24,"rel":1348},[19]," is built on K3s with strong affinity for the CNCF ecosystem, providing an environment where monitoring tools like Prometheus can be deployed and utilized immediately. If you need help building or operating Kubernetes environments, consider ",[15,1351,26],{"href":24,"rel":1352},[19],".",[11,1355,1356,1357,1360,1361,1353],{},"For those interested in AI-powered Kubernetes operations automation, explore how ",[15,1358,112],{"href":110,"rel":1359},[19]," delivers intelligent operational support. 
For consultation, please reach out through our ",[15,1362,1365],{"href":1363,"rel":1364},"https:\u002F\u002Fwww.hexabase.com\u002Fcontact-us\u002F",[19],"contact page",[1367,1368,1369],"style",{},"html pre.shiki code .s0U2E, html code.shiki .s0U2E{--shiki-default:#F7768E}html pre.shiki code .sAklC, html code.shiki .sAklC{--shiki-default:#89DDFF}html pre.shiki code .sPY7s, html code.shiki .sPY7s{--shiki-default:#9ECE6A}html pre.shiki code .sOJ5S, html code.shiki .sOJ5S{--shiki-default:#FF9E64}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sE3pS, html code.shiki .sE3pS{--shiki-default:#C0CAF5}html pre.shiki code .sT800, html code.shiki .sT800{--shiki-default:#E0AF68}html pre.shiki code .sgJMe, html code.shiki .sgJMe{--shiki-default:#9ABDF5}html pre.shiki code .sd1Qi, html code.shiki 
.sd1Qi{--shiki-default:#BB9AF7}",{"title":138,"searchDepth":160,"depth":160,"links":1371},[1372,1373,1377,1382,1386,1392],{"id":31,"depth":160,"text":32},{"id":116,"depth":160,"text":117,"children":1374},[1375,1376],{"id":121,"depth":171,"text":122},{"id":361,"depth":171,"text":362},{"id":481,"depth":160,"text":482,"children":1378},[1379,1380,1381],{"id":499,"depth":171,"text":500},{"id":541,"depth":171,"text":542},{"id":593,"depth":171,"text":594},{"id":640,"depth":160,"text":641,"children":1383},[1384,1385],{"id":653,"depth":171,"text":654},{"id":891,"depth":171,"text":892},{"id":1134,"depth":160,"text":1135,"children":1387},[1388,1389,1390,1391],{"id":1153,"depth":171,"text":1154},{"id":1199,"depth":171,"text":1200},{"id":1231,"depth":171,"text":1232},{"id":1250,"depth":171,"text":1251},{"id":1309,"depth":160,"text":1310},"2026-05-27","Master Kubernetes monitoring with Prometheus: from installation and PromQL queries to Alertmanager configuration and production best practices.","md","en",{},"\u002Fblog\u002Fen\u002Fprometheus-monitoring-kubernetes-guide",{"title":5,"description":1394},"blog\u002Fen\u002Fprometheus-monitoring-kubernetes-guide",[20,1402,1403,1404,1405,489,76],"Kubernetes","Monitoring","CNCF","Observability","XRIxTrv27G1P1zUPoxNYgoGwQlwSD93F6ziqhM81E28",1779964619037]