
The Complete Guide to Monitoring Kubernetes with Prometheus

Running Kubernetes in production without proper monitoring is flying blind. Prometheus, a CNCF Graduated project, has established itself as the de facto standard for cloud-native monitoring. Whether you are running a large-scale cluster or a lightweight K3s-based platform like Kubo, Prometheus provides powerful monitoring capabilities with minimal overhead. This guide covers everything from initial setup to production-grade operations.

Prometheus Architecture and Core Concepts

Prometheus was originally developed at SoundCloud in 2012 and became the second project to join the CNCF after Kubernetes. As described in the official overview, its distinguishing characteristic is a pull-based metrics collection model: the Prometheus server periodically scrapes HTTP endpoints from monitored targets and stores the data as time series.
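
To make the pull model concrete, here is a minimal sketch of the kind of text a target might return when Prometheus scrapes its /metrics endpoint. The helper name `render_metric` and the sample values are assumptions for illustration, and label-value escaping is deliberately left out.

```python
def render_metric(name, help_text, metric_type, series):
    """Render one metric family in the Prometheus text exposition format.

    Sketch only: label values are not escaped, so avoid quotes or
    backslashes in them.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in series:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Illustrative values, not taken from a real application.
text = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "GET", "status": "200"}, 1027)],
)
print(text)
```

An application would serve this text over HTTP, and the Prometheus server would fetch it on every scrape interval. In practice the official client libraries generate this output for you.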

The data model is multi-dimensional -- each time series is identified by a metric name and a set of key-value pairs called labels. For example, http_requests_total{method="GET", status="200"} allows filtering and aggregation across multiple dimensions from a single metric.
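
As a rough mental model (not how Prometheus stores data internally), each series can be pictured as a metric name plus a dictionary of labels. The sketch below uses made-up sample values and hypothetical helpers `select` and `sum_by` to show how one metric supports filtering and aggregation by label.

```python
from collections import defaultdict

# Illustrative samples only: (metric name, labels, current value).
samples = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027.0),
    ("http_requests_total", {"method": "GET", "status": "500"}, 3.0),
    ("http_requests_total", {"method": "POST", "status": "200"}, 214.0),
]


def select(samples, name, **matchers):
    """Keep series with the given name whose labels equal every matcher."""
    return [
        (labels, value)
        for metric, labels, value in samples
        if metric == name and all(labels.get(k) == v for k, v in matchers.items())
    ]


def sum_by(series, label):
    """Add up values per label value, similar in spirit to `sum by (label)`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)


get_requests = select(samples, "http_requests_total", method="GET")
by_status = sum_by(select(samples, "http_requests_total"), "status")
print(len(get_requests), by_status)
```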

The core components include:

  • Prometheus Server: Scrapes and stores time series data
  • Alertmanager: Handles alert routing, deduplication, and notifications (Slack, PagerDuty, email)
  • Pushgateway: An intermediary for short-lived batch jobs to push metrics
  • Exporters: Node Exporter (hardware/OS metrics), kube-state-metrics (Kubernetes object states), and many others
  • Client Libraries: Available for Go, Java, Python, Ruby, and more

According to the Sysdig comprehensive guide, Prometheus servers are autonomous -- they run as standalone Go binaries with no dependency on distributed storage, making deployment and operations remarkably simple.

If you are interested in AI-powered operations automation, see how Captain.AI enhances Kubernetes operational efficiency.

Deploying Prometheus on Kubernetes

Declarative Management with Prometheus Operator

For production environments, the Prometheus Operator is the recommended approach. It uses Kubernetes Custom Resource Definitions (CRDs) to declaratively manage the entire Prometheus configuration through manifests.

yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s-prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: platform
  retention: 30d
  resources:
    requests:
      memory: 400Mi

The quickest path to a full monitoring stack is the kube-prometheus-stack Helm chart:

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

This chart bundles Prometheus Server, Alertmanager, Grafana, Node Exporter, and kube-state-metrics for an out-of-the-box monitoring stack.

Kubernetes Service Discovery

Prometheus integrates with the Kubernetes API to automatically discover Pods, Services, Endpoints, and Nodes. With the Prometheus Operator, ServiceMonitor resources let you add monitoring targets by label instead of editing scrape configuration by hand:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
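metadata:
  name: my-app-monitor
  labels:
    team: platform  # must match the serviceMonitorSelector of the Prometheus resource
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics  # name of the port on the selected Service
    interval: 15s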
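
The matchLabels selector picks every Service whose labels contain all of the listed key-value pairs; extra labels on the Service do not matter. The sketch below illustrates that rule with hypothetical Service names and labels. It is a simplified model, not the operator's actual code.

```python
def matches(selector, labels):
    """matchLabels semantics: every selector pair must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical Services; names and labels are made up for illustration.
services = {
    "my-app": {"app": "my-app", "tier": "backend"},
    "my-app-canary": {"app": "my-app-canary"},
    "db": {"app": "postgres"},
}

selector = {"app": "my-app"}
selected = sorted(name for name, labels in services.items() if matches(selector, labels))
print(selected)
```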

Kubo is built on K3s with strong affinity for the CNCF ecosystem, making Prometheus deployment seamless.

Mastering PromQL: Practical Query Examples

PromQL (Prometheus Query Language) is the powerful query language that unlocks the full potential of Prometheus' multi-dimensional data model. As emphasized by the Logz.io guide, well-designed PromQL queries are the key to proactive monitoring.
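
The queries in this section can be run in the Prometheus UI or sent to the HTTP API at /api/v1/query. The following sketch builds the request URL and parses an instant-vector response. The base URL `http://localhost:9090` and the sample response are assumptions for illustration, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body):
    """Return (labels, float value) pairs from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each result carries [timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hand-written sample response, not real cluster output.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"service": "api"}, "value": [1700000000, "12.5"]}],
    },
})

url = build_query_url("http://localhost:9090", "sum(rate(http_requests_total[5m])) by (service)")
rows = parse_instant_vector(sample)
print(url)
print(rows)
```

To query a live server, fetch `url` with `urllib.request.urlopen` and pass the response body to `parse_instant_vector`.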

CPU and Memory Utilization

promql
# Node CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

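# Container memory usage as percentage of limits
# (matching on container as well avoids many-to-many errors in multi-container pods)
container_memory_working_set_bytes{container!="POD",container!=""}
  / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"} * 100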
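
The CPU query works because node_cpu_seconds_total is a counter: functions such as rate() and irate() turn it into a per-second increase, and subtracting the idle share from 100 gives usage. The sketch below shows a simplified version of that arithmetic with made-up samples. Unlike Prometheus it does not extrapolate to the edges of the window, and `simple_rate` is a hypothetical helper name.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp, value) counter samples.

    Simplification: handles counter resets but does no extrapolation
    to the window boundaries.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so count curr in full.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Illustrative idle-seconds counter for one CPU core, scraped every 15s.
idle = [(0, 100.0), (15, 113.5), (30, 127.0), (45, 140.5), (60, 154.0)]
idle_fraction = simple_rate(idle)            # share of each second spent idle
cpu_usage_percent = 100 - idle_fraction * 100
print(round(cpu_usage_percent, 2))
```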

Request Rate and Error Rate (RED Method)

promql
# Request rate (per second)
sum(rate(http_requests_total[5m])) by (service)

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service) * 100

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
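
It helps to know that histogram_quantile estimates a value from cumulative bucket counts by linear interpolation, so its accuracy depends on the bucket boundaries you choose. The sketch below approximates that calculation for classic histograms using made-up bucket counts. It is a simplified illustration, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative cumulative counts for le="0.1", "0.5", "1", "+Inf".
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
p50 = histogram_quantile(0.5, buckets)
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p50, p95, p99)
```

In this example the 99th percentile lands in the +Inf bucket, so the estimate is capped at the highest finite boundary. That is a hint that the histogram needs larger buckets to measure tail latency.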

Kubernetes-Specific Queries

promql
# Count of Pending pods
sum(kube_pod_status_phase{phase="Pending"})

# Detect frequently restarting containers (a common sign of CrashLoopBackOff)
increase(kube_pod_container_status_restarts_total[1h]) > 5

# PVC usage percentage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

Alert Design with Alertmanager

Reliable operations require a well-designed alerting strategy. By combining Prometheus alerting rules with Alertmanager, you can achieve early fault detection and targeted notifications.

Defining Alert Rules

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
  - name: kubernetes.rules
    rules:
    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
    - alert: HighMemoryUsage
      expr: |
        container_memory_working_set_bytes{container!="POD",container!=""}
        / on(namespace,pod,container) kube_pod_container_resource_limits{resource="memory"} > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Memory usage exceeds 90%"
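
The `for` field keeps an alert in a pending state until its expression has been true continuously for the given duration, which filters out short spikes. The sketch below models that behavior as a small state machine. The evaluation interval and durations are assumptions for illustration, and `alert_states` is a hypothetical helper rather than Prometheus code.

```python
def alert_states(condition_by_eval, eval_interval, hold_for):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation.

    eval_interval and hold_for are in seconds.
    """
    states = []
    active_since = None
    for index, is_true in enumerate(condition_by_eval):
        now = index * eval_interval
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= hold_for else "pending")
    return states


# Expression results for seven evaluations, one per minute, with for: 3m.
states = alert_states(
    [True, True, False, True, True, True, True],
    eval_interval=60,
    hold_for=180,
)
print(states)
```

Note how the single false evaluation in the middle restarts the wait, so the alert only fires at the last evaluation.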

Alertmanager Notification Configuration

yaml
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - matchers:  # replaces the deprecated match field (Alertmanager 0.22+)
    - severity="critical"
    receiver: 'pagerduty-critical'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'  # needs api_url here or slack_api_url under global
    send_resolved: true
- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '<service-key>'
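
With this configuration, critical alerts go to PagerDuty and everything else falls back to the Slack receiver on the root route. The sketch below models that decision for simple equality matchers. It ignores the `continue` option, regex matchers, and grouping, and the dictionary layout and `pick_receiver` helper are assumptions for illustration.

```python
def pick_receiver(route, labels, inherited=None):
    """Walk a routing tree and return the receiver name for an alert.

    The first matching child route wins; if none match, the current
    route's receiver is used.
    """
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            return pick_receiver(child, labels, receiver)
    return receiver


# Mirrors the routing tree above as a plain dictionary.
route = {
    "receiver": "slack-notifications",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    ],
}

critical = pick_receiver(route, {"alertname": "HighMemoryUsage", "severity": "critical"})
warning = pick_receiver(route, {"alertname": "PodCrashLooping", "severity": "warning"})
print(critical, warning)
```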

As the Apptio guide notes, running Alertmanager as a separate process ensures alerting continues to function even when Prometheus itself encounters issues.

By integrating with Captain.AI, you can build workflows where AI automatically analyzes root causes when alerts fire and suggests remediation actions.

Production Best Practices

Drawing from the Trilio best practices guide and the Plural guide, here are recommended configurations for production environments.

High Availability

yaml
spec:
  replicas: 2
  shards: 1
  replicaExternalLabelName: __replica__

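Run at least two replicas so that scraping continues if one instance fails. Each replica scrapes the same targets independently, so use Thanos or Cortex to deduplicate their data, retain it long term, and provide a global query view.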

Metrics Selection and Optimization

As the Tasrie IT guide warns, indiscriminate collection of all available metrics leads to excessive storage costs. Adopt these strategies:

  • Control label cardinality: Avoid high-cardinality labels such as user IDs or request IDs
  • Use Recording Rules: Pre-compute frequently used queries to reduce query load
  • Set appropriate retention: Keep 15-30 days locally and offload to remote storage for long-term retention
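
To see why high-cardinality labels are costly, consider that the number of series for a metric can grow up to the product of the distinct values of each label. The sketch below works through that arithmetic with made-up label counts. It is a worst-case upper bound, not a measurement.

```python
from math import prod


def series_upper_bound(label_value_counts):
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label cardinalities for a single metric.
bounded = series_upper_bound({"method": 5, "status": 10, "service": 20})
with_user_id = series_upper_bound(
    {"method": 5, "status": 10, "service": 20, "user_id": 100_000}
)
print(bounded, with_user_id)
```

Adding a single user_id label multiplies the worst case by the number of users, which is why identifiers like this belong in logs or traces rather than in metric labels.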

Security Hardening

  • Restrict Prometheus access with NetworkPolicies
  • Apply the principle of least privilege with RBAC
  • Authenticate and encrypt /metrics endpoints

Remote Storage Integration

yaml
spec:
  remoteWrite:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queueConfig:
      maxSamplesPerSend: 1000
      batchSendDeadline: 5s

Conclusion

Prometheus serves as the backbone of Kubernetes monitoring, covering metrics collection, querying, and alerting, with Grafana typically handling visualization. The key takeaways from this guide are:

  1. Prometheus Operator for declarative installation and management
  2. Service Discovery for dynamic target detection
  3. PromQL for flexible querying and analysis
  4. Alertmanager for systematic alert design
  5. HA configuration and remote storage for production-grade reliability

Kubo is built on K3s with strong affinity for the CNCF ecosystem, providing an environment where monitoring tools like Prometheus can be deployed and utilized immediately. If you need help building or operating Kubernetes environments, consider Kubo.

For those interested in AI-powered Kubernetes operations automation, explore how Captain.AI delivers intelligent operational support. For consultation, please reach out through our contact page.
