Running Kubernetes in production without proper monitoring is flying blind. Prometheus, a CNCF Graduated project, has established itself as the de facto standard for cloud-native monitoring. Whether you are running a large-scale cluster or a lightweight K3s-based platform like Kubo, Prometheus provides powerful monitoring capabilities with minimal overhead. This guide covers everything from initial setup to production-grade operations.
Prometheus Architecture and Core Concepts
Prometheus was originally developed at SoundCloud in 2012 and became the second project to join the CNCF after Kubernetes. As described in the official overview, its distinguishing characteristic is a pull-based metrics collection model: the Prometheus server periodically scrapes HTTP endpoints from monitored targets and stores the data as time series.
The data model is multi-dimensional -- each time series is identified by a metric name and a set of key-value pairs called labels. For example, http_requests_total{method="GET", status="200"} allows filtering and aggregation across multiple dimensions from a single metric.
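To make the label model concrete, here is a minimal Python sketch (not Prometheus code) that treats each series as a metric name plus a label set, then filters and aggregates by label. The sample values and the helper names `select` and `sum_by` are made up for illustration.

```python
from collections import defaultdict

# Each series is (metric name, label set, current value); values are invented.
SERIES = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027.0),
    ("http_requests_total", {"method": "GET", "status": "500"}, 3.0),
    ("http_requests_total", {"method": "POST", "status": "200"}, 214.0),
]


def select(series, name, **matchers):
    """Return series with the given name whose labels equal every matcher."""
    return [
        s for s in series
        if s[0] == name and all(s[1].get(k) == v for k, v in matchers.items())
    ]


def sum_by(series, label):
    """Sum values grouped by one label, similar in spirit to sum by (label)."""
    totals = defaultdict(float)
    for _, labels, value in series:
        totals[labels.get(label, "")] += value
    return dict(totals)
```

Calling `sum_by(select(SERIES, "http_requests_total"), "method")` groups the three series into one total per HTTP method, which is the kind of slicing a single labeled metric allows.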
The core components include:
- Prometheus Server: Scrapes and stores time series data
- Alertmanager: Handles alert routing, deduplication, and notifications (Slack, PagerDuty, email)
- Pushgateway: An intermediary for short-lived batch jobs to push metrics
- Exporters: Node Exporter (hardware/OS metrics), kube-state-metrics (Kubernetes object states), and many others
- Client Libraries: Available for Go, Java, Python, Ruby, and more
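Whatever the component, what a scrape target serves over HTTP is plain text in the Prometheus exposition format. The sketch below renders that format by hand to show its shape; `render_metric` is a hypothetical helper, it skips label-value escaping and HELP lines, and a real service would normally use a client library instead.

```python
def render_metric(name, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    Simplification: label values are not escaped and no HELP line is written.
    """
    lines = [f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total",
        "counter",
        [({"method": "GET", "status": "200"}, 1027)],
    ))
```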
According to the Sysdig comprehensive guide, Prometheus servers are autonomous -- they run as standalone Go binaries with no dependency on distributed storage, making deployment and operations remarkably simple.
If you are interested in AI-powered operations automation, see how Captain.AI enhances Kubernetes operational efficiency.
Deploying Prometheus on Kubernetes
Declarative Management with Prometheus Operator
For production environments, the Prometheus Operator is the recommended approach. It uses Kubernetes Custom Resource Definitions (CRDs) to declaratively manage the entire Prometheus configuration through manifests.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s-prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: platform
  retention: 30d
  resources:
    requests:
      memory: 400Mi
The quickest path to a full monitoring stack is the kube-prometheus-stack Helm chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
This chart bundles Prometheus Server, Alertmanager, Grafana, Node Exporter, and kube-state-metrics for an out-of-the-box monitoring stack.
Kubernetes Service Discovery
Prometheus integrates with the Kubernetes API to automatically discover Pods, Services, Endpoints, and Nodes. Using ServiceMonitor resources, you can flexibly add monitoring targets based on labels:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    team: platform  # must match serviceMonitorSelector on the Prometheus resource
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
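The label-based selection used by ServiceMonitor follows ordinary Kubernetes `matchLabels` semantics: an object is selected when it carries every key/value pair in the selector. A minimal Python sketch of that rule, using made-up Service names and labels:

```python
def match_labels(selector, labels):
    """True if every key/value in selector is present in labels.

    An empty selector matches everything, as with Kubernetes label selectors.
    """
    return all(labels.get(k) == v for k, v in selector.items())


# Hypothetical Services and their labels.
services = {
    "my-app": {"app": "my-app", "tier": "backend"},
    "other": {"app": "other"},
}

selected = sorted(
    name for name, labels in services.items()
    if match_labels({"app": "my-app"}, labels)
)
```

Extra labels on the Service (such as `tier` here) do not prevent a match; only the pairs named in the selector are compared.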
Kubo is built on K3s with strong affinity for the CNCF ecosystem, making Prometheus deployment seamless.
Mastering PromQL: Practical Query Examples
PromQL (Prometheus Query Language) is the powerful query language that unlocks the full potential of Prometheus' multi-dimensional data model. As emphasized by the Logz.io guide, well-designed PromQL queries are the key to proactive monitoring.
CPU and Memory Utilization
# Node CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Container memory usage as a percentage of its limit
container_memory_working_set_bytes{container!="POD",container!=""}
  / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"} * 100
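The CPU query above relies on the per-second rate of a counter. As a rough illustration of what rate() computes, the Python sketch below sums the increases between consecutive samples and treats any drop as a counter reset. It is a simplification: the real rate() also extrapolates to the window boundaries, and irate() uses only the last two samples in the window. The function name `simple_rate` is hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (the counter restarted at 0).
    Unlike PromQL's rate(), no extrapolation to the window edges is done.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # After a reset, the whole new value counts as increase.
        increase += (value - prev) if value >= prev else value
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

Handling resets this way is why counters, rather than gauges, are the usual input to rate-style queries: a process restart does not produce a negative rate.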
Request Rate and Error Rate (RED Method)
# Request rate (per second)
sum(rate(http_requests_total[5m])) by (service)
# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  / sum(rate(http_requests_total[5m])) by (service) * 100
# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
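The P99 query above estimates a quantile from cumulative bucket counters. The following Python sketch shows the linear interpolation idea behind histogram_quantile under simplifying assumptions: classic buckets only, `0 < q <= 1`, at least one finite bucket, and a lower bound of 0 for the first bucket. The bucket values in the example are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    ending with the le="+Inf" bucket (math.inf). Assumes 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets in seconds.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
```

Because the result is interpolated within a bucket, its accuracy depends on bucket boundaries being chosen close to the latencies you care about.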
Kubernetes-Specific Queries
# Count of Pending pods
sum(kube_pod_status_phase{phase="Pending"})
# Detect frequent container restarts (a typical sign of CrashLoopBackOff)
increase(kube_pod_container_status_restarts_total[1h]) > 5
# PVC usage percentage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
Alert Design with Alertmanager
Reliable operations require a well-designed alerting strategy. By combining Prometheus alerting rules with Alertmanager, you can achieve early fault detection and targeted notifications.
Defining Alert Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes.rules
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
        - alert: HighMemoryUsage
          expr: |
            container_memory_working_set_bytes{container!="POD",container!=""}
              / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"} > 0.9
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Memory usage exceeds 90%"
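The `for` field in the rules above delays firing until the expression has held continuously for the given duration. A small Python sketch of that state progression, assuming a fixed evaluation interval (the 60-second interval in the example is an assumption, not something the rules specify):

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation, True when expr returns a result.
    The alert is 'pending' until the condition has held for at least for_s
    seconds, then 'firing'; any False evaluation resets it to 'inactive'.
    """
    states = []
    active_for = None  # seconds the condition has held continuously
    for ok in condition_results:
        if not ok:
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + eval_interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states
```

With `alert_states([True, True, True, True, False, True], 60, 120)`, the alert stays pending for the first two evaluations, fires on the third, and starts over after the single False evaluation, which is how `for` filters out short spikes.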
Alertmanager Notification Configuration
route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        api_url: '<slack-webhook-url>'  # or set slack_api_url under global
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<service-key>'
As the Apptio guide notes, running Alertmanager as a separate process ensures alerting continues to function even when Prometheus itself encounters issues.
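To illustrate how the routing tree in the configuration above picks a receiver, here is a simplified Python sketch: the first child route whose `match` labels all agree wins, otherwise the parent's receiver is used, and alerts sharing the same `group_by` values are batched together. It ignores `continue`, regex matchers, and timing settings; `pick_receiver` and `group_key` are hypothetical helper names.

```python
# Mirrors the route section of the Alertmanager configuration, as plain dicts.
ROUTE = {
    "receiver": "slack-notifications",
    "group_by": ["alertname", "namespace"],
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    ],
}


def pick_receiver(route, labels):
    """Walk the routing tree: first matching child wins, else this node's receiver."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(child, labels)
    return route["receiver"]


def group_key(route, labels):
    """Alerts with the same values for the group_by labels share a notification."""
    return tuple(labels.get(name, "") for name in route["group_by"])
```

An alert labeled `severity: critical` resolves to `pagerduty-critical`, while anything else falls back to `slack-notifications`.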
By integrating with Captain.AI, you can build workflows where AI automatically analyzes root causes when alerts fire and suggests remediation actions.
Production Best Practices
Drawing from the Trilio best practices guide and the Plural guide, here are recommended configurations for production environments.
High Availability
spec:
  replicas: 2
  shards: 1
  replicaExternalLabelName: __replica__
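Run at least two replicas. Each replica scrapes the same targets independently, and the replica label lets a layer such as Thanos or Cortex deduplicate the results while also providing long-term storage and a global query view.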
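As a rough picture of what deduplication across replicas means, the Python sketch below collapses series that differ only in the replica label. It is a conceptual simplification, not how Thanos or Cortex actually implement it, and the replica names and values are invented.

```python
def deduplicate(samples, replica_label="__replica__"):
    """Collapse series that differ only in the replica label.

    samples: list of (labels_dict, value). The first replica seen wins.
    """
    merged = {}
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        merged.setdefault(key, value)
    return [(dict(key), value) for key, value in merged.items()]


# Two hypothetical replicas reporting the same series.
SAMPLES = [
    ({"job": "api", "__replica__": "prometheus-0"}, 1.0),
    ({"job": "api", "__replica__": "prometheus-1"}, 1.0),
    ({"job": "db", "__replica__": "prometheus-0"}, 2.0),
]
```

The important design point is that the replica label must be the only difference between the copies; any other per-replica label would make the series look distinct and defeat deduplication.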
Metrics Selection and Optimization
As the Tasrie IT guide warns, indiscriminate collection of all available metrics leads to excessive storage costs. Adopt these strategies:
- Control label cardinality: Avoid high-cardinality labels such as user IDs or request IDs
- Use Recording Rules: Pre-compute frequently used queries to reduce query load
- Set appropriate retention: Keep 15-30 days locally and offload to remote storage for long-term retention
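The cardinality warning is easy to quantify: in the worst case, one metric produces a series for every combination of label values. A short Python sketch with made-up label counts shows how a single unbounded label changes the picture.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label value counts for one metric.
bounded = worst_case_series({"method": 5, "status": 10, "service": 20})
with_user_id = worst_case_series(
    {"method": 5, "status": 10, "service": 20, "user_id": 100_000}
)
```

Here `bounded` is 1,000 series, while adding a `user_id` label with 100,000 values raises the upper bound to 100,000,000. Real workloads rarely hit every combination, but the bound shows why per-user or per-request labels are avoided.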
Security Hardening
- Restrict Prometheus access with NetworkPolicies
- Apply the principle of least privilege with RBAC
- Authenticate and encrypt /metrics endpoints
Remote Storage Integration
remoteWrite:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queueConfig:
      maxSamplesPerSend: 1000
      batchSendDeadline: 5s
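The two queue settings above interact: a batch is sent once it reaches `maxSamplesPerSend`, or once `batchSendDeadline` has passed, whichever comes first. The Python sketch below models that with simulated arrival times. It is event-driven for simplicity (the deadline is checked only when a new sample arrives), whereas the real queue uses a timer; `batch_samples` is a hypothetical name.

```python
def batch_samples(arrivals, max_samples_per_send, batch_send_deadline_s):
    """Group samples into sends.

    arrivals: sample arrival times in seconds, ascending.
    A batch is sent when it is full, or when a new sample arrives after the
    deadline (measured from the batch's first sample) has passed.
    Returns a list of batches, each a list of arrival times.
    """
    batches, current = [], []
    for t in arrivals:
        if current and t - current[0] >= batch_send_deadline_s:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_samples_per_send:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever remains at the end
    return batches
```

Larger batches reduce request overhead under load, while the deadline bounds how long samples wait when traffic is light.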
Conclusion
Prometheus serves as the backbone of Kubernetes monitoring, providing a unified platform for metrics collection, querying, and alerting, with Grafana commonly handling visualization. The key takeaways from this guide are:
- Prometheus Operator for declarative installation and management
- Service Discovery for dynamic target detection
- PromQL for flexible querying and analysis
- Alertmanager for systematic alert design
- HA configuration and remote storage for production-grade reliability
Kubo is built on K3s with strong affinity for the CNCF ecosystem, providing an environment where monitoring tools like Prometheus can be deployed and utilized immediately. If you need help building or operating Kubernetes environments, consider Kubo.
For those interested in AI-powered Kubernetes operations automation, explore how Captain.AI delivers intelligent operational support. For consultation, please reach out through our contact page.