Choosing a Monitoring Stack#

Monitoring is not optional. Without metrics, you are guessing. The question is not whether to monitor but which stack to use. The right choice depends on your cost tolerance, operational capacity, retention requirements, and how much you value control versus convenience.

Decision Criteria#

Before comparing tools, clarify what matters to your organization:

  • Cost model: Are you optimizing for infrastructure spend or engineering time? Self-managed tools cost less in licensing but more in operational hours. SaaS tools cost more in subscription fees but less in engineering effort.
  • Operational burden: Who manages the monitoring system? Do you have an infrastructure team, or are developers responsible for everything?
  • Data retention: Do you need metrics for 15 days, 90 days, or years? Long retention changes the equation significantly.
  • Query capability: Does your team know PromQL? Do they need ad-hoc analysis or mostly pre-built dashboards?
  • Alerting requirements: Simple threshold alerts, or complex multi-signal alerts with routing and escalation?
  • Team expertise: An organization fluent in Prometheus wastes that investment by switching to Datadog. An organization with no Prometheus experience faces a learning curve.

Options at a Glance#

| Capability | Prometheus + Grafana | Prometheus + Thanos/Mimir | VictoriaMetrics | Datadog | Cloud-Native | Grafana Cloud |
|---|---|---|---|---|---|---|
| Cost model | Infrastructure only | Infrastructure only | Infrastructure only | Per host ($15-23/mo) | Per metric/API call | Per series/GB |
| Operational burden | High | Very high | Medium | None | Low | Low |
| Query language | PromQL | PromQL | MetricsQL (PromQL-compatible) | Datadog query language | Vendor-specific | PromQL, LogQL |
| Default retention | 15 days (local disk) | Unlimited (object storage) | Unlimited (configurable) | 15 months | Varies (15 days - 15 months) | Plan-dependent |
| HA built-in | No (requires federation) | Yes | Yes (cluster mode) | Yes | Yes | Yes |
| Multi-cluster | Federation (limited) | Yes (global view) | Yes (cluster mode) | Yes | Per-account | Yes |
| APM/Tracing | No (separate tools) | No (separate tools) | No (separate tools) | Yes (integrated) | Varies | Yes (Tempo) |
| Vendor lock-in | None | None | Low | High | High | Low-Medium |

Prometheus + Grafana (Self-Managed)#

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull-based model, scraping metrics from endpoints at configurable intervals, and stores time series data on local disk. Grafana provides visualization. Alertmanager handles alert routing.

Choose Prometheus + Grafana when:

  • Cost is a primary concern and you want to avoid per-host or per-series fees.
  • Your team already knows PromQL and the Prometheus ecosystem.
  • You need full control over data residency, retention policies, and scrape configuration.
  • You operate on-premises or in a hybrid environment where SaaS is not an option.
  • Retention of 15-30 days is sufficient for your use case.
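
To check whether 15-30 days of retention fits on local disk, you can apply the rule of thumb from the Prometheus storage documentation: needed disk is roughly retention time in seconds times ingested samples per second times bytes per sample, at about 1-2 bytes per sample. The sketch below is only an estimate. The series count and the 15-second scrape interval are illustrative assumptions, and the function name is hypothetical.

```python
def prometheus_disk_bytes(active_series: int,
                          scrape_interval_s: float,
                          retention_days: float,
                          bytes_per_sample: float = 2.0) -> float:
    """Rough local-disk estimate: retention * ingestion rate * bytes per sample.

    bytes_per_sample defaults to the pessimistic end of the 1-2 byte range.
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed workload: 50,000 active series scraped every 15 seconds.
    for days in (15, 30):
        gib = prometheus_disk_bytes(50_000, 15, days) / 2**30
        print(f"{days} days of retention: ~{gib:.1f} GiB")
```

Treat the result as a floor rather than a budget: it ignores the write-ahead log, compaction headroom, and series churn.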

Limitations to understand:

  • You manage everything: deployment, storage, upgrades, HA, backup.
  • Local disk retention is limited. Running out of disk means losing metrics or provisioning larger volumes.
  • No built-in HA. Two Prometheus instances scraping the same targets record slightly different samples because their scrape timing differs, so their data cannot simply be merged. Deduplication requires Thanos or a similar layer.
  • Federation for multi-cluster is cumbersome and lossy. You lose label granularity when federating.
  • Cardinality explosions (too many unique label combinations) can crash Prometheus or cause severe memory pressure.
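
The cardinality risk is easiest to see as arithmetic. In the worst case, one metric produces one series per combination of label values, so the series count is the product of the distinct values per label. The sketch below uses hypothetical label names and counts for a single HTTP request counter; real metrics rarely hit every combination, so read the result as an upper bound.

```python
from math import prod


def series_cardinality(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a single HTTP request counter.
bounded = {"method": 5, "status": 6, "endpoint": 40}
# Adding one unbounded label (a per-user ID) multiplies everything.
with_user_id = {**bounded, "user_id": 10_000}

if __name__ == "__main__":
    print("bounded labels:", series_cardinality(bounded))
    print("with user_id:  ", series_cardinality(with_user_id))
```

A single unbounded label turns a manageable metric into millions of potential series, which is why IDs, emails, and raw URLs usually do not belong in labels.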

Prometheus + Thanos or Mimir (Scaled Self-Managed)#

Thanos and Grafana Mimir extend Prometheus with long-term storage (S3, GCS, Azure Blob), a global query view across clusters, and high availability through deduplication. Thanos typically runs as a sidecar next to each Prometheus instance; Mimir instead receives samples via remote write and keeps them in its own distributed storage rather than relying on Prometheus’s local TSDB.
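
To make "high availability through deduplication" concrete, here is a deliberately simplified sketch of merging samples from two Prometheus replicas that scrape the same target. It is not how Thanos or Mimir implement deduplication. The function name, the bucketing by scrape interval, and the "replica A wins" rule are all assumptions chosen for illustration.

```python
def deduplicate(replica_a: dict[int, float],
                replica_b: dict[int, float],
                interval_s: int) -> dict[int, float]:
    """Merge two replicas' samples (timestamp -> value) into one series.

    Timestamps are bucketed by scrape interval. Replica A wins when both
    replicas have a sample in the same bucket; replica B fills A's gaps.
    """
    merged: dict[int, tuple[int, float]] = {}
    for samples in (replica_b, replica_a):  # A is applied last, so it wins
        for ts, value in samples.items():
            merged[ts // interval_s] = (ts, value)
    return dict(sorted(merged.values()))


if __name__ == "__main__":
    # Replica A missed the scrape around t=30; replica B scrapes 3s later.
    a = {0: 1.0, 15: 2.0, 45: 4.0}
    b = {3: 1.0, 18: 2.0, 33: 3.0, 48: 4.0}
    print(deduplicate(a, b, interval_s=15))
```

The point is that neither replica is complete on its own, but a layer that understands both are copies of the same series can present one gap-free result.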

Choose Prometheus + Thanos/Mimir when:

  • You need months or years of metric retention for capacity planning, compliance, or business analytics.
  • Multi-cluster aggregation with a single global query endpoint is required.
  • High availability for metrics is a hard requirement (regulated industries, SLA commitments).
  • You want to keep the Prometheus/PromQL ecosystem but need it at scale.

Tradeoffs to understand:

  • Significant operational overhead. Thanos adds 5-7 components, depending on the deployment (sidecar, store gateway, compactor, querier, query frontend, ruler, receiver). Mimir is similarly complex.
  • Object storage costs. Storing months of metrics in S3/GCS is cheap per GB, but the volume of metric data accumulates.
  • Debugging becomes harder. When a query is slow, the problem could be in Prometheus, the store gateway, object storage, the compactor, or the query frontend.
  • Deployment and upgrade complexity is substantial. These are distributed systems with their own failure modes.

VictoriaMetrics#

VictoriaMetrics is a Prometheus-compatible monitoring solution that aims for better compression, faster queries, and lower resource usage than Prometheus. It accepts data via Prometheus remote write, supports PromQL (with extensions called MetricsQL), and can operate in single-node or cluster mode.

Choose VictoriaMetrics when:

  • Prometheus is consuming too much memory or disk for your data volume.
  • You want better storage efficiency (VictoriaMetrics typically achieves 7-10x better compression than Prometheus).
  • You need a drop-in Prometheus replacement without changing dashboards, alerts, or scrape configs.
  • You want long-term retention without the complexity of Thanos/Mimir (single-node VictoriaMetrics handles surprisingly large workloads).
  • Cluster mode is needed for HA and horizontal scaling, but you want something simpler to operate than Thanos.

Limitations to understand:

  • Smaller community than Prometheus. Fewer blog posts, fewer StackOverflow answers, fewer battle-tested configurations to copy.
  • MetricsQL extensions are non-standard. Dashboards that use MetricsQL-specific functions will not work with vanilla Prometheus or Thanos.
  • Enterprise features (downsampling, access control, anomaly detection) require a paid license.
  • Vendor risk. VictoriaMetrics is developed by a single company, not a broad community project (though it is open-source).

Datadog#

Datadog is a SaaS observability platform that includes metrics, logs, APM, distributed tracing, RUM (Real User Monitoring), synthetic tests, and security monitoring in a single product.

Choose Datadog when:

  • You want zero operational burden for monitoring infrastructure. No servers to manage, no upgrades to plan, no storage to provision.
  • Budget allows for per-host pricing ($15/host/month for infrastructure, $23/host/month for infrastructure + APM).
  • You need integrated APM, distributed tracing, and RUM alongside infrastructure metrics.
  • Your organization is large enough that a single pane of glass across teams provides significant value.
  • Time-to-value matters more than cost optimization. Datadog works out of the box with hundreds of integrations.

Limitations to understand:

  • Expensive at scale. At the APM tier, 100 hosts cost $2,300/month before adding logs, synthetic tests, or security monitoring. Custom metrics above the included allocation cost $5 per 100 metrics/month.
  • Vendor lock-in. Dashboards, alerts, monitors, and SLOs are defined in Datadog’s format. Moving to another platform requires rebuilding everything.
  • Data leaves your infrastructure. Metrics, traces, and potentially log data are stored in Datadog’s cloud. This may conflict with data residency or sovereignty requirements.
  • Datadog’s query language is proprietary. PromQL skills do not transfer.
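
The per-host figures above translate into a simple estimate. The sketch below uses only the prices quoted in this section ($15 and $23 per host, $5 per 100 custom metrics over the allocation). It is an illustration of the arithmetic, not Datadog's billing logic: the function name is hypothetical, and it ignores logs, synthetics, committed-use discounts, and how the included allocation is calculated.

```python
def datadog_monthly_cost(hosts: int,
                         apm: bool = False,
                         custom_metrics_over_allocation: int = 0) -> float:
    """Estimate monthly cost from the per-host prices quoted in this article."""
    infra_per_host = 15.0        # infrastructure only
    infra_apm_per_host = 23.0    # infrastructure + APM
    price_per_100_custom = 5.0   # custom metrics beyond the included allocation

    per_host = infra_apm_per_host if apm else infra_per_host
    overage = custom_metrics_over_allocation / 100 * price_per_100_custom
    return hosts * per_host + overage


if __name__ == "__main__":
    print(datadog_monthly_cost(100))                # infrastructure only
    print(datadog_monthly_cost(100, apm=True))      # infrastructure + APM
    # Assumed: 2,000 custom metrics beyond the included allocation.
    print(datadog_monthly_cost(100, apm=True, custom_metrics_over_allocation=2_000))
```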

Cloud-Native Monitoring (CloudWatch, Azure Monitor, GCP Cloud Monitoring)#

Each major cloud provider offers an integrated monitoring service. These tools are deeply integrated with the provider’s services, require minimal setup, and bill alongside your other cloud resources.

Choose cloud-native monitoring when:

  • You operate exclusively in one cloud and want monitoring tightly integrated with that cloud’s services (auto-discovery of EC2 instances, RDS metrics, Lambda invocations).
  • Billing integration is valuable – monitoring costs appear alongside compute and storage.
  • Compliance requirements mandate that monitoring data stays within the cloud provider’s infrastructure.
  • Your monitoring needs are modest and mostly focused on cloud-managed services rather than custom application metrics.

Limitations to understand:

  • Weaker query languages compared to PromQL. CloudWatch Metrics Insights is improving but remains less powerful.
  • Expensive for high-cardinality custom metrics. CloudWatch charges $0.30 per custom metric per month; at 10,000 custom metrics, that is $3,000/month.
  • Vendor lock-in. Dashboards, alarms, and queries do not port to other providers.
  • Multi-cloud is painful. If you operate in two clouds, you need two monitoring systems or a third-party aggregation layer.
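
The custom-metric cost above grows faster than it first appears, because CloudWatch counts each unique combination of metric name and dimension values as a separate metric. The sketch below applies the flat $0.30 rate quoted in this section; the function name and the example counts are assumptions, and actual pricing may be tiered by volume and region.

```python
def cloudwatch_custom_metric_cost(metric_names: int,
                                  dimension_combinations: int,
                                  price_per_metric: float = 0.30) -> float:
    """Monthly cost if every metric name is emitted for every dimension combination."""
    billable_metrics = metric_names * dimension_combinations
    return billable_metrics * price_per_metric


if __name__ == "__main__":
    # Assumed: 20 application metrics, each reported per instance for 500 instances.
    cost = cloudwatch_custom_metric_cost(metric_names=20, dimension_combinations=500)
    print(f"~${cost:,.0f}/month")
```

Twenty metric names sounds modest, but per-instance dimensions are what push the billable count to the 10,000 used in the example above.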

Grafana Cloud#

Grafana Cloud offers managed Prometheus (Mimir), Loki (logs), and Tempo (traces) with the Grafana UI. You get the Prometheus ecosystem without managing the infrastructure. Your existing PromQL dashboards and alerting rules work without modification.

Choose Grafana Cloud when:

  • You want the Prometheus/Grafana ecosystem without the operational burden of running it.
  • Your team has existing PromQL expertise and Grafana dashboards.
  • You need logs (Loki) + metrics (Mimir) + traces (Tempo) in one managed platform with open-source compatibility.
  • The free tier (10,000 series, 50GB logs, 50GB traces) covers your initial needs and you want to grow into it.

Limitations to understand:

  • Costs scale with active series and data volume. At high cardinality, costs can approach Datadog levels.
  • Some features (Adaptive Metrics, Grafana SLO) are Grafana-specific and create mild lock-in.
  • You still need to run collectors (Grafana Agent/Alloy or Prometheus) in your infrastructure to ship metrics.

Cost Modeling: 100 Nodes, 500 Series Each (50,000 Active Series)#

| Solution | Monthly Cost Estimate | What is Included |
|---|---|---|
| Prometheus + Grafana (self-managed) | $0 (software) + infrastructure | You pay for the compute/storage running Prometheus and Grafana |
| VictoriaMetrics (self-managed) | $0 (software) + infrastructure | Less infrastructure than Prometheus due to better compression |
| Datadog (Infrastructure) | $1,500/month (100 hosts x $15) | Metrics, host maps, integrations, 15 months retention |
| Datadog (Infrastructure + APM) | $2,300/month (100 hosts x $23) | Above + APM, distributed tracing |
| AWS CloudWatch | ~$1,500-3,000/month | Depends on custom metric count and API call volume |
| Grafana Cloud (Pro) | ~$600-1,200/month | 50K series in the Pro tier; exact cost depends on retention and query volume |

Self-managed costs depend entirely on the infrastructure you allocate. A typical Prometheus setup for 50,000 series might need 2-4 vCPU and 8-16GB RAM for Prometheus itself, plus a Grafana instance. At cloud compute prices, this is roughly $100-300/month in infrastructure.
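
Infrastructure is only half of the self-managed bill; the other half is the engineering time raised under Decision Criteria. A rough way to compare is to ask how many operations hours per month it takes before self-managing costs as much as a SaaS option. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the $100/hour loaded engineering rate is a placeholder you should replace with your own figure.

```python
def self_managed_monthly_cost(infra_cost: float,
                              ops_hours: float,
                              hourly_rate: float) -> float:
    """Infrastructure spend plus the cost of engineering time spent operating it."""
    return infra_cost + ops_hours * hourly_rate


def break_even_ops_hours(saas_cost: float,
                         infra_cost: float,
                         hourly_rate: float) -> float:
    """Monthly ops hours at which self-managed costs the same as the SaaS option."""
    return max(0.0, (saas_cost - infra_cost) / hourly_rate)


if __name__ == "__main__":
    saas = 1_500.0   # Datadog (Infrastructure) estimate from the table above
    infra = 200.0    # midpoint of the $100-300 self-managed estimate
    rate = 100.0     # placeholder loaded hourly rate (assumption)
    hours = break_even_ops_hours(saas, infra, rate)
    print(f"Break-even at {hours:.0f} ops hours/month")
```

If your team expects to spend more than the break-even hours each month on upgrades, capacity, and incidents, the SaaS price is easier to justify; if less, self-managing wins on cost.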

Hybrid Patterns#

Prometheus for infrastructure + Datadog for APM: Use Prometheus for infrastructure and Kubernetes metrics (low cost, full control). Use Datadog specifically for APM and distributed tracing where its auto-instrumentation provides the most value. This limits Datadog costs to the hosts running instrumented applications.

Cloud-native for cloud-service metrics + Prometheus for detailed metrics: Use CloudWatch/Azure Monitor for cloud-service metrics that are available automatically. Deploy Prometheus for custom application metrics and Kubernetes-level monitoring. This avoids recreating what the cloud provider already collects at no extra charge.

Prometheus + Grafana Cloud for long-term storage: Run Prometheus locally for real-time monitoring with 15-day retention. Remote-write to Grafana Cloud for long-term storage and cross-cluster views. This keeps local monitoring fast while offloading retention to a managed service.

Decision Summary#

Budget-constrained, skilled team: Self-managed Prometheus + Grafana. Add VictoriaMetrics if Prometheus resource usage becomes a problem. Add Thanos/Mimir only when you genuinely need long-term retention or multi-cluster queries.

Moderate budget, want less operations: Grafana Cloud. You keep PromQL and your dashboards. You stop managing Prometheus infrastructure.

Generous budget, want everything integrated: Datadog. You get metrics, logs, traces, APM, and RUM in one platform with no infrastructure to manage.

Single cloud, minimal team: Cloud-native monitoring. It works out of the box and you do not need to deploy anything.

Starting fresh with no existing investment: Start with Prometheus + Grafana (learn the ecosystem), then evaluate whether to self-manage long-term or migrate to Grafana Cloud or Datadog based on your team’s operational appetite.
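
The summary can also be read as a small rule table. The sketch below encodes it as a function, mainly to make the precedence between rules explicit. The field names, the budget categories, and the order in which rules are checked are assumptions made for illustration; real decisions will weigh the criteria less mechanically.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    budget: str                      # "constrained", "moderate", or "generous"
    single_cloud: bool = False
    minimal_team: bool = False
    existing_investment: bool = True


def recommend(s: Situation) -> str:
    """Map a situation to the matching line of the decision summary."""
    if not s.existing_investment:
        return "Prometheus + Grafana, then re-evaluate"
    if s.single_cloud and s.minimal_team:
        return "Cloud-native monitoring"
    if s.budget == "constrained":
        return "Self-managed Prometheus + Grafana"
    if s.budget == "moderate":
        return "Grafana Cloud"
    if s.budget == "generous":
        return "Datadog"
    raise ValueError(f"unknown budget category: {s.budget!r}")


if __name__ == "__main__":
    print(recommend(Situation(budget="moderate")))
    print(recommend(Situation(budget="constrained", single_cloud=True, minimal_team=True)))
```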