## Choosing a Log Aggregation Stack
Logs are the most fundamental observability signal. Every application produces them, every incident investigation starts with them, and every compliance framework requires retaining them. The challenge is not collecting logs – it is storing, indexing, querying, and retaining them at scale without spending a fortune.
The choice of log aggregation stack determines your query speed, operational burden, storage costs, and how effectively you can correlate logs with metrics and traces during incident response.
### Decision Criteria
Before evaluating tools, establish your requirements:
- Log volume: 1GB/day and 1TB/day require fundamentally different architectures. The tool that works at 10GB/day may collapse at 500GB/day.
- Query patterns: Do you need full-text search across all log content? Or do you mostly filter by known labels (service name, severity, pod) and then grep within results?
- Retention period: 7 days, 30 days, 1 year? Daily volume multiplied by retention sets your storage footprint: 50GB/day kept for 30 days is 1.5TB before compression.
- Correlation: Do you need to jump from a log line to the corresponding metric or trace? Tight integration between logs, metrics, and traces reduces mean time to resolution.
- Operational capacity: Do you have a team that can manage Elasticsearch clusters, or do you need something that runs without constant attention?
- Compliance: Do regulations require specific retention periods, immutable storage, or data residency?
### Options at a Glance
| Capability | Loki | Elasticsearch/OpenSearch | Cloud-Native Logs | Vector + ClickHouse |
|---|---|---|---|---|
| Indexing approach | Label-based (metadata only) | Full-text (inverted index on content) | Full-text (varies) | Column-oriented analytical |
| Query language | LogQL | Kibana KQL / Lucene / Query DSL | Vendor-specific | SQL (ClickHouse dialect) |
| Full-text search | Grep-like (filter then scan) | Native, fast | Yes | Yes (with full-text index) |
| Query speed (known labels) | Fast | Fast | Fast | Very fast |
| Query speed (unindexed search) | Slow over large ranges | Fast (everything is indexed) | Moderate | Fast (columnar compression) |
| Operational complexity | Low-Medium | High | None | Medium-High |
| Storage efficiency | Very high (compressed chunks) | Low (inverted index is large) | Managed | High (columnar compression) |
| Correlation with metrics | Native (Grafana, same labels) | Requires configuration | Vendor-dependent | Requires configuration |
| Cost model | Infrastructure only | Infrastructure only | Per GB ingested + stored | Infrastructure only |
### Grafana Loki
Loki is a log aggregation system designed by Grafana Labs. It indexes only metadata labels (like Prometheus labels), not the content of log lines. Log content is stored as compressed chunks in object storage (S3, GCS, filesystem). Queries filter by labels first, then grep through the matching chunks.
Choose Loki when:
- Cost is a primary driver. By not indexing log content, Loki uses dramatically less storage and compute than Elasticsearch for the same volume of logs.
- You already use Prometheus and Grafana. Loki uses the same label model as Prometheus, so you can correlate logs and metrics using identical label sets in the same Grafana dashboard.
- Your query pattern is “filter by service/namespace/pod, then search within those logs.” This is how most incident investigations work, and Loki handles it well.
- You want a system that is simple to operate compared to Elasticsearch. In monolithic or simple-scalable mode, Loki is a single binary or a small set of pods.
- LogQL is sufficient for your needs. LogQL supports filtering, pattern matching, aggregation, and metric extraction from logs.
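For a feel of the language, here is a sketch of a typical investigation query and a metric extraction (label and field names like `namespace`, `app`, and `duration` are illustrative):

```logql
{namespace="prod", app="checkout"} |= "timeout" | json | duration > 2s
```

```logql
sum by (app) (rate({namespace="prod"} |= "error" [5m]))
```

The first query filters by indexed labels, then greps and parses only the matching chunks; the second turns a log stream into a rate you can graph or alert on alongside Prometheus metrics.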
Limitations to understand:
- No full-text index. Searching for a specific error message across all services over a 7-day window is slow because Loki must decompress and scan every chunk that matches the label filter. Narrow your time range and labels, or wait.
- Label cardinality matters. Too many unique label values (user IDs or request IDs as labels) degrade performance and inflate the index. Use structured log fields instead (see the example after this list).
- Not suitable for security analytics where analysts need to run arbitrary searches across all log content at interactive speed.
- Advanced queries (complex regex across large datasets) can time out or consume significant resources.
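On the cardinality point, the fix is usually to demote the field from a label into the log body and filter at query time (field names are illustrative):

```logql
{app="api"} | json | request_id="8f3a2c1d"
```

This keeps the label index small while still letting you pin down a single request in the parsed output.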
### Elasticsearch / OpenSearch (ELK Stack)
Elasticsearch (or its open-source fork, OpenSearch) indexes every word in every log line using an inverted index. This makes full-text search fast at the cost of significant storage and compute. Kibana (or OpenSearch Dashboards) provides visualization and ad-hoc exploration.
Choose Elasticsearch/OpenSearch when:
- Full-text search is a hard requirement. Security teams searching for specific IP addresses, error codes, or patterns across billions of log lines need inverted-index performance.
- You need rich querying: aggregations, faceted search, field-level statistics, and complex boolean queries across log content.
- Compliance or security log analysis requires fast, ad-hoc search across all historical logs.
- Your organization has existing Elasticsearch expertise and operational capacity to manage clusters.
- Kibana’s visualization capabilities (saved searches, dashboards, Lens) match your needs.
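As a sketch of the kind of ad-hoc question the inverted index answers cheaply (index pattern and field names are illustrative):

```
POST /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message": "connection refused" } },
        { "range": { "@timestamp": { "gte": "now-7d" } } }
      ]
    }
  },
  "aggs": {
    "by_service": { "terms": { "field": "service.keyword" } }
  }
}
```

Loki would have to decompress and scan every chunk in that seven-day window to answer the same question.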
Limitations to understand:
- Resource-hungry. Elasticsearch needs significant RAM for the JVM heap (typically 50% of node memory, up to 32GB) and fast disks for the inverted index. A production cluster for moderate log volumes (50-100GB/day) typically needs 3-6 nodes with 32-64GB RAM each.
- Operational complexity is high. Index lifecycle management (ILM), shard sizing, replica configuration, JVM tuning, cluster health monitoring, and rolling upgrades require dedicated attention (a sample lifecycle policy follows this list).
- Storage amplification. The inverted index alone can be 1-2x the size of the raw data, so 1GB of logs can consume 2-3GB of storage before replication.
- Expensive to operate at scale. Both infrastructure costs and engineering time for cluster management add up.
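Much of that operational surface is configuration. As a minimal sketch, an ILM policy that rolls indices over daily and deletes them after 30 days (policy name and thresholds are illustrative):

```
PUT _ilm/policy/logs-30d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```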
### Cloud-Native Log Services
Each major cloud provider offers a managed log service: AWS CloudWatch Logs, Azure Log Analytics (part of Azure Monitor), and GCP Cloud Logging. These require no infrastructure management and integrate tightly with the provider’s other services.
Choose cloud-native log services when:
- You operate in a single cloud and want zero operational overhead for log management.
- Log volume is low to medium (under ~50GB/day), keeping costs manageable.
- Integration with cloud-native services is important (CloudWatch Logs automatically captures Lambda output, ECS logs, VPC flow logs).
- Compliance requires logs to remain within the cloud provider’s infrastructure.
- Your team does not have Elasticsearch or Loki expertise and does not want to develop it.
Limitations to understand:
- Expensive at high volume. CloudWatch Logs charges $0.50/GB for ingestion and $0.03/GB/month for storage. At 50GB/day, ingestion alone costs $750/month. Azure Log Analytics charges $2.76/GB ingested (with commitment tiers reducing this significantly).
- Query capabilities vary. Compared to Elasticsearch or even Loki, CloudWatch Logs Insights is improving but remains less powerful; Azure's Kusto Query Language (KQL) is the standout and is genuinely capable (see the comparison after this list).
- Vendor lock-in. Log queries, dashboards, and alerting rules are all defined in the provider’s format and do not port.
- Multi-cloud environments require aggregating logs from multiple provider-specific systems, which is awkward and expensive (cross-region/cross-account data transfer charges).
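For a feel of the query side, here is the same "error volume over time" question in both dialects (log fields and table names are illustrative). CloudWatch Logs Insights:

```
fields @timestamp, @message
| filter level = "error"
| stats count(*) by bin(5m)
```

And KQL against a Log Analytics workspace:

```
ContainerLog
| where LogEntry contains "error"
| summarize count() by bin(TimeGenerated, 5m)
```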
### Vector + ClickHouse
An emerging pattern uses Vector (an open-source, high-performance observability data pipeline maintained by Datadog) as the log collector and ClickHouse (a columnar analytical database) as the storage backend. ClickHouse's columnar storage and compression deliver excellent query performance and storage efficiency.
Choose Vector + ClickHouse when:
- Log volume is very high (hundreds of GB to TB per day) and cost optimization is critical.
- You need fast analytical queries over structured log data (aggregations, counts, percentiles by field).
- Your team has SQL expertise and prefers SQL over LogQL or Elasticsearch’s query DSL.
- You want a single analytical database that can handle both logs and other analytical workloads.
- You are willing to invest in building and tuning a custom log pipeline.
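To make the pattern concrete, here is a minimal sketch of a ClickHouse log table and a typical query, assuming structured JSON logs (table, columns, and retention are illustrative choices, not a reference schema):

```sql
CREATE TABLE logs
(
    timestamp DateTime64(3) CODEC(Delta, ZSTD),
    service   LowCardinality(String),
    level     LowCardinality(String),
    message   String CODEC(ZSTD),
    trace_id  String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service, level, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;

-- Errors per service over the last hour: reads only the columns it needs.
SELECT service, count() AS errors
FROM logs
WHERE level = 'error' AND timestamp > now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY errors DESC;
```

The LowCardinality columns and the sort key do much of the work the inverted index does in Elasticsearch, at a fraction of the storage.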
Limitations to understand:
- This is a newer pattern with a smaller community. There are fewer ready-made dashboards, tutorials, and battle-tested configurations compared to Loki or Elasticsearch.
- More DIY assembly required. You are building a pipeline from components rather than deploying an integrated product (the sketch after this list shows a minimal wiring).
- ClickHouse operational knowledge is specialized. Cluster management, replication, and schema design for log data require expertise.
- Visualization requires either Grafana with the ClickHouse plugin or building custom tooling.
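To show what the assembly involves, a minimal Vector configuration that tails Kubernetes container logs, parses the JSON body, and writes to a ClickHouse table might look like this (endpoint, database, and table names are illustrative; Vector does ship a native ClickHouse sink):

```toml
[sources.k8s]
type = "kubernetes_logs"

[transforms.parse]
type = "remap"
inputs = ["k8s"]
# Lines that are not valid JSON are forwarded unchanged by default
# (controlled by drop_on_error).
source = '. = parse_json!(string!(.message))'

[sinks.ch]
type = "clickhouse"
inputs = ["parse"]
endpoint = "http://clickhouse:8123"
database = "default"
table = "logs"
```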
### Log Collector Comparison
The collector is the agent that reads logs from sources (files, stdout, journal) and ships them to the backend. The choice of collector is somewhat independent of the backend.
| Collector | Primary Backend | Other Backends | Language | Resource Usage | Strengths |
|---|---|---|---|---|---|
| Promtail | Loki | Loki only | Go | Low | Tight Loki integration, label extraction, Kubernetes service discovery |
| Grafana Alloy | Loki, Mimir, Tempo | Many via OTEL | Go | Low-Medium | Successor to Grafana Agent, supports metrics/logs/traces in one agent |
| Fluent Bit | Any | 30+ output plugins | C | Very low (~5MB RAM) | Extremely lightweight, wide output support, good for edge/IoT |
| Fluentd | Any | 700+ plugins | Ruby/C | Medium (~40-100MB) | Massive plugin ecosystem, flexible routing, mature |
| Vector | Any | Many | Rust | Low | High performance, built-in transforms (parsing, filtering, aggregation), type safety |
- Promtail if you use Loki and nothing else. It is purpose-built and has the tightest integration.
- Fluent Bit if you need the absolute minimum resource footprint or send logs to multiple backends. Its ~5MB baseline memory usage makes it suitable for resource-constrained environments.
- Fluentd if you need a specific plugin from its massive ecosystem or complex routing logic with buffering and retry semantics.
- Vector if performance matters (it consistently benchmarks among the fastest collectors), you want built-in parsing and transformation before shipping, or you are using ClickHouse as a backend.
- Grafana Alloy if you want a single agent for metrics, logs, and traces in the Grafana ecosystem.
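As a taste of collector configuration, here is a minimal Fluent Bit sketch fanning a single input out to two backends (hostnames and the label are illustrative):

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*

[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.internal
    Labels  job=fluent-bit

[OUTPUT]
    Name    es
    Match   kube.*
    Host    elasticsearch.internal
    Index   logs
```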
### Cost Modeling: 50GB/Day for 30 Days
| Solution | Monthly Ingestion Cost | Monthly Storage Cost | Infrastructure Cost | Total Estimate |
|---|---|---|---|---|
| Loki (self-managed, S3 backend) | $0 | ~$35 (1.5TB in S3, compressed to ~500GB) | ~$200-400 (compute) | $235-435/month |
| Elasticsearch (self-managed) | $0 | Included in infrastructure | ~$800-1,500 (3-6 nodes, 32GB+ RAM each) | $800-1,500/month |
| CloudWatch Logs | $750 (50GB/day x $0.50/GB x 30 days) | ~$45 (1.5TB x $0.03/GB) | $0 | ~$795/month |
| Azure Log Analytics | ~$1,700-3,400 (depends on tier) | Included | $0 | $1,700-3,400/month |
| Vector + ClickHouse (self-managed) | $0 | ~$25 (excellent compression) | ~$300-600 (compute) | $325-625/month |
Loki’s cost advantage comes from not building an inverted index. The compressed chunk storage is dramatically cheaper than Elasticsearch’s indexed storage. ClickHouse’s columnar compression achieves similar storage efficiency with faster analytical queries at the cost of more operational complexity.
### The Structured Logging Imperative
Regardless of which backend you choose, structured JSON logging makes every option work better.
With unstructured logs:
```
2024-03-15 10:23:45 ERROR Failed to connect to database host=db-primary port=5432 timeout=30s
```

With structured logs:

```json
{"timestamp":"2024-03-15T10:23:45Z","level":"error","msg":"Failed to connect to database","host":"db-primary","port":5432,"timeout":"30s","service":"api","trace_id":"abc123"}
```

Structured logs enable:
- Loki: Extract fields at query time with LogQL pattern or JSON parsers without storing them as high-cardinality labels.
- Elasticsearch: Automatic field mapping and field-level aggregations without custom grok patterns.
- ClickHouse: Store fields in typed columns for fast analytical queries.
- All backends: Correlation with traces via trace_id fields. Filtering by any field. Consistent parsing without brittle regex.
Invest in structured logging before choosing a backend. It is the single highest-leverage improvement you can make to your logging pipeline.
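Most language ecosystems make this a one-line change. As one sketch, Go's standard log/slog package emits exactly this kind of JSON (field names mirror the example above):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler writes one structured object per log line to stdout.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Error("Failed to connect to database",
		"host", "db-primary",
		"port", 5432,
		"timeout", "30s",
		"service", "api",
	)
}
```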
### Decision Summary
- Cost-sensitive, Grafana ecosystem: Loki. Best cost efficiency, seamless Grafana integration, good enough for most query patterns. Accept that unindexed full-text search over large time ranges is slow.
- Security/compliance, need full-text search: Elasticsearch or OpenSearch. The inverted index makes arbitrary search fast. Accept the operational burden and infrastructure cost.
- Single cloud, minimal operations team: Cloud-native logs. Accept the per-GB costs and vendor lock-in in exchange for zero operational burden.
- Very high volume, analytical workloads: Vector + ClickHouse. Best performance per dollar at scale for structured log analysis. Accept the DIY assembly and operational complexity.
- Starting fresh: Begin with Loki + Promtail + Grafana. It is the lowest-cost entry point with the most room to grow. If you hit its query limitations, you have learned your actual requirements and can make a more informed decision about Elasticsearch or ClickHouse.