## Choosing a Log Aggregation Stack
Logs are the most fundamental observability signal. Every application produces them, every incident investigation starts with them, and every compliance framework requires retaining them. The challenge is not collecting logs – it is storing, indexing, querying, and retaining them at scale without spending a fortune.
The choice of log aggregation stack determines your query speed, operational burden, storage costs, and how effectively you can correlate logs with metrics and traces during incident response.
### Decision Criteria
Before evaluating tools, establish your requirements:
- Log volume: 1GB/day and 1TB/day require fundamentally different architectures. The tool that works at 10GB/day may collapse at 500GB/day.
- Query patterns: Do you need full-text search across all log content? Or do you mostly filter by known labels (service name, severity, pod) and then grep within results?
- Retention period: 7 days, 30 days, 1 year? Daily volume multiplied by retention sets your storage footprint: 50GB/day kept for 30 days is 1.5TB before compression.
- Correlation: Do you need to jump from a log line to the corresponding metric or trace? Tight integration between logs, metrics, and traces reduces mean time to resolution.
- Operational capacity: Do you have a team that can manage Elasticsearch clusters, or do you need something that runs without constant attention?
- Compliance: Do regulations require specific retention periods, immutable storage, or data residency?
### Options at a Glance
| Capability | Loki | Elasticsearch/OpenSearch | Cloud-Native Logs | Vector + ClickHouse |
|---|---|---|---|---|
| Indexing approach | Label-based (metadata only) | Full-text (inverted index on content) | Full-text (varies) | Column-oriented analytical |
| Query language | LogQL | Kibana KQL / Lucene / Query DSL | Vendor-specific | SQL (ClickHouse dialect) |
| Full-text search | Grep-like (filter then scan) | Native, fast | Yes | Yes (with full-text index) |
| Query speed (known labels) | Fast | Fast | Fast | Very fast |
| Query speed (unindexed search) | Slow over large ranges | Fast (everything is indexed) | Moderate | Fast (columnar compression) |
| Operational complexity | Low-Medium | High | None | Medium-High |
| Storage efficiency | Very high (compressed chunks) | Low (inverted index is large) | Managed | High (columnar compression) |
| Correlation with metrics | Native (Grafana, same labels) | Requires configuration | Vendor-dependent | Requires configuration |
| Cost model | Infrastructure only | Infrastructure only | Per GB ingested + stored | Infrastructure only |
### Grafana Loki
Loki is a log aggregation system designed by Grafana Labs. It indexes only metadata labels (like Prometheus labels), not the content of log lines. Log content is stored as compressed chunks in object storage (S3, GCS, filesystem). Queries filter by labels first, then grep through the matching chunks.
Choose Loki when:
- Cost is a primary driver. By not indexing log content, Loki uses dramatically less storage and compute than Elasticsearch for the same volume of logs.
- You already use Prometheus and Grafana. Loki uses the same label model as Prometheus, so you can correlate logs and metrics using identical label sets in the same Grafana dashboard.
- Your query pattern is “filter by service/namespace/pod, then search within those logs.” This is how most incident investigations work, and Loki handles it well.
- You want a system that is simple to operate compared to Elasticsearch. In monolithic or simple-scalable mode, Loki is a single binary or a small set of pods.
- LogQL is sufficient for your needs. LogQL supports filtering, pattern matching, aggregation, and metric extraction from logs.
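For a feel of the language, here is a sketch of a typical investigation query and a metric extraction (label and field names like `namespace`, `app`, and `duration` are illustrative):

```logql
{namespace="prod", app="checkout"} |= "timeout" | json | duration > 2s
```

```logql
sum by (app) (rate({namespace="prod"} |= "error" [5m]))
```

The first query filters by indexed labels, then greps and parses only the matching chunks; the second turns a log stream into a rate you can graph or alert on alongside Prometheus metrics.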
Limitations to understand:
- No full-text index. Searching for a specific error message across all services over a 7-day window is slow because Loki must decompress and scan every chunk that matches the label filter. Narrow your time range and labels, or wait.
- Label cardinality matters. Too many unique label values (user IDs or request IDs as labels) degrade performance and inflate the index. Use structured log fields instead (see the example after this list).
- Not suitable for security analytics where analysts need to run arbitrary searches across all log content at interactive speed.
- Advanced queries (complex regex across large datasets) can time out or consume significant resources.
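On the cardinality point, the fix is usually to demote the field from a label into the log body and filter at query time (field names are illustrative):

```logql
{app="api"} | json | request_id="8f3a2c1d"
```

This keeps the label index small while still letting you pin down a single request in the parsed output.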
### Elasticsearch / OpenSearch (ELK Stack)
Elasticsearch (or its open-source fork, OpenSearch) indexes every word in every log line using an inverted index. This makes full-text search fast at the cost of significant storage and compute. Kibana (or OpenSearch Dashboards) provides visualization and ad-hoc exploration.
Choose Elasticsearch/OpenSearch when:
- Full-text search is a hard requirement. Security teams searching for specific IP addresses, error codes, or patterns across billions of log lines need inverted-index performance.
- You need rich querying: aggregations, faceted search, field-level statistics, and complex boolean queries across log content.
- Compliance or security log analysis requires fast, ad-hoc search across all historical logs.
- Your organization has existing Elasticsearch expertise and operational capacity to manage clusters.
- Kibana’s visualization capabilities (saved searches, dashboards, Lens) match your needs.
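As a sketch of the kind of ad-hoc question the inverted index answers cheaply (index pattern and field names are illustrative):

```
POST /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message": "connection refused" } },
        { "range": { "@timestamp": { "gte": "now-7d" } } }
      ]
    }
  },
  "aggs": {
    "by_service": { "terms": { "field": "service.keyword" } }
  }
}
```

Loki would have to decompress and scan every chunk in that seven-day window to answer the same question.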
Limitations to understand:
- Resource-hungry. Elasticsearch needs significant RAM for the JVM heap (typically 50% of node memory, up to 32GB) and fast disks for the inverted index. A production cluster for moderate log volumes (50-100GB/day) typically needs 3-6 nodes with 32-64GB RAM each.
- Operational complexity is high. Index lifecycle management (ILM), shard sizing, replica configuration, JVM tuning, cluster health monitoring, and rolling upgrades require dedicated attention (a sample lifecycle policy follows this list).
- Storage amplification. The inverted index alone can be 1-2x the size of the raw data, so 1GB of logs can consume 2-3GB of storage before replication.
- Expensive to operate at scale. Both infrastructure costs and engineering time for cluster management add up.
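Much of that operational surface is configuration. As a minimal sketch, an ILM policy that rolls indices over daily and deletes them after 30 days (policy name and thresholds are illustrative):

```
PUT _ilm/policy/logs-30d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```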
### Cloud-Native Log Services
Each major cloud provider offers a managed log service: AWS CloudWatch Logs, Azure Log Analytics (part of Azure Monitor), and GCP Cloud Logging. These require no infrastructure management and integrate tightly with the provider’s other services.
Choose cloud-native log services when:
- You operate in a single cloud and want zero operational overhead for log management.
- Log volume is low to medium (under ~50GB/day), keeping costs manageable.
- Integration with cloud-native services is important (CloudWatch Logs automatically captures Lambda output, ECS logs, VPC flow logs).
- Compliance requires logs to remain within the cloud provider’s infrastructure.
- Your team does not have Elasticsearch or Loki expertise and does not want to develop it.
Limitations to understand:
- Expensive at high volume. CloudWatch Logs charges $0.50/GB for ingestion and $0.03/GB/month for storage. At 50GB/day, ingestion alone costs $750/month. Azure Log Analytics charges $2.76/GB ingested (with commitment tiers reducing this significantly).
- Query capabilities vary. Compared to Elasticsearch or even Loki, CloudWatch Logs Insights is improving but remains less powerful; Azure's Kusto Query Language (KQL) is the standout and is genuinely capable (see the comparison after this list).
- Vendor lock-in. Log queries, dashboards, and alerting rules are all defined in the provider’s format and do not port.
- Multi-cloud environments require aggregating logs from multiple provider-specific systems, which is awkward and expensive (cross-region/cross-account data transfer charges).
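For a feel of the query side, here is the same "error volume over time" question in both dialects (log fields and table names are illustrative). CloudWatch Logs Insights:

```
fields @timestamp, @message
| filter level = "error"
| stats count(*) by bin(5m)
```

And KQL against a Log Analytics workspace:

```
ContainerLog
| where LogEntry contains "error"
| summarize count() by bin(TimeGenerated, 5m)
```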
### Vector + ClickHouse
An emerging pattern uses Vector (an open-source, high-performance observability data pipeline maintained by Datadog) as the log collector and ClickHouse (a columnar analytical database) as the storage backend. ClickHouse's columnar storage and compression deliver excellent query performance and storage efficiency.
Choose Vector + ClickHouse when:
- Log volume is very high (hundreds of GB to TB per day) and cost optimization is critical.
- You need fast analytical queries over structured log data (aggregations, counts, percentiles by field).
- Your team has SQL expertise and prefers SQL over LogQL or Elasticsearch’s query DSL.
- You want a single analytical database that can handle both logs and other analytical workloads.
- You are willing to invest in building and tuning a custom log pipeline.
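To make the pattern concrete, here is a minimal sketch of a ClickHouse log table and a typical query, assuming structured JSON logs (table, columns, and retention are illustrative choices, not a reference schema):

```sql
CREATE TABLE logs
(
    timestamp DateTime64(3) CODEC(Delta, ZSTD),
    service   LowCardinality(String),
    level     LowCardinality(String),
    message   String CODEC(ZSTD),
    trace_id  String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service, level, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;

-- Errors per service over the last hour: reads only the columns it needs.
SELECT service, count() AS errors
FROM logs
WHERE level = 'error' AND timestamp > now() - INTERVAL 1 HOUR
GROUP BY service
ORDER BY errors DESC;
```

The LowCardinality columns and the sort key do much of the work the inverted index does in Elasticsearch, at a fraction of the storage.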
Limitations to understand:
- This is a newer pattern with a smaller community. There are fewer ready-made dashboards, tutorials, and battle-tested configurations compared to Loki or Elasticsearch.
- More DIY assembly required. You are building a pipeline from components rather than deploying an integrated product (the sketch after this list shows a minimal wiring).
- ClickHouse operational knowledge is specialized. Cluster management, replication, and schema design for log data require expertise.
- Visualization requires either Grafana with the ClickHouse plugin or building custom tooling.
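To show what the assembly involves, a minimal Vector configuration that tails Kubernetes container logs, parses the JSON body, and writes to a ClickHouse table might look like this (endpoint, database, and table names are illustrative; Vector does ship a native ClickHouse sink):

```toml
[sources.k8s]
type = "kubernetes_logs"

[transforms.parse]
type = "remap"
inputs = ["k8s"]
# Lines that are not valid JSON are forwarded unchanged by default
# (controlled by drop_on_error).
source = '. = parse_json!(string!(.message))'

[sinks.ch]
type = "clickhouse"
inputs = ["parse"]
endpoint = "http://clickhouse:8123"
database = "default"
table = "logs"
```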
### Log Collector Comparison
The collector is the agent that reads logs from sources (files, stdout, journal) and ships them to the backend. The choice of collector is somewhat independent of the backend.
| Collector | Primary Backend | Other Backends | Language | Resource Usage | Strengths |
|---|---|---|---|---|---|
| Promtail | Loki | Loki only | Go | Low | Tight Loki integration, label extraction, Kubernetes service discovery |
| Grafana Alloy | Loki, Mimir, Tempo | Many via OTEL | Go | Low-Medium | Successor to Grafana Agent, supports metrics/logs/traces in one agent |
| Fluent Bit | Any | 30+ output plugins | C | Very low (~5MB RAM) | Extremely lightweight, wide output support, good for edge/IoT |
| Fluentd | Any | 700+ plugins | Ruby/C | Medium (~40-100MB) | Massive plugin ecosystem, flexible routing, mature |
| Vector | Any | Many | Rust | Low | High performance, built-in transforms (parsing, filtering, aggregation), type safety |
- Promtail if you use Loki and nothing else. It is purpose-built and has the tightest integration.
- Fluent Bit if you need the absolute minimum resource footprint or send logs to multiple backends. Its ~5MB baseline memory usage makes it suitable for resource-constrained environments.
- Fluentd if you need a specific plugin from its massive ecosystem or complex routing logic with buffering and retry semantics.
- Vector if performance matters (it consistently benchmarks among the fastest collectors), you want built-in parsing and transformation before shipping, or you are using ClickHouse as a backend.
- Grafana Alloy if you want a single agent for metrics, logs, and traces in the Grafana ecosystem.
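As a taste of collector configuration, here is a minimal Fluent Bit sketch fanning a single input out to two backends (hostnames and the label are illustrative):

```ini
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*

[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.internal
    Labels  job=fluent-bit

[OUTPUT]
    Name    es
    Match   kube.*
    Host    elasticsearch.internal
    Index   logs
```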
### Cost Modeling: 50GB/Day for 30 Days
| Solution | Monthly Ingestion Cost | Monthly Storage Cost | Infrastructure Cost | Total Estimate |
|---|---|---|---|---|
| Loki (self-managed, S3 backend) | $0 | ~$35 (1.5TB in S3, compressed to ~500GB) | ~$200-400 (compute) | $235-435/month |
| Elasticsearch (self-managed) | $0 | Included in infrastructure | ~$800-1,500 (3-6 nodes, 32GB+ RAM each) | $800-1,500/month |
| CloudWatch Logs | $750 (50GB/day x $0.50/GB x 30 days) | ~$45 (1.5TB x $0.03/GB) | $0 | ~$795/month |
| Azure Log Analytics | ~$1,700-3,400 (depends on tier) | Included | $0 | $1,700-3,400/month |
| Vector + ClickHouse (self-managed) | $0 | ~$25 (excellent compression) | ~$300-600 (compute) | $325-625/month |
Loki’s cost advantage comes from not building an inverted index. The compressed chunk storage is dramatically cheaper than Elasticsearch’s indexed storage. ClickHouse’s columnar compression achieves similar storage efficiency with faster analytical queries at the cost of more operational complexity.
### The Structured Logging Imperative
Regardless of which backend you choose, structured JSON logging makes every option work better.
With unstructured logs:
```
2024-03-15 10:23:45 ERROR Failed to connect to database host=db-primary port=5432 timeout=30s
```

With structured logs:

```json
{"timestamp":"2024-03-15T10:23:45Z","level":"error","msg":"Failed to connect to database","host":"db-primary","port":5432,"timeout":"30s","service":"api","trace_id":"abc123"}
```

Structured logs enable:
- Loki: Extract fields at query time with LogQL pattern or JSON parsers without storing them as high-cardinality labels.
- Elasticsearch: Automatic field mapping and field-level aggregations without custom grok patterns.
- ClickHouse: Store fields in typed columns for fast analytical queries.
- All backends: Correlation with traces via trace_id fields. Filtering by any field. Consistent parsing without brittle regex.
Invest in structured logging before choosing a backend. It is the single highest-leverage improvement you can make to your logging pipeline.
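Most language ecosystems make this a one-line change. As one sketch, Go's standard log/slog package emits exactly this kind of JSON (field names mirror the example above):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler writes one structured object per log line to stdout.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Error("Failed to connect to database",
		"host", "db-primary",
		"port", 5432,
		"timeout", "30s",
		"service", "api",
	)
}
```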
### Decision Summary
- Cost-sensitive, Grafana ecosystem: Loki. Best cost efficiency, seamless Grafana integration, good enough for most query patterns. Accept that unindexed full-text search over large time ranges is slow.
- Security/compliance, need full-text search: Elasticsearch or OpenSearch. The inverted index makes arbitrary search fast. Accept the operational burden and infrastructure cost.
- Single cloud, minimal operations team: Cloud-native logs. Accept the per-GB costs and vendor lock-in in exchange for zero operational burden.
- Very high volume, analytical workloads: Vector + ClickHouse. Best performance per dollar at scale for structured log analysis. Accept the DIY assembly and operational complexity.
- Starting fresh: Begin with Loki + Promtail + Grafana. It is the lowest-cost entry point with the most room to grow. If you hit its query limitations, you have learned your actual requirements and can make a more informed decision about Elasticsearch or ClickHouse.