A single-node Kubernetes cluster running seven Helm-managed services concurrently — Gitea, Mattermost, PostgreSQL, kube-prometheus-stack, Jenkins, Temporal, and NATS — looks tractable on paper. The charts are all upstream-maintained. The hardware is modest but adequate. The operational reality is that zero of the seven ran cleanly on out-of-the-box values. Every chart needed at least one customization to coexist with the others, and several needed substantial rewrites of the helm-values surface. This survey catalogs what those customizations are, why each was necessary, and what the common failure modes look like across the fleet.
The frame is “production-shape services on a small cluster” rather than “production cluster, scaled down.” The trade-offs are different. HA dependencies that are free on a multi-node cluster are pure overhead on a single node. Backup discipline that is automatic in a managed offering is a per-service homework assignment. Resource budgets that nobody thinks about at scale dominate every chart-version decision.
A survey of 7 helm-managed services
The table is the article’s anchor. Every column is load-bearing.
| # | Service | Helm chart | Chart version | ARM64-friendly? | Custom image needed? | Backup mechanism | Per-service customization required | Major gotchas hit |
|---|---|---|---|---|---|---|---|---|
| 1 | Gitea | gitea-charts/gitea | 12.5.3 | yes (rootless image is multi-arch) | no | dedicated cron-driven repo backup script | disable bundled redis-cluster + redis + postgresql + postgresql-ha; point at external PG; admin user inline in values | bundled HA deps light up by default and exhaust resources unless explicitly disabled |
| 2 | Mattermost | mattermost/mattermost-team-edition | 6.6.96 | NO — upstream is amd64-only | YES — built locally from MM ARM64 binary tarball; tag your-registry/mattermost-team-edition:10.5.0-arm64 | none (PV snapshot only) | external PG via env vars + configJSON; bot accounts + access tokens enabled; SiteURL fixed | upstream image lacks ARM64 manifest; QEMU fallback runs but Go runtime crashes (lfstack.push); must bake own image |
| 3 | PostgreSQL | bitnami/postgresql | 18.6.2 (multiple revisions) | yes | no | none yet — backlog gap | initdb.scripts.init-all-dbs.sql creates 4 databases + 4 service roles on first boot; architecture: standalone | Bitnami chart wants HA by default; standalone + initdb is the multi-tenant pattern; PG15+ schema permissions trap |
| 4 | kube-prometheus-stack | prometheus-community/kube-prometheus-stack | 84.3.0 | yes | no | none (recreate from values) | disable kubeEtcd + kubeControllerManager (minikube static pods, no scrape endpoint); disable Grafana persistence (emptyDir for dev); disable NodeClockNotSynchronising + AlertmanagerClusterCrashlooping default rules; full alertmanager routing config inline | mattermost_configs receiver has no title: field — operator config validator silently rejects; chart-default rules produce single-node false-positives; --reuse-values silently ignores -f |
| 5 | Jenkins | jenkins/jenkins | 5.9.18 | yes | YES: your-registry/jenkins:latest with plugins pre-baked via jenkins-plugin-cli | PV snapshot only | controller.installPlugins: [] to skip broken plugin-copy step; JCasC inline (Gitea server, credentials, shared library, organization folder, kubernetes cloud); Docker socket hostPath mount; runs as root for socket access | helm chart’s apply_config.sh has a broken `yes n \| cp -i ...` line that crashes the pod when installPlugins is non-empty; pre-baking is the only stable path |
| 6 | Temporal | temporalio/temporal | 0.74.0 | yes | no | none (state in PG) | configMapsToMount: "sprig" + setConfigFilePath: true (chart default dockerize leaves {{ .Env.* }} unrendered); external PG with separate temporal + temporal_visibility DBs; disable bundled cassandra/mysql/postgresql/elasticsearch/prometheus/grafana | chart 0.74.0 ships server image 1.30.3 which uses sprig, not dockerize — default values produce Cassandra.Hosts: zero value error; chart 1.x restructures persistence keys (pinned to 0.74.0) |
| 7 | NATS | nats/nats | 2.12.6 | yes | no | none (ephemeral pub/sub) | cluster.enabled: false; jetstream.enabled: false (no persistence); single replica; tight memory cap | resource block lives under container.merge.resources not top-level resources — easy to miss |
Every customization in the rightmost column is mandatory for that chart to coexist with the others on a single node. The chart versions are pinned because the gotchas are version-specific — Temporal 1.x restructures persistence keys, the Jenkins apply_config.sh bug surfaces in 5.x, and the kube-prometheus-stack alertmanager schema validation tightened in the 80.x series. Re-verify against the chart you actually install.
Cross-cutting patterns
Every chart needs at least one customization
Of the seven surveyed, zero ran cleanly on default values. The reasons cluster into five categories:
- HA dependencies defaulting on: Gitea bundles redis-cluster + redis + postgresql + postgresql-ha. Mattermost bundles MySQL. Temporal bundles cassandra + mysql + elasticsearch + prometheus + grafana. Each one consumes 256-512 MiB of requests before doing any useful work.
- Resource defaults too aggressive for a shared single node: kube-prometheus-stack at defaults requests over 4 GiB by itself. Jenkins requests 2 GiB. Helm-default `resources:` blocks across these seven charts sum to over 16 GiB of requests, and the cluster won’t schedule a single application pod.
- Architecture mismatch: Mattermost’s upstream image ships only amd64. QEMU user-mode emulation runs the binary, but the Go runtime crashes on `lfstack.push`. The fix is rebuilding from the ARM64 binary tarball; see building ARM64 container images when upstream doesn’t ship them and Kubernetes on Apple Silicon setup gotchas.
- Chart bugs and broken install paths: Jenkins’s plugin install in `apply_config.sh` crashes the pod. Temporal chart 0.74.0 ships a server image that uses sprig templates while the chart’s defaults assume dockerize. Both need workarounds; neither bug can be fixed from the outside.
- Single-node false-positive alerts: kube-prometheus-stack ships rules for VM clock drift and alertmanager cluster crashlooping that fire constantly on a single-node minikube cluster. They have to be disabled at install time, or alert fatigue arrives within hours.
The one-line takeaway: plan a values file before you `helm install`. Treating Helm defaults as a starting point on a small cluster guarantees a wedge state.
Helm defaults assume multi-node production
Every “free” HA dependency in a helm chart is a memory tax on a single-node cluster. Bundled redis is free on a 5-node cluster because nothing else needs that 256 MiB. On a single node sharing memory with PostgreSQL, Prometheus, Grafana, the application pods themselves, and the kubelet, that 256 MiB has to come from somewhere. The shape of the trade-off is identical for every chart surveyed: the bundled dependency is convenient, defaults to on, and is the first thing to disable.
The principle generalizes: read the values.yaml top to bottom before installing any chart on a constrained cluster. Look for enabled: true on anything labeled redis, postgresql, mysql, cassandra, elasticsearch, or prometheus. Most of them want to be off.
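Part of the read-through can be mechanized. The sketch below assumes the chart’s values.yaml has already been parsed into a Python dict, for example from `helm show values <chart>` output via a YAML library. It walks the tree and reports any block whose key mentions one of the dependency names above and whose `enabled` is `true`. The function name is hypothetical, and name matching is a heuristic rather than a chart schema.

```python
from typing import Any

# Substrings the survey flags as "usually want to be off" on a single node.
BUNDLED_DEPS = ("redis", "postgresql", "mysql", "cassandra", "elasticsearch", "prometheus")


def enabled_bundled_deps(values: dict[str, Any], prefix: str = "") -> list[str]:
    """Return dotted paths such as 'redis-cluster.enabled' that are set to true."""
    hits: list[str] = []
    for key, child in values.items():
        path = f"{prefix}{key}"
        if not isinstance(child, dict):
            continue
        # A bundled dependency block looks like `<name>: {enabled: true, ...}`.
        if child.get("enabled") is True and any(dep in key.lower() for dep in BUNDLED_DEPS):
            hits.append(f"{path}.enabled")
        # Recurse: some charts nest sub-charts under a parent key.
        hits.extend(enabled_bundled_deps(child, prefix=f"{path}."))
    return hits


if __name__ == "__main__":
    # Hypothetical excerpt shaped like a chart's values.yaml, already parsed.
    chart_values = {
        "redis-cluster": {"enabled": True},
        "postgresql-ha": {"enabled": True},
        "postgresql": {"enabled": False},
        "gitea": {"admin": {"username": "gitea_admin"}},
    }
    for path in enabled_bundled_deps(chart_values):
        print(f"consider setting {path}: false")
```

Each reported path is a candidate line for the values file. Whether a given dependency really should be off is still a per-chart decision.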
Resource sizing is a budget, not a target
Helm resources: blocks across these seven charts default to numbers that assume infinite headroom. The actual budget is finite and shared. The discipline is to set requests and limits per service that sum to less than the cluster’s allocatable memory, with headroom for application workloads. See the resource budgeting section below for concrete numbers.
Authentication has three layers
The seven services together use three distinct credential patterns, and treating them uniformly leads to mistakes:
- Helm-values inline admin credentials (Gitea, Jenkins, Grafana). Convenient, but they leak into git history. Fine for dev clusters with a `<dev-password>` placeholder; use `existingSecret` references for any cluster reachable from outside localhost.
- Per-service service users (PostgreSQL roles, Mattermost user, Gitea user). Set via initdb scripts or chart `configJSON`. They survive helm upgrades and don’t change unless you intend them to.
- Per-app tokens (Gitea API tokens, Mattermost bot tokens, Slack/Mattermost webhook URLs). Always live in K8s Secrets, mounted via `secrets:` (the alertmanager pattern) or env-from-secret. Never inline them in helm-values, even in dev; they tend to outlive the dev cluster.
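This sketch makes the third layer concrete by showing what “lives in a K8s Secret” means in practice. The token goes into a Secret’s base64-encoded `data` field, and the chart references the Secret only by name. The Secret name, namespace, key and helper function are hypothetical. `kubectl create secret generic` produces the same kind of object without a script.

```python
import base64
import json


def token_secret(name: str, namespace: str, key: str, token: str) -> dict:
    """Return a v1 Secret manifest holding one token under `key`."""
    # Secret `data` values must be base64-encoded strings.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


if __name__ == "__main__":
    # Hypothetical names; the real token comes from the issuing service, never from git.
    manifest = token_secret("alertmanager-mattermost", "monitoring", "webhook-url", "<webhook-url>")
    print(json.dumps(manifest, indent=2))
```

The JSON output can be piped to `kubectl apply -f -`. The helm values then carry only the Secret’s name.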
Chart structure varies: read it before customizing
Each chart organizes its values differently, and the difference matters when overriding:
- NATS puts container resources under `container.merge.resources`, not top-level `resources`. Setting top-level does nothing.
- Mattermost uses both `extraEnvVars` AND `configJSON` for the same SQL settings. Both must agree or the pod refuses to start.
- Temporal nests every server component (`frontend`, `history`, `matching`, `worker`) with its own `replicaCount` and `resources`. Setting one doesn’t set the others.
- kube-prometheus-stack has `prometheus.prometheusSpec.resources` (the operator-managed Prometheus pod defined through the CRD) AND `prometheusOperator.resources` (the operator pod itself). They are separate budgets.
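`helm get values <release>` after install confirms which overrides Helm actually stored. `helm get manifest <release>` confirms whether the chart rendered them where you expect, which is what catches a misplaced key such as a top-level NATS `resources` block. See helm gotchas: reuse-values, revisions, rollback for why this matters.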
Single-node-specific overrides
A consistent set of overrides applies across charts when the target is single-node:
- Disable `kubeEtcd` and `kubeControllerManager` scrapes (minikube runs them as static pods with no scrape endpoint).
- Disable the `NodeClockNotSynchronising` rule (minikube/Docker Desktop VMs drift constantly; the alert is a false positive).
- Disable `AlertmanagerClusterCrashlooping` (a single replica means no cluster, so the rule fires forever).
- Set `imagePullPolicy: Never` for any locally-built image.
- Disable `initChownData` for Grafana; minikube hostPath PVs don’t need it.
These are not generic best practices; they’re single-node-specific. Carry them forward as a checklist for any chart added to the same cluster.
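The kube-prometheus-stack items on that checklist can be kept together as one override file. This sketch assumes the chart’s value keys are `kubeEtcd.enabled`, `kubeControllerManager.enabled`, `defaultRules.disabled` (a map of rule name to `true`), `grafana.persistence.enabled` and `grafana.initChownData.enabled`. Confirm each key with `helm show values prometheus-community/kube-prometheus-stack` for the pinned version. The `imagePullPolicy` item is per-chart and is not covered here.

```python
import json

# Default rules that fire forever on a single VM-backed node with one alertmanager replica.
SINGLE_NODE_FALSE_POSITIVES = ("NodeClockNotSynchronising", "AlertmanagerClusterCrashlooping")


def single_node_overrides() -> dict:
    """kube-prometheus-stack overrides from the single-node checklist (assumed key names)."""
    return {
        # minikube runs etcd and the controller-manager as static pods with no scrape endpoint.
        "kubeEtcd": {"enabled": False},
        "kubeControllerManager": {"enabled": False},
        "defaultRules": {"disabled": {rule: True for rule in SINGLE_NODE_FALSE_POSITIVES}},
        "grafana": {
            "persistence": {"enabled": False},  # emptyDir is enough for dev
            "initChownData": {"enabled": False},  # minikube hostPath PVs don't need it
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be saved and passed to `helm upgrade -f`.
    print(json.dumps(single_node_overrides(), indent=2))
```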
Resource budgeting under a memory cap
Pick a memory cap that matches the hardware. The example below uses 24 GiB — appropriate for a Mac mini-class workstation running Docker Desktop. The arithmetic is the same for any cap.
```
service            requests (cpu / mem)   limits (mem)
gitea              100m / 128Mi           512Mi
mattermost         250m / 512Mi           1Gi
postgresql         250m / 512Mi           2Gi
prometheus         200m / 512Mi           1Gi
grafana            100m / 128Mi           512Mi
alertmanager       50m / 64Mi             256Mi
prom-operator      100m / 256Mi           512Mi
jenkins            250m / 1Gi             2Gi
temporal (4 svc)   400m / 768Mi (sum)     1.5Gi (sum)
nats               50m / 64Mi             128Mi
TOTAL requests: ~1750m CPU / ~3.9 GiB memory (3968 MiB)
TOTAL limits:   ~9.4 GiB memory (9600 MiB)
```
The ~3.9 GiB requests sum is what the cluster scheduler reserves before any application workload arrives. With a 24 GiB cap, that leaves roughly 20 GiB for application pods plus minikube node overhead, which is enough headroom for a meaningful workload. Going to Helm defaults on every chart blows past 16 GiB of requests alone, before any application pod is scheduled. The cluster wedges.
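A few lines of Python make the budget arithmetic re-checkable whenever a row changes. In this sketch the rows are copied from the table. The quantity parsers handle only the `m`, `Mi` and `Gi` suffixes the table uses, not the full Kubernetes quantity grammar, and the function names are hypothetical.

```python
# (cpu request, memory request, memory limit) per service, copied from the table.
BUDGET = {
    "gitea": ("100m", "128Mi", "512Mi"),
    "mattermost": ("250m", "512Mi", "1Gi"),
    "postgresql": ("250m", "512Mi", "2Gi"),
    "prometheus": ("200m", "512Mi", "1Gi"),
    "grafana": ("100m", "128Mi", "512Mi"),
    "alertmanager": ("50m", "64Mi", "256Mi"),
    "prom-operator": ("100m", "256Mi", "512Mi"),
    "jenkins": ("250m", "1Gi", "2Gi"),
    "temporal (4 svc)": ("400m", "768Mi", "1.5Gi"),
    "nats": ("50m", "64Mi", "128Mi"),
}


def millicores(q: str) -> int:
    """Parse a CPU quantity: '250m' -> 250, '2' -> 2000."""
    return int(q[:-1]) if q.endswith("m") else round(float(q) * 1000)


def mebibytes(q: str) -> float:
    """Parse a memory quantity with a Mi or Gi suffix only (a deliberate simplification)."""
    for suffix, factor in (("Gi", 1024), ("Mi", 1)):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {q!r}")


def totals(budget: dict) -> tuple[int, float, float]:
    """Sum CPU requests (millicores), memory requests and memory limits (MiB)."""
    cpu = sum(millicores(c) for c, _, _ in budget.values())
    req = sum(mebibytes(r) for _, r, _ in budget.values())
    lim = sum(mebibytes(lim_q) for _, _, lim_q in budget.values())
    return cpu, req, lim


if __name__ == "__main__":
    cap_mib = 24 * 1024  # the 24 GiB cap from the example
    cpu, req, lim = totals(BUDGET)
    print(f"requests: {cpu}m CPU, {req:.0f} MiB ({req / 1024:.1f} GiB)")
    print(f"limits:   {lim:.0f} MiB ({lim / 1024:.1f} GiB)")
    print(f"left under the cap after requests: {(cap_mib - req) / 1024:.1f} GiB")
```

Summing in MiB avoids mixing binary and decimal units. 3968 MiB is about 3.9 GiB but about 4.2 GB, and that difference is easy to carry into the total by accident.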
The arithmetic forces three decisions early:
- Requests must be tight. The number that gets reserved is `requests`, not `limits`. Default `requests` are usually conservative for production and wasteful for single-node. Halve them and watch behavior under load before halving again.
- Limits should reflect peak, not average. PostgreSQL at idle uses 100 MiB; under a backfill query it uses 1.5 GiB. The `limits` slot exists to allow that peak without an OOMKill.
- Multi-tenant beats N instances. A single Bitnami PostgreSQL with `initdb` creating four databases and four roles requests 512 MiB. Four chart-bundled PostgreSQL instances request 4 × 512 MiB = 2 GiB. The math forces consolidation.
The multi-tenant PG pattern looks like:
```sql
-- initdb.scripts.init-all-dbs.sql (excerpt)
SELECT 'CREATE DATABASE temporal'
WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = 'temporal')\gexec
-- ...repeated for temporal_visibility, mattermost, gitea, etc.
DO $$
BEGIN
  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'temporal') THEN
    CREATE ROLE temporal WITH LOGIN PASSWORD '<dev-password>';
  END IF;
  -- ...repeated per service
END $$;
GRANT ALL PRIVILEGES ON DATABASE temporal TO temporal;
-- PG15+ no longer grants CREATE on the public schema to every role, so the
-- database-level GRANT alone cannot create tables; owning the database can.
ALTER DATABASE temporal OWNER TO temporal;
-- ...repeated
```
The trade-off accepted: shared PostgreSQL is also the single point of failure. That’s acceptable for dev/homelab and unacceptable for any production posture. Plan the migration to per-service or HA PG before the cluster carries production traffic.
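The `...repeated` lines in the init script are mechanical, so they can be generated rather than maintained by hand. In this sketch the function names and the database-to-role mapping are hypothetical. The survey names four databases but not every role, and passwords stay as the `<dev-password>` placeholder. The output relies on psql’s `\gexec` meta-command, so it has to run through psql. It also hands each database to its role with `ALTER DATABASE ... OWNER TO`, which is the usual answer to the PG15+ public-schema change.

```python
# Database -> owning role. The four databases are the ones the survey names;
# the role assignment (temporal owns both Temporal databases) is an assumption.
DATABASES = {
    "temporal": "temporal",
    "temporal_visibility": "temporal",
    "mattermost": "mattermost",
    "gitea": "gitea",
}


def role_sql(role: str) -> str:
    # Idempotent CREATE ROLE; names are interpolated unquoted, so trusted input only.
    return (
        "DO $$\nBEGIN\n"
        f"  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = '{role}') THEN\n"
        f"    CREATE ROLE {role} WITH LOGIN PASSWORD '<dev-password>';\n"
        "  END IF;\nEND $$;"
    )


def database_sql(db: str, owner: str) -> str:
    # Idempotent CREATE DATABASE via psql's gexec, then hand the database to its role.
    return (
        f"SELECT 'CREATE DATABASE {db}'\n"
        f"WHERE NOT EXISTS (SELECT FROM pg_database WHERE datname = '{db}')\\gexec\n"
        f"GRANT ALL PRIVILEGES ON DATABASE {db} TO {owner};\n"
        # PG15+: owning the database is what grants CREATE on its public schema.
        f"ALTER DATABASE {db} OWNER TO {owner};"
    )


def init_sql(databases: dict[str, str]) -> str:
    roles = sorted(set(databases.values()))  # roles must exist before they can own anything
    parts = [role_sql(r) for r in roles]
    parts += [database_sql(db, owner) for db, owner in databases.items()]
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(init_sql(DATABASES), end="")
```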
Common failure modes and what they tell you
The same five failure signatures recur across the surveyed charts. Recognizing the signature shortcuts diagnosis.
CrashLoopBackOff
```
Back-off restarting failed container <name> in pod <pod>
```
Three common causes, each with a distinctive log signature:
Image arch mismatch (Mattermost-class). The pod starts. The runtime aborts deep in Go’s lock-free stack:
```
runtime: failed to create new OS thread (have N already; errno=22)
fatal error: lfstack.push
```
There is no fix from the outside. Build a native ARM64 image and reference it.
OOMKilled. The limit is too low for steady-state:
```
State: Terminated, Reason: OOMKilled, ExitCode: 137
```
`kubectl describe pod` confirms the reason. Either raise the limit or find what’s consuming the unexpected memory. PostgreSQL after a schema-change run, Jenkins after a long build queue, and Prometheus after a series-cardinality spike are all common culprits.
Config error from a chart bug. Jenkins shows:
```
apply_config.sh: line N: cp: cannot stat ...
```
This is the broken plugin-copy path triggered by a non-empty `installPlugins`. Fix: empty the list and pre-bake plugins into the image. Temporal shows:
```
Persistence.DataStores[default].Cassandra.Hosts: zero value
```
Here the chart-default dockerize path leaves `{{ .Env.CASSANDRA_HOSTS }}` unrendered because the server image uses sprig. Fix: `configMapsToMount: "sprig"` and `setConfigFilePath: true`.
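Because each cause has a distinctive string, the first pass of triage can be a lookup. This sketch matches only the four signatures quoted above. The function name is hypothetical, and a miss means reading the full logs rather than “no problem”.

```python
# (substring to look for, likely cause and fix) taken from the signatures above.
SIGNATURES = (
    ("lfstack.push", "image architecture mismatch: build and reference a native ARM64 image"),
    ("OOMKilled", "memory limit below steady state: raise the limit or find the consumer"),
    ("apply_config.sh", "Jenkins plugin-copy bug: empty installPlugins and pre-bake plugins"),
    ("Cassandra.Hosts: zero value", "Temporal dockerize/sprig mismatch: configMapsToMount: sprig"),
)


def diagnose(text: str) -> list[str]:
    """Return every known cause whose signature appears in logs or describe output."""
    return [hint for needle, hint in SIGNATURES if needle in text]


if __name__ == "__main__":
    sample = "runtime: failed to create new OS thread (have 5 already; errno=22)\nfatal error: lfstack.push"
    for hint in diagnose(sample):
        print(hint)
```

In practice the input would be `kubectl logs <pod> --previous`, which shows output from the last crashed container instance. For the OOM case it would be `kubectl describe pod <pod>`, where the `OOMKilled` reason appears.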
ImagePullBackOff and ErrImageNeverPull
```
Failed to pull image "your-registry/mattermost-team-edition:10.5.0-arm64": ... not found
```
For locally-built images on minikube, two causes dominate:
- `imagePullPolicy` not set to `Never` (or `IfNotPresent`). Kubernetes tries the registry, gets nothing, and fails.
- `eval $(minikube docker-env)` was not run before `docker build`. The image landed in the host Docker daemon, not minikube’s. Running `docker images` from the wrong context confirms it.
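The first cause can bite even when nobody set a pull policy. When `imagePullPolicy` is omitted, Kubernetes defaults it to `Always` for a `:latest` or untagged image and to `IfNotPresent` for any other tag. A reference pinned by digest also gets `IfNotPresent`. A locally built `your-registry/jenkins:latest` therefore triggers a registry pull unless the policy is set explicitly. This is a minimal sketch of that defaulting rule, and the function names are hypothetical.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Tag of an image reference, or None if it has none (a registry port is not a tag)."""
    last_segment = image.rsplit("/", 1)[-1]
    return last_segment.split(":", 1)[1] if ":" in last_segment else None


def effective_pull_policy(image: str, policy: Optional[str] = None) -> str:
    """Policy Kubernetes applies when imagePullPolicy is omitted from the container spec."""
    if policy:
        return policy
    if "@" in image:  # pinned by digest
        return "IfNotPresent"
    return "Always" if image_tag(image) in (None, "latest") else "IfNotPresent"


if __name__ == "__main__":
    # Images from the survey's table, with no explicit imagePullPolicy set.
    for ref in ("your-registry/jenkins:latest", "your-registry/mattermost-team-edition:10.5.0-arm64"):
        print(ref, "->", effective_pull_policy(ref))
```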
Pending pods, no schedulable node
```
0/1 nodes are available: 1 Insufficient memory.
```
Single node + every chart at Helm defaults = wedge state. Diagnose with:
```shell
kubectl describe nodes | grep -A 10 "Allocated resources"
```
The fix is to lower requests, not raise the cap. Raising the cap pushes the same problem out by one dependency.
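The same check can be scripted, for example to stop a pre-install step when requests are already close to allocatable. This sketch assumes the usual `kubectl describe nodes` layout of the Allocated resources block: a Requests column, then a Limits column, each followed by a percentage. It also assumes a single node. The sample text is illustrative, with numbers chosen to echo the budget table, and the function names are hypothetical.

```python
import re

# One row of the "Allocated resources" table, e.g. "  memory  3968Mi (16%)  9600Mi (39%)".
ROW = re.compile(
    r"^[ \t]*(cpu|memory)[ \t]+(\S+)[ \t]+\((\d+)%\)[ \t]+(\S+)[ \t]+\((\d+)%\)[ \t]*$",
    re.MULTILINE,
)


def allocated(describe_output: str) -> dict[str, tuple[str, int, str, int]]:
    """Map 'cpu'/'memory' to (requests, requests %, limits, limits %) for a single node."""
    return {
        m.group(1): (m.group(2), int(m.group(3)), m.group(4), int(m.group(5)))
        for m in ROW.finditer(describe_output)
    }


def tight(rows: dict[str, tuple[str, int, str, int]], threshold: int = 90) -> list[str]:
    """Resources whose requests are at or above `threshold` percent of allocatable."""
    return [name for name, (_, req_pct, _, _) in rows.items() if req_pct >= threshold]


# Illustrative excerpt; real output has more sections and rows (ephemeral-storage, hugepages-*).
SAMPLE = """\
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1750m (21%)   0 (0%)
  memory             3968Mi (16%)  9600Mi (39%)
"""

if __name__ == "__main__":
    rows = allocated(SAMPLE)
    print(rows, tight(rows))
```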
Operator silent rejection of alertmanager config
The smoking gun is in the operator logs, not the alertmanager logs. Alerts never deliver, alertmanager looks healthy, the chart shows deployed. The operator is rejecting the config:
```
Sync error: failed to apply alertmanager config: unknown field "title" in mattermost_configs
```
The alertmanager pod runs the previous valid config and accepts no updates. The fix is to remove the offending field: `title` does not exist in `mattermost_configs`, so fold any title into the `text:` body. See Prometheus stack alertmanager operations for the deeper dive.
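The fold can run as a pre-flight step on the values file before `helm upgrade`. This sketch assumes the inline Alertmanager config has already been parsed into a dict. It uses the `receivers`, `mattermost_configs`, `title` and `text` names from the error and fix above. The function name is hypothetical, and the check does not replace tailing the operator logs.

```python
def fold_titles(config: dict) -> list[str]:
    """Move any `title` into `text` for every mattermost_configs entry, in place.

    Returns the names of receivers that were changed.
    """
    changed = []
    for receiver in config.get("receivers", []):
        for entry in receiver.get("mattermost_configs", []):
            if "title" not in entry:
                continue
            title = entry.pop("title")
            body = entry.get("text", "")
            # Keep the title visible as the first line of the message body.
            entry["text"] = f"{title}\n{body}" if body else title
            changed.append(receiver.get("name", "<unnamed>"))
    return changed


if __name__ == "__main__":
    # Hypothetical receiver; the template strings are placeholders.
    cfg = {"receivers": [{"name": "mm", "mattermost_configs": [
        {"title": "{{ .CommonLabels.alertname }}", "text": "{{ .CommonAnnotations.summary }}"}]}]}
    print(fold_titles(cfg), cfg)
```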
helm upgrade --reuse-values silently ignoring -f
```shell
helm upgrade <release> ... --reuse-values -f values.yaml   # WRONG: -f is silently ignored
helm upgrade <release> ... -f values.yaml                  # CORRECT
```
No warning printed. No error. The chart redeploys with the previous values. Always verify with:
```shell
helm get values <release> -n <namespace>
```
This trap accounts for a disproportionate share of “I changed the values and nothing happened” debugging sessions. See helm gotchas: reuse-values, revisions, rollback.
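The verification can be made routine instead of relying on eyeballing. This sketch compares the overrides you meant to apply against `helm get values <release> -n <namespace> -o json`, and only leaf values that appear in the intended dict are compared. The helper names are hypothetical. Note that the check confirms what Helm stored, not whether the chart reads those keys.

```python
import json
import subprocess


def live_values(release: str, namespace: str) -> dict:
    """User-supplied values Helm actually stored for the release."""
    out = subprocess.run(
        ["helm", "get", "values", release, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out) or {}  # a release without user values may print null


def missing_overrides(intended: dict, live, prefix: str = "") -> list[str]:
    """Leaf paths in `intended` whose value is absent or different in `live`."""
    problems = []
    for key, want in intended.items():
        path = f"{prefix}{key}"
        have = live.get(key) if isinstance(live, dict) else None
        if isinstance(want, dict):
            problems.extend(missing_overrides(want, have, f"{path}."))
        elif have != want:
            problems.append(f"{path}: wanted {want!r}, live {have!r}")
    return problems


if __name__ == "__main__":
    # Hypothetical: what was meant to land vs. what `helm get values` returned.
    intended = {"container": {"merge": {"resources": {"limits": {"memory": "128Mi"}}}}}
    live = {"container": {"merge": {"resources": {"limits": {"memory": "256Mi"}}}}}
    print(missing_overrides(intended, live))
```

Against a real release, `missing_overrides(intended, live_values("<release>", "<namespace>"))` should return an empty list after a successful upgrade.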
Backup discipline as a per-service problem
Backup posture across the seven services is uneven, and the unevenness is itself worth naming.
| Service | Backup status |
|---|---|
| Gitea | dedicated cron-driven script (best in fleet) |
| PostgreSQL | gap — no scheduled dumps; PV snapshot only |
| Mattermost | gap — file uploads on PV, no off-cluster copy |
| Jenkins | gap — JENKINS_HOME on PV, plugins re-bakeable but jobs are not |
| Prometheus | acceptable: config and rules recreate from values; metric history is treated as disposable |
| Temporal | partial — workflow state in PG (covered when PG is) |
| NATS | n/a — ephemeral |
The PostgreSQL gap is the most consequential. Gitea, Mattermost, and Temporal, along with the application workloads, keep their state in the shared PostgreSQL. A PG loss takes Temporal workflow history, Mattermost messages, Gitea metadata, and application data with it. No PG-backed service can be better protected than PostgreSQL itself, and PG has no scheduled dumps yet.
The general lesson: backups are a per-service discipline, not a per-cluster one. “We snapshot the volumes” papers over the question. Per-service it becomes “what is the recovery procedure for THIS service’s state?” The PV-snapshot answer rarely survives that translation. See single-node Kubernetes disaster recovery for the recovery-procedure side of the same problem.
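The dependency argument can be written down so it stays visible as services are added. A service’s effective backup posture is the worst of its own posture and that of every store it keeps state in. In this sketch the three-level labels and the dependency map are a hypothetical simplification of the table above, and the dependency graph is assumed to be acyclic.

```python
# Worst first: an effective posture is never better than its weakest link.
ORDER = ("gap", "partial", "covered")

# Own posture per service, simplified from the table (labels are assumptions);
# "covered" means nothing the service holds itself is unprotected.
OWN = {
    "gitea": "covered",       # cron-driven repo backups
    "postgresql": "gap",
    "mattermost": "gap",
    "jenkins": "gap",
    "prometheus": "covered",  # recreated from values
    "temporal": "covered",    # holds no state outside PostgreSQL
    "nats": "covered",        # ephemeral
}

# Where each service keeps state it cannot recreate.
DEPENDS_ON = {"gitea": ["postgresql"], "mattermost": ["postgresql"], "temporal": ["postgresql"]}


def effective(service: str) -> str:
    """Worst posture across the service and everything it stores state in."""
    statuses = [OWN[service]] + [effective(dep) for dep in DEPENDS_ON.get(service, [])]
    return min(statuses, key=ORDER.index)


if __name__ == "__main__":
    for name in OWN:
        print(f"{name:12} own={OWN[name]:8} effective={effective(name)}")
```

Flipping `postgresql` to `covered` lifts Temporal and Gitea with it, which is the argument for closing the PG gap first.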
When to vendor your own image
Five of the seven services run upstream images. Two — Mattermost and Jenkins — required vendoring. The decision pattern:
| Service | Choice | Why |
|---|---|---|
| Gitea | upstream (rootless) | publishes ARM64; rootless avoids permission grief on hostPath PVs |
| Mattermost | vendor own | no ARM64 image upstream; QEMU emulation crashes Go runtime; only path is rebuild from binary tarball |
| PostgreSQL | upstream (Bitnami) | publishes multi-arch; chart is mature; standalone mode well-supported |
| kube-prometheus-stack | upstream | massive chart with deep CRD coupling; forking would mean fork-forever |
| Jenkins | vendor own | bundled plugin-install step in apply_config.sh is broken; pre-baking via jenkins-plugin-cli is upstream-recommended for prod anyway |
| Temporal | upstream (pinned to 0.74.0) | chart works after configMapsToMount: sprig flip; 1.x major restructure deferred |
| NATS | upstream | small, simple, just works |
The vendor-own decision criterion has two halves: (a) upstream doesn’t ship the architecture you need, OR (b) the chart’s runtime install path is broken in a way that’s not fixable from the outside. Mattermost is case (a). Jenkins is case (b). Both produce a Dockerfile that’s measured in tens of lines, not hundreds:
```dockerfile
# Mattermost ARM64 (sketch)
FROM ubuntu:22.04
ARG MM_VERSION=10.5.0
# ubuntu:22.04 ships without curl or CA certificates; install them first.
RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates curl \
 && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL https://releases.mattermost.com/${MM_VERSION}/mattermost-${MM_VERSION}-linux-arm64.tar.gz \
 | tar xz -C /opt/
# ...user, entrypoint, etc.
```
```dockerfile
# Jenkins with pre-baked plugins
FROM jenkins/jenkins:lts
COPY plugins.txt /usr/share/jenkins/ref/
RUN jenkins-plugin-cli --plugin-file /usr/share/jenkins/ref/plugins.txt
```
The decision NOT to fork the helm chart matters as much as the decision to vendor the image. Every service except Mattermost and Jenkins fits in fewer than 60 lines of values.yaml. Forking trades a 50-line values file for a chart you now maintain. Chart-version drift outpaces a fork’s value within two or three upstream releases.
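Half (a) of the criterion can be checked before anything is deployed by looking at the image’s manifest list. This sketch takes the JSON printed by `docker manifest inspect <image>`, parsed into a dict, and looks for a `linux/arm64` entry. Half (b) still has to come from reading the chart, so it is just a boolean input here. The function names are hypothetical.

```python
from typing import Optional


def ships_arch(manifest: dict, arch: str = "arm64", os_name: str = "linux") -> Optional[bool]:
    """Whether a manifest list / OCI index (as JSON) has an entry for os/arch.

    Returns None for a single-platform manifest, whose platform is recorded in
    the image config rather than in the manifest itself.
    """
    entries = manifest.get("manifests")
    if entries is None:
        return None
    return any(
        e.get("platform", {}).get("architecture") == arch
        and e.get("platform", {}).get("os") == os_name
        for e in entries
    )


def should_vendor(ships: Optional[bool], runtime_path_broken: bool) -> bool:
    """Missing architecture OR a runtime install path broken beyond what values can fix.

    An unknown architecture (None) is left to manual inspection, not counted as missing.
    """
    return runtime_path_broken or ships is False


if __name__ == "__main__":
    # Trimmed, hypothetical `docker manifest inspect` output for an amd64-only image.
    amd64_only = {"manifests": [{"platform": {"architecture": "amd64", "os": "linux"}}]}
    print(should_vendor(ships_arch(amd64_only), runtime_path_broken=False))
```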
Anti-patterns
A handful of patterns recur often enough to be worth naming as anti-patterns:
- “Use the helm chart’s bundled PostgreSQL.” Fine for one service. Deadly across seven. A multi-tenant single PG with `initdb` creating per-service databases replaces N PostgreSQL footprints with one and gives a single backup target.
- “Set `--reuse-values` because `-f` should be additive.” Silent override. Always verify with `helm get values`.
- “Skip the ARM64 check, QEMU will handle it.” Works for shell utilities. Fails on Go binaries. The crash signature is `lfstack.push` deep in the Go runtime; there is no application-level fix.
- “Install Jenkins plugins at runtime via the helm chart’s `installPlugins:`.” The chart’s `apply_config.sh` is broken. Pre-bake plugins into the image.
- “Trust the alertmanager config validator.” The operator silently rejects unknown fields. Verify by tailing operator logs after every config change.
- “Helm-default `resources:` are sane defaults.” They’re sane for production multi-node clusters. On a single node they sum to a wedge state.
Quotable lessons
- Every Helm chart needs at least one customization on a single-node cluster. Plan a values file before you `helm install`.
- Helm defaults are written for production multi-node clusters. On a small cluster every “free” HA dependency is a memory tax.
- If a service publishes no ARM64 image, you’ll be vendoring your own. There is no QEMU shortcut for Go binaries.
- A multi-tenant PostgreSQL with `initdb` scripts beats N bundled PG instances by roughly a factor of N in memory cost.
- Backups are a per-service discipline, not a per-cluster one. Track each service’s plan separately or it slips.
- When `helm upgrade` doesn’t take effect, check `helm get values` first. `--reuse-values` silently overrides `-f`.
- Pre-bake Jenkins plugins. The Helm chart’s runtime install path is fragile and reduces every deploy to a coin flip.
Where this article fits
This is the meta-survey: seven services side-by-side, the cross-cutting patterns that show up only when they’re operated together, and the failure modes that span charts. For per-service depth:
- Self-hosting Gitea on Kubernetes — chart 1, the rootless image and external-PG pattern.
- Building ARM64 container images when upstream doesn’t ship them — chart 2, the Mattermost custom-image build.
- Prometheus stack alertmanager operations — chart 4, the alertmanager routing and validator-rejection trap.
- Helm gotchas: reuse-values, revisions, rollback — the cross-cutting helm operational patterns.
- Kubernetes on Apple Silicon setup gotchas — the substrate this whole survey runs on.
- Single-node Kubernetes disaster recovery — the backup-and-recovery posture the gap analysis above demands.
Read this first to understand the shape of the problem; read the per-service articles when a specific chart needs depth.