PKI Fundamentals#

A Public Key Infrastructure (PKI) is a hierarchy of trust. At the top sits the Root CA, a certificate authority that signs its own certificate and is explicitly trusted by all participants. Below it are Intermediate CAs, signed by the root, which handle day-to-day certificate issuance. At the bottom are leaf certificates, the actual certificates used by servers, clients, and workloads.

Root CA (self-signed, offline, 10-20 year validity)
  |
  +-- Intermediate CA (signed by root, online, 3-5 year validity)
        |
        +-- Leaf Certificate (signed by intermediate, 90 days or less)
        +-- Leaf Certificate
        +-- Leaf Certificate

Never use the root CA directly to sign leaf certificates. If the root CA’s private key is compromised, the entire PKI must be rebuilt from scratch. Keeping it offline and behind an intermediate CA limits the blast radius. If an intermediate CA is compromised, you revoke it and issue a new one from the root – leaf certificates from other intermediates remain valid.

Certificate Lifecycle#

Every certificate follows the same lifecycle: generation (create a private key and CSR), signing (a CA signs the CSR to produce the certificate), distribution (deliver the certificate to the service that needs it), monitoring (track expiry dates and alert before they expire), renewal (generate and sign a new certificate before the old one expires), and revocation (invalidate a certificate before its natural expiry if compromised).

The most common operational failure is letting certificates expire. Expired certificates cause outages that are difficult to diagnose because the error messages vary across TLS implementations and are often unhelpful.
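One way to catch this before it becomes an outage is a scheduled check built on `openssl x509 -checkend`, which reports whether a certificate will have expired a given number of seconds from now. The following is a minimal sketch, not a complete monitoring solution; the helper name `expires_within_days` is hypothetical.

```shell
# expires_within_days CERT_FILE DAYS
# Returns 0 (true) if the PEM certificate expires within DAYS days,
# 1 (false) if it remains valid for longer than that.
# An unreadable file also returns 0, so a missing cert is flagged, not ignored.
expires_within_days() {
  cert_file=$1
  days=$2
  seconds=$((days * 86400))
  # -checkend exits 0 when the certificate is still valid after $seconds
  # and non-zero when it will have expired by then; invert that here.
  if openssl x509 -in "$cert_file" -noout -checkend "$seconds" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

A cron job could then run something like `expires_within_days server.crt 14 && echo "renew server.crt soon"`, where `server.crt` stands in for whatever certificate file you track.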

Building an Internal PKI#

Root CA#

Generate the root CA offline. This means on an air-gapped machine or at minimum a machine that will be shut down after the root CA is created. Store the private key in an HSM (Hardware Security Module) or encrypted on offline storage.

# Generate root CA (do this on an air-gapped machine)
openssl genrsa -aes256 -out root-ca.key 4096
openssl req -new -x509 -sha256 -days 7300 -key root-ca.key \
  -out root-ca.crt \
  -subj "/C=US/O=MyOrg/CN=MyOrg Root CA"

The root CA validity should be 10 to 20 years. You will use it infrequently – only to sign or renew intermediate CAs.

Intermediate CA#

The intermediate CA runs online and handles certificate issuance. It is signed by the root CA:

# Generate intermediate CA key and CSR
openssl genrsa -aes256 -out intermediate-ca.key 4096
openssl req -new -sha256 -key intermediate-ca.key \
  -out intermediate-ca.csr \
  -subj "/C=US/O=MyOrg/CN=MyOrg Intermediate CA"

# Sign with root CA (on the air-gapped machine)
openssl x509 -req -sha256 -days 1825 -in intermediate-ca.csr \
  -CA root-ca.crt -CAkey root-ca.key -CAcreateserial \
  -out intermediate-ca.crt \
  -extfile <(printf "basicConstraints=critical,CA:true,pathlen:0\nkeyUsage=critical,keyCertSign,cRLSign")

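The pathlen:0 constraint means no other CA certificate may sit below this intermediate in a valid chain: it can sign leaf certificates, but a subordinate CA signed by it would fail path validation. The intermediate CA validity of 3 to 5 years means you rotate it periodically, but not so often that rotation becomes a burden.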
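Before distributing the intermediate, it is worth confirming that the constraints actually ended up in the signed certificate. A small sketch of that check follows; `show_ca_constraints` is a hypothetical helper that filters the standard `openssl x509 -text` output.

```shell
# show_ca_constraints CERT_FILE
# Prints the Basic Constraints and Key Usage extensions of a PEM certificate,
# each header followed by its value line.
show_ca_constraints() {
  openssl x509 -in "$1" -noout -text \
    | grep -A1 -E 'X509v3 (Basic Constraints|Key Usage)'
}
```

Running `show_ca_constraints intermediate-ca.crt` on the certificate signed above should list `CA:TRUE, pathlen:0` under Basic Constraints and `Certificate Sign, CRL Sign` under Key Usage, both marked critical.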

Leaf Certificates#

Leaf certificates are signed by the intermediate CA. Keep validity short – 90 days maximum, shorter if automation supports it:

openssl genrsa -out server.key 2048
openssl req -new -sha256 -key server.key -out server.csr \
  -subj "/C=US/O=MyOrg/CN=app.internal"

openssl x509 -req -sha256 -days 90 -in server.csr \
  -CA intermediate-ca.crt -CAkey intermediate-ca.key -CAcreateserial \
  -out server.crt \
  -extfile <(printf "basicConstraints=critical,CA:false\nkeyUsage=critical,digitalSignature,keyEncipherment\nextendedKeyUsage=serverAuth\nsubjectAltName=DNS:app.internal,DNS:app.production.svc.cluster.local,IP:10.0.1.50")

Always include all hostnames and IP addresses in the Subject Alternative Name (SAN) extension. The CN field is deprecated for hostname verification. Modern TLS libraries only check SANs.
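Once the leaf is signed, the whole hierarchy can be checked end to end with `openssl verify`. The sketch below wraps that call; `verify_chain` is a hypothetical helper name, and the file names in the usage note are the ones produced by the commands above.

```shell
# verify_chain ROOT_CERT INTERMEDIATE_CERT LEAF_CERT
# Succeeds only if the leaf chains to the root through the intermediate.
verify_chain() {
  # -CAfile names the trust anchor; -untrusted supplies the intermediate,
  # which a server would normally send alongside the leaf.
  openssl verify -CAfile "$1" -untrusted "$2" "$3"
}
```

For example, `verify_chain root-ca.crt intermediate-ca.crt server.crt` should report `server.crt: OK`; a failure here usually means the leaf was signed by a different intermediate or one of the certificates has expired.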

cert-manager in Kubernetes#

cert-manager automates certificate lifecycle in Kubernetes. It watches for Certificate resources and handles issuance, renewal, and storage as Kubernetes Secrets.

ClusterIssuer Types#

ACME (Let’s Encrypt) for public-facing services:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx

CA issuer for internal services using your own PKI:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-keypair

Vault issuer for integration with Vault PKI:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-issuer
spec:
  vault:
    path: pki_int/sign/internal-certs
    server: https://vault.internal:8200
    auth:
      kubernetes:
        role: cert-manager
        mountPath: /v1/auth/kubernetes
        serviceAccountRef:
          name: cert-manager

Request a certificate by creating a Certificate resource:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-tls
  namespace: production
spec:
  secretName: app-tls-cert
  issuerRef:
    name: internal-ca
    kind: ClusterIssuer
  dnsNames:
    - app.production.svc.cluster.local
    - app.internal
  duration: 720h      # 30 days
  renewBefore: 240h   # renew when 10 days remain

cert-manager renews certificates at two-thirds of their lifetime by default. The renewBefore field overrides this. The renewed certificate is written to the same Secret, and pods that mount the Secret as a volume see the new files after the next kubelet sync. Two caveats: the application must reload the files from disk, and volumes mounted with subPath do not receive updates.
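The renewal timing is simple arithmetic, sketched below. This only illustrates the rule described above and is not cert-manager's own code; `renewal_offset_hours` is a hypothetical helper.

```shell
# renewal_offset_hours DURATION_HOURS [RENEW_BEFORE_HOURS]
# Prints how many hours after issuance renewal is attempted.
renewal_offset_hours() {
  duration=$1
  renew_before=${2:-}
  if [ -n "$renew_before" ]; then
    # Explicit renewBefore: renew that long before expiry.
    echo $((duration - renew_before))
  else
    # Default: renew once two-thirds of the lifetime has elapsed.
    echo $((duration * 2 / 3))
  fi
}

renewal_offset_hours 720 240   # duration 720h, renewBefore 240h: prints 480
renewal_offset_hours 2160      # 90-day certificate, default rule: prints 1440
```

For the Certificate above, renewal therefore starts 480 hours (20 days) after issuance, which happens to coincide with the two-thirds default for a 720h duration.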

Vault PKI Secrets Engine#

Vault’s PKI engine generates certificates dynamically. Instead of managing a static intermediate CA and signing CSRs manually, Vault acts as the intermediate CA and issues certificates on demand:

# Enable PKI engine
vault secrets enable -path=pki_int pki

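# Generate the intermediate key pair inside Vault and save the CSR
# (with the "internal" type the private key never leaves Vault)
vault write -field=csr pki_int/intermediate/generate/internal \
  common_name="MyOrg Intermediate CA" > intermediate.csr

# Sign intermediate.csr with your root CA (offline); the validity
# (for example 43800h, about 5 years) is set at signing time.
# Then import the signed certificate:
vault write pki_int/intermediate/set-signed \
  certificate=@intermediate-signed.crt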

# Create a role for issuing certificates
vault write pki_int/roles/internal-certs \
  allowed_domains="internal,svc.cluster.local" \
  allow_subdomains=true \
  max_ttl=72h

Applications request certificates directly from Vault:

vault write pki_int/issue/internal-certs \
  common_name="app.production.svc.cluster.local" \
  alt_names="app.internal" \
  ttl=24h

Vault returns the certificate, private key, and CA chain. The certificate is valid for 24 hours. For most internal services this removes the need for revocation infrastructure, because the certificate expires before you would typically discover a compromise and complete the revocation process.

SPIFFE/SPIRE for Workload Identity#

SPIFFE (Secure Production Identity Framework for Everyone) defines a standard for workload identity. Each workload gets a SPIFFE ID (a URI like spiffe://cluster.local/ns/production/sa/frontend) and an X.509 certificate (an X.509-SVID, short for SPIFFE Verifiable Identity Document) that carries the ID in its URI SAN and proves it.

SPIRE is the runtime that implements SPIFFE. It runs as a server (which acts as the CA) and agents (on each node, which attest workloads and deliver SVIDs):

# ClusterSPIFFEID (SPIRE Controller Manager CRD): creates registration entries for matching pods
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: frontend
spec:
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: frontend

SVIDs are short-lived (typically 1 hour) and rotated automatically. The workload receives a new certificate before the old one expires, with zero downtime and no manual intervention. This removes manual certificate management for workloads: no tracking expiry dates, no renewal workflows, no revocation lists. The SPIRE server and its CA key still have to be operated and monitored.

Certificate Formats#

PEM: base64-encoded, delimited by -----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----. The most common format on Linux and in Kubernetes. Multiple certificates (a certificate chain) can be concatenated in a single file.

DER: binary format. Same data as PEM but without base64 encoding. Used in some Java and Windows contexts.

PKCS#12 / PFX: binary format that bundles the certificate, private key, and CA chain into a single password-protected file. Used by Java (imported into keystores) and Windows. Convert from PEM:

openssl pkcs12 -export -out cert.p12 \
  -inkey server.key -in server.crt -certfile intermediate-ca.crt

JKS (Java KeyStore): Java-specific format. Modern Java supports PKCS#12 directly, so prefer that over JKS for new deployments.
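Converting between PEM and DER changes only the encoding, not the certificate itself. The sketch below shows both directions with `openssl x509` and compares fingerprints to confirm nothing else changed; the helper names are hypothetical.

```shell
# pem_to_der PEM_FILE DER_FILE: re-encode a PEM certificate as binary DER.
pem_to_der() {
  openssl x509 -in "$1" -inform PEM -outform DER -out "$2"
}

# der_to_pem DER_FILE PEM_FILE: re-encode a DER certificate as PEM.
der_to_pem() {
  openssl x509 -in "$1" -inform DER -outform PEM -out "$2"
}

# cert_fingerprint FILE FORMAT: SHA-256 fingerprint; FORMAT is PEM or DER.
cert_fingerprint() {
  openssl x509 -in "$1" -inform "$2" -noout -fingerprint -sha256
}
```

For instance, after `pem_to_der server.crt server.der`, the outputs of `cert_fingerprint server.crt PEM` and `cert_fingerprint server.der DER` should be identical.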

Certificate Revocation#

CRL (Certificate Revocation List): the CA publishes a list of revoked certificate serial numbers. Clients download the list periodically and check against it. Problem: the list can be stale between downloads.

OCSP (Online Certificate Status Protocol): the client queries an OCSP responder in real time to check if a specific certificate is revoked. Problem: adds latency to every TLS handshake and the OCSP responder is a single point of failure.

OCSP stapling: the server queries the OCSP responder and includes (staples) the signed response in the TLS handshake. The client gets revocation status without an extra network call. This is the preferred approach for public-facing services.

For internal infrastructure, short-lived certificates are better than revocation. A certificate valid for 24 hours expires faster than you can discover a compromise, investigate, and complete a revocation. Use cert-manager or Vault PKI to automate short-lived issuance.

Monitoring Certificate Expiry#

Prometheus blackbox_exporter probes endpoints and exposes probe_ssl_earliest_cert_expiry as a Unix timestamp:

# Prometheus alert rule
- alert: TLSCertExpiringSoon
  # Keep the value in seconds so humanizeDuration renders it correctly
  expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

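cert-manager exposes certmanager_certificate_expiration_timestamp_seconds for every Certificate resource. Alert when the remaining time drops below the renewBefore window by a clear margin, since by then renewal should already have happened and a lower value means it is failing.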

Manual check with openssl for debugging:

echo | openssl s_client -connect app.internal:443 -servername app.internal 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

Common Gotchas#

Intermediate certificate not included in server response. The server must send the full certificate chain: leaf certificate plus intermediate CA certificate. If only the leaf certificate is sent, clients that do not have the intermediate CA cached will fail verification with errors like “unable to verify the first certificate.” Always configure your web server or ingress to serve the full chain.
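A full-chain file is just the PEM certificates concatenated, leaf first. The following sketch builds one and counts its entries as a quick sanity check; `build_fullchain` and `count_certs` are hypothetical helper names, and it assumes each input file ends with a newline, as files written by openssl do.

```shell
# build_fullchain LEAF_CERT INTERMEDIATE_CERT OUT_FILE
# Concatenates the PEM files leaf-first, the order servers should send them in.
build_fullchain() {
  cat "$1" "$2" > "$3"
}

# count_certs PEM_FILE: prints the number of certificates in a PEM bundle.
count_certs() {
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}
```

With the files from earlier, `build_fullchain server.crt intermediate-ca.crt fullchain.pem` followed by `count_certs fullchain.pem` should print 2. To see what a running server actually sends, add `-showcerts` to the `openssl s_client` command and count the certificates in its output the same way.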

Certificate signed for wrong hostname or missing SANs. If the certificate’s SAN list does not include the exact hostname the client is connecting to, TLS verification fails. Include every DNS name and IP address the service is reachable by. For Kubernetes services, this means both the short name (app) and the fully qualified name (app.production.svc.cluster.local).
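A certificate can be tested against a hostname before deployment with the `-checkhost` option of `openssl x509`, which applies the same kind of name matching a TLS client would. This is a minimal sketch; `cert_matches_host` is a hypothetical helper that relies on the message openssl prints.

```shell
# cert_matches_host CERT_FILE HOSTNAME
# Returns 0 if the certificate is valid for HOSTNAME, 1 otherwise.
cert_matches_host() {
  # openssl prints "Hostname <name> does match certificate" on success and
  # "Hostname <name> does NOT match certificate" otherwise.
  openssl x509 -in "$1" -noout -checkhost "$2" \
    | grep -q 'does match certificate'
}
```

Running it once for every name clients use, for example `cert_matches_host server.crt app.internal` and `cert_matches_host server.crt app`, shows which names are missing from the SAN list.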

Vault PKI max_ttl set too long. If you set max_ttl to 1 year on a Vault PKI role, applications can request certificates valid for up to 1 year, which defeats the purpose of dynamic short-lived issuance. Set max_ttl to 72 hours or less for internal services. If an application needs a longer lifetime, that is a signal to reconsider the architecture, not to extend the TTL.