How DNS Resolution Works

When a client requests api.example.com, resolution follows a chain of queries. The client asks its configured recursive resolver (often the ISP’s, or a public one like 8.8.8.8). The recursive resolver does the heavy lifting: it asks a root name server, which refers it to the .com TLD servers; the TLD servers refer it to the authoritative name servers for example.com; and an authoritative server returns the answer for api.example.com. The resolver caches each response according to the record’s TTL, so subsequent requests short-circuit the chain.

The full resolution path:

Client → Recursive Resolver → Root (.)
                             → TLD (.com)
                             → Authoritative (example.com)
                             → Answer: api.example.com = 203.0.113.10

Understanding this chain matters for debugging. If a DNS change is not propagating, the issue is almost always caching somewhere along this path, most often at the recursive resolver or on the client.
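As a rough illustration of the chain, the sketch below lists the names a resolver works through for a hostname, from the root down. delegation_chain is a hypothetical helper written for this article, not part of any DNS tool, and it assumes every label boundary is a delegation point, which is not always true in practice (the last line is simply the name being asked for).

```shell
# Print the zones a recursive resolver walks for a hostname, root first.
# delegation_chain is a hypothetical helper for illustration only.
delegation_chain() (
  name=${1%.}                      # drop a trailing dot if present
  zone=""
  echo "."                         # the root zone comes first
  # Peel labels off the right-hand side: com, then example.com, ...
  while [ -n "$name" ]; do
    label=${name##*.}              # rightmost label
    zone="$label.$zone"
    echo "$zone"
    case $name in
      *.*) name=${name%.*} ;;      # more labels remain
      *)   name="" ;;              # that was the last label
    esac
  done
)

delegation_chain api.example.com
```

Feeding each line to a command such as dig +short NS shows which servers answer at that level, provided a zone cut actually exists there.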

Record Types

A – Maps a hostname to an IPv4 address. The most common record type.

AAAA – Maps a hostname to an IPv6 address. Same as A but for the longer address format.

CNAME – Canonical Name. Points one hostname to another. www.example.com CNAME example.com means “look up whatever example.com resolves to.” Cannot be used at the zone apex (the bare domain example.com).

MX – Mail Exchange. Specifies the mail servers for a domain with priority values. Lower priority number is preferred: example.com MX 10 mail1.example.com is tried before example.com MX 20 mail2.example.com.

TXT – Arbitrary text data. Used for email authentication (SPF, DKIM, DMARC), domain verification (proving you own a domain to Google/AWS/etc.), and Let’s Encrypt DNS-01 challenges.

SRV – Service locator. Specifies host, port, priority, and weight for a service. Used by some protocols (LDAP, SIP) and service discovery systems.

NS – Name Server. Declares which DNS servers are authoritative for a domain. Typically set at the registrar to delegate to your DNS provider.

SOA – Start of Authority. Contains metadata about the zone: primary name server, admin email, serial number, and refresh/retry/expire timers.

PTR – Pointer. Maps an IP address back to a hostname (reverse DNS). Used for email server verification and logging.

CAA – Certificate Authority Authorization. Specifies which CAs are allowed to issue certificates for your domain: example.com CAA 0 issue "letsencrypt.org".
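To make the MX priority rule concrete, here is a minimal sketch that orders mail hosts the way a sending server would try them. It assumes input in the "<preference> <host>" shape that dig +short MX prints; mx_order is a hypothetical helper name, and the sample data comes from the MX example above so no network access is needed.

```shell
# Order MX targets by preference, lowest value first.
# Input lines: "<preference> <mail host>"
mx_order() {
  sort -n -k1,1 | awk '{ print $2 }'
}

# Sample data only; with network access this could instead be
#   dig +short MX example.com | mx_order
printf '%s\n' '20 mail2.example.com.' '10 mail1.example.com.' | mx_order
```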

CNAME vs A Records

A CNAME adds a lookup hop – the resolver must first resolve the CNAME itself, then resolve the name it points to. For high-traffic endpoints, this extra hop can add latency. More importantly, CNAMEs cannot exist at the zone apex (example.com itself, without any subdomain). The DNS specification forbids CNAME records from coexisting with other record types, and the apex must have SOA and NS records.

Cloud DNS providers work around this with provider-specific record types: Route53 has ALIAS records, Cloudflare has CNAME flattening, and Azure DNS has ALIAS records. These resolve at query time and return A records to the client, avoiding the CNAME limitation.
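The coexistence rule can be checked mechanically. The sketch below flags any name that holds a CNAME together with another record. It assumes a simplified "<name> <type> <value>" line format invented for this example, not real zone-file syntax, and cname_conflicts is a hypothetical helper.

```shell
# Flag names that hold a CNAME alongside any other record.
# A second CNAME on the same name is flagged too, which is also invalid.
cname_conflicts() {
  awk '
    { types[$1] = types[$1] " " $2 }     # collect the types seen per name
    $2 == "CNAME" { has_cname[$1] = 1 }
    END {
      for (n in has_cname) {
        count = split(types[n], seen, " ")
        if (count > 1) print n ":" types[n]
      }
    }'
}

# The apex always carries SOA and NS, so a CNAME there conflicts.
printf '%s\n' \
  'example.com. SOA ns1.example.com.' \
  'example.com. NS ns1.example.com.' \
  'example.com. CNAME something.cdn.com.' \
  'www.example.com. CNAME example.com.' | cname_conflicts
```

Only the apex is reported here; www.example.com holds nothing but its CNAME, so it passes.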

TTL Strategy

TTL (Time to Live) controls how long resolvers cache a DNS record, in seconds.

  • High TTL (3600-86400 seconds / 1-24 hours): Fewer queries to your authoritative servers, faster resolution for clients. Use for stable records that rarely change.
  • Low TTL (60-300 seconds / 1-5 minutes): Changes propagate faster. Use before planned migrations or failovers.

The practical workflow for DNS changes:

1. Current state: api.example.com A 203.0.113.10 TTL 3600

2. Lower the TTL (at least one full old-TTL period before the change):
   api.example.com A 203.0.113.10 TTL 60

3. Wait for the old TTL to expire (1 hour in this case)

4. Make the actual change:
   api.example.com A 198.51.100.20 TTL 60

5. After confirming the new target is working, raise the TTL:
   api.example.com A 198.51.100.20 TTL 3600

Skipping steps 2 and 3 means some resolvers will keep serving the old record for up to the original TTL (an hour in this example), and users will hit the old server for that duration.
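The waiting period in step 3 is simple arithmetic, sketched here under the assumption that times are tracked as Unix epoch seconds. Both function names are hypothetical.

```shell
# Earliest safe moment for step 4: when the TTL was lowered plus one full
# old TTL. Arguments: <epoch when TTL was lowered> <old TTL in seconds>
earliest_cutover() (
  echo $(( $1 + $2 ))
)

# Seconds still to wait, never negative.
# Arguments: <epoch when TTL was lowered> <old TTL> <current epoch>
seconds_until_cutover() (
  remaining=$(( $1 + $2 - $3 ))
  [ "$remaining" -gt 0 ] || remaining=0
  echo "$remaining"
)

# TTL lowered at 1700000000 with an old TTL of 3600, checked 10 minutes later.
earliest_cutover 1700000000 3600                   # prints 1700003600
seconds_until_cutover 1700000000 3600 1700000600   # prints 3000
```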

DNS in Kubernetes

Kubernetes runs CoreDNS as the cluster DNS server. Every pod gets a /etc/resolv.conf that points to the CoreDNS service, typically at 10.96.0.10.

Service DNS names follow the pattern <service>.<namespace>.svc.cluster.local. A service called api in namespace production is reachable at api.production.svc.cluster.local.

Kubernetes sets ndots:5 in pod resolv.conf by default:

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

The ndots:5 setting means any name with fewer than 5 dots is treated as a relative name and gets the search domains appended before trying the absolute name. When a pod looks up api.stripe.com (2 dots, less than 5), it generates these queries in order:

api.stripe.com.default.svc.cluster.local  (fails)
api.stripe.com.svc.cluster.local          (fails)
api.stripe.com.cluster.local              (fails)
api.stripe.com.                           (succeeds)

That is three wasted DNS queries for every external domain lookup. For pods that make heavy external API calls, set ndots:2 in the pod spec:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

Alternatively, append a trailing dot to external hostnames in your application configuration (api.stripe.com. instead of api.stripe.com) to force absolute lookups.
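The effect of ndots and the search list can be reproduced offline. The following sketch prints the candidate names in the order a stub resolver would try them, stopping in practice at the first one that resolves. query_order is a hypothetical helper that mimics the behaviour described above; it does not call any resolver library.

```shell
# List the names tried for a lookup, in order.
# Usage: query_order <name> <ndots> <search domain>...
query_order() (
  name=$1; ndots=$2; shift 2
  case $name in
    *.) echo "$name"; exit 0 ;;     # trailing dot: absolute, no search list
  esac
  # Count the dots by deleting every character that is not a dot.
  dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."                   # enough dots: try the name as-is first
    for d in "$@"; do echo "$name.$d"; done
  else
    for d in "$@"; do echo "$name.$d"; done
    echo "$name."                   # absolute name is the last resort
  fi
)

query_order api.stripe.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Running it with ndots set to 2, or with api.stripe.com. as the name, puts the absolute name first, which is why both fixes remove the wasted queries.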

Cloud DNS Services

AWS Route53 provides hosted zones with advanced routing policies:

  • Simple: standard DNS resolution, one record per name.
  • Weighted: distribute traffic by percentage (90% to us-east-1, 10% to us-west-2). Useful for canary deployments and gradual migrations.
  • Latency-based: route users to the region with lowest latency. Health checks are optional but recommended, so that an unhealthy region is skipped.
  • Failover: active-passive. Route to the primary, switch to secondary if health check fails.
  • Geolocation: route based on the user’s geographic location. Different answer for US vs Europe.

# Create a hosted zone
aws route53 create-hosted-zone --name example.com --caller-reference "$(date +%s)"

# List records
aws route53 list-resource-record-sets --hosted-zone-id Z1234567890

# Create a simple A record
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
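For the weighted policy, each record additionally carries a SetIdentifier and a Weight, and traffic is split in proportion to each weight over the sum of all weights. The sketch below generates such a change batch. weighted_batch is a hypothetical helper, and the identifiers, the TTL of 60, and the file name are placeholders chosen for illustration.

```shell
# Print a Route53 change batch for one weighted A record.
# Usage: weighted_batch <name> <set identifier> <weight> <IPv4 address>
weighted_batch() {
  cat <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "$1",
      "Type": "A",
      "SetIdentifier": "$2",
      "Weight": $3,
      "TTL": 60,
      "ResourceRecords": [{"Value": "$4"}]
    }
  }]
}
EOF
}

# Weights 90 and 10 give the 90% / 10% split described above.
weighted_batch api.example.com us-east-1 90 203.0.113.10
weighted_batch api.example.com us-west-2 10 198.51.100.20
```

Each batch could be saved to a file and applied with aws route53 change-resource-record-sets using --change-batch file://east.json, one call per record.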

GCP Cloud DNS and Azure DNS offer equivalent standard record management through their own APIs.

DNSSEC

DNSSEC adds cryptographic signatures to DNS responses, allowing resolvers to verify that answers have not been tampered with in transit. It protects against cache poisoning attacks where an attacker injects false records into a resolver’s cache.

For most internal services and applications, DNSSEC is not necessary – traffic goes over TLS anyway, which provides its own authentication. DNSSEC matters for domains serving email (protects MX lookups), financial services, and high-value public domains. Route53 supports DNSSEC signing for public hosted zones.
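One way to see whether an answer was validated is the ad (Authenticated Data) flag, which a validating resolver sets in its response header. The sketch below checks for that flag in dig output read from stdin. has_ad_flag is a hypothetical helper, and the sample header line is illustrative rather than captured from a real query.

```shell
# Succeed if the dig header on stdin lists the "ad" flag.
has_ad_flag() {
  # Keep only the flag list between "flags:" and the first semicolon.
  flags=$(sed -n 's/^;; flags:\([^;]*\);.*/\1/p')
  case " $flags " in
    *" ad "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Illustrative header line; with network access this could instead be
#   dig +dnssec example.com | has_ad_flag
printf '%s\n' ';; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1' \
  | has_ad_flag && echo "answer was validated"
```

A missing ad flag does not prove tampering. It may only mean the resolver does not validate or the zone is unsigned.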

DNS Troubleshooting

dig is the most powerful DNS debugging tool. It shows the full query and response with authority and additional sections.

# Basic lookup
dig api.example.com

# Query a specific DNS server
dig @8.8.8.8 api.example.com

# Show only the answer (scripting)
dig +short api.example.com

# Trace the full resolution chain (root -> TLD -> authoritative)
dig +trace api.example.com

# Query specific record types
dig MX example.com
dig TXT example.com
dig NS example.com

# Check a CNAME chain
dig +trace www.example.com CNAME

# Query from inside a Kubernetes pod
kubectl exec -it debug-pod -- dig api.production.svc.cluster.local

Reading dig output: the ANSWER SECTION contains the resolved records. The AUTHORITY SECTION shows which name servers are authoritative. status: NXDOMAIN means the name does not exist. status: SERVFAIL means the resolver could not complete the lookup, for example because the authoritative servers did not respond or DNSSEC validation failed. The Query time shows how long the lookup took – a cached response from a nearby resolver typically takes a few milliseconds or less, while a full resolution chain is typically 50-200ms.
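For scripting, the status can be pulled out of the header line. This is a minimal sketch that assumes the usual ";; ->>HEADER<<-" line format; dig_status and explain_status are hypothetical helpers, and the hints are suggestions rather than a complete diagnosis.

```shell
# Extract the response code from dig output on stdin, e.g.
#   ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4242
dig_status() {
  sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p'
}

# Map a status to a first debugging step.
explain_status() {
  case $1 in
    NOERROR)  echo "query succeeded; check the ANSWER SECTION" ;;
    NXDOMAIN) echo "name does not exist; check spelling and the zone" ;;
    SERVFAIL) echo "resolver could not get an answer; try dig +trace" ;;
    *)        echo "unexpected status: $1" ;;
  esac
}

# Illustrative header line; with network access use: dig api.example.com | dig_status
status=$(printf '%s\n' ';; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4242' | dig_status)
explain_status "$status"
```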

# nslookup for quick checks
nslookup api.example.com
nslookup -type=MX example.com

# host for concise output
host api.example.com
host -t CNAME www.example.com

Common Gotchas

TTL caching delay: after changing a DNS record, the old record remains cached at resolvers worldwide for up to the previous TTL value. There is no way to force a global cache flush. Always lower TTL before making changes and wait for the old TTL to expire.

ndots:5 in Kubernetes: as described above, this causes three unnecessary DNS queries for every external lookup with the default search list, and more if the node contributes extra search domains. High-traffic services making thousands of external API calls per second can generate enough DNS query volume to overload CoreDNS. Monitor CoreDNS metrics (coredns_dns_requests_total) and set ndots:2 on affected pods.

CNAME at zone apex: you cannot create example.com CNAME something.cdn.com. Standard DNS does not allow it. Use your DNS provider’s ALIAS or ANAME record type, or use an A record pointing directly to the IP. If you are using a CDN or load balancer that changes IPs, the ALIAS record is the correct solution.