Advanced Ansible Patterns#
As infrastructure grows from a handful of servers to hundreds or thousands, Ansible patterns that worked at small scale become bottlenecks. Playbooks that were simple and readable at 10 hosts become tangled at 100. Roles that were self-contained become duplicated across teams. This framework helps you decide which advanced patterns to adopt and when.
Roles vs Collections#
Roles and collections both organize Ansible content, but they serve different purposes and operate at different scales.
Roles are the basic unit of reusable Ansible content. A role encapsulates tasks, handlers, templates, files, and variables into a directory structure. Roles live inside a project or are shared via Ansible Galaxy.
Collections are the distribution unit for Ansible content. A collection packages multiple roles, modules, plugins, and playbooks under a namespace. Collections are versioned, installable via ansible-galaxy, and can declare dependencies on other collections.
When to use roles alone#
Use roles when your infrastructure is managed by a single team, roles are used within one or two projects, you have fewer than 20 roles, and you do not write custom modules or plugins. At this scale, the overhead of creating and versioning collections adds complexity without proportional benefit.
```text
# Simple project structure with roles
site.yml
inventory/
  production/
    hosts.yml
    group_vars/
  staging/
    hosts.yml
    group_vars/
roles/
  webserver/
  database/
  monitoring/
  common/
```

When to move to collections#
Move to collections when roles are shared across multiple projects or teams, you write custom modules, plugins, or filters that need distribution, you need semantic versioning and dependency management for your Ansible content, or your organization has more than 50 roles and needs namespace organization.
```text
# Collection structure
namespace/
  infra_platform/
    galaxy.yml
    roles/
      webserver/
      database/
      monitoring/
    plugins/
      modules/
        custom_deploy.py
      filter/
        network_utils.py
      callback/
        custom_logger.py
    playbooks/
      site.yml
```

The galaxy.yml file defines the collection metadata:
```yaml
# galaxy.yml
namespace: myorg
name: infra_platform
version: 2.1.0
description: Infrastructure platform collection
dependencies:
  community.general: ">=6.0.0"
  ansible.posix: ">=1.5.0"
```

Build and install the collection:
```shell
ansible-galaxy collection build
ansible-galaxy collection install myorg-infra_platform-2.1.0.tar.gz
```

Decision summary#
| Factor | Roles | Collections |
|---|---|---|
| Team count | 1 team | Multiple teams |
| Reuse scope | Within project | Across projects/orgs |
| Custom modules/plugins | No | Yes |
| Versioning needs | Git tags sufficient | Semantic versioning required |
| Distribution | Git clone / Galaxy roles | Galaxy collections / Automation Hub |
| Overhead | Low | Medium |
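Once a collection is installed, its content is consumed by fully qualified collection name (FQCN). A sketch of a consuming playbook, where the host group, app name, URL, and token variable are hypothetical values:

```yaml
# site.yml - consuming the installed collection by FQCN
- hosts: webservers
  roles:
    # Role shipped inside the collection
    - myorg.infra_platform.webserver
  tasks:
    - name: Deploy via the collection's custom module
      # Module shipped under the collection's plugins/modules/
      myorg.infra_platform.custom_deploy:
        app_name: shop
        version: "2.1.0"
        api_url: https://deploy.example.internal
        api_token: "{{ deploy_api_token }}"
```

FQCNs avoid name collisions between collections, which is the point of namespace organization at 50+ roles.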
Dynamic Inventory#
Static inventory files list hosts manually. Dynamic inventory queries an external source (cloud provider API, CMDB, service discovery) to generate the inventory at runtime.
When to use static inventory#
Static inventory works when your infrastructure is stable (hosts rarely change), you manage fewer than 50 hosts, and hosts are provisioned manually or through a slow process. Static inventory is simple, auditable, and version-controlled.
When to switch to dynamic inventory#
Switch to dynamic inventory when hosts are provisioned and destroyed dynamically (auto-scaling groups, cloud VMs, containers), you manage more than 50 hosts and manual updates are error-prone, your source of truth for hosts is already a cloud provider, CMDB, or service discovery system, or you need host grouping based on cloud metadata (tags, regions, instance types).
Cloud provider plugins#
```yaml
# aws_ec2.yml - AWS dynamic inventory
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Environment
    prefix: env
    separator: "_"
  - key: instance_type
    prefix: type
  - key: placement.availability_zone
    prefix: az
filters:
  tag:ManagedBy: ansible
  instance-state-name: running
compose:
  ansible_host: private_ip_address
```

This generates groups such as env_production, env_staging, type_t3_large, and az_us_east_1a automatically from EC2 instance tags and metadata.
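The group names are derived by joining the prefix and the metadata value, then sanitizing characters that are invalid in group names. A rough sketch of that derivation (`keyed_group` is an illustrative helper that approximates, but is not, Ansible's internal sanitizer):

```python
import re


def keyed_group(prefix: str, value: str, separator: str = "_") -> str:
    """Approximate keyed_groups naming: join prefix and value with the
    separator, then replace characters invalid in group names with '_'."""
    raw = f"{prefix}{separator}{value}"
    return re.sub(r"[^A-Za-z0-9_]", "_", raw)


print(keyed_group("env", "production"))  # env_production
print(keyed_group("type", "t3.large"))   # type_t3_large
print(keyed_group("az", "us-east-1a"))   # az_us_east_1a
```

This is why an instance type like `t3.large` shows up as group `type_t3_large`: the dot is not a valid group-name character.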
```yaml
# azure_rm.yml - Azure dynamic inventory
plugin: azure.azcollection.azure_rm
auth_source: auto
keyed_groups:
  - key: tags.environment | default('untagged')
    prefix: env
  - key: location
    prefix: region
  - key: resource_group
    prefix: rg
conditional_groups:
  webservers: "'web' in tags.role"
  databases: "'db' in tags.role"
```

Custom inventory plugins#
When your source of truth is a CMDB, internal API, or custom database, write an inventory plugin:
```python
# plugins/inventory/cmdb_inventory.py
import requests

from ansible.plugins.inventory import BaseInventoryPlugin


class InventoryModule(BaseInventoryPlugin):

    NAME = 'myorg.infra.cmdb_inventory'

    def parse(self, inventory, loader, path, cache=True):
        super().parse(inventory, loader, path, cache)
        config = self._read_config_data(path)
        hosts = self._query_cmdb(config.get('cmdb_url'))
        for host in hosts:
            self.inventory.add_host(host['hostname'])
            self.inventory.set_variable(host['hostname'], 'ansible_host', host['ip'])
            for group in host.get('groups', []):
                self.inventory.add_group(group)
                self.inventory.add_child(group, host['hostname'])

    def _query_cmdb(self, url):
        # Fetch the host list from the CMDB; expects a JSON array of
        # objects with 'hostname', 'ip', and optional 'groups' keys.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
```

Decision summary#
| Factor | Static | Dynamic |
|---|---|---|
| Host count | < 50 | > 50 |
| Host lifecycle | Stable | Dynamic (auto-scaling) |
| Source of truth | Inventory files | Cloud API / CMDB |
| Audit trail | Git history | External system’s audit |
| Setup complexity | None | Medium |
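To activate a custom inventory plugin like the CMDB example above, Ansible needs a YAML inventory source that names it; a sketch, where the filename and URL are placeholders:

```yaml
# inventory/cmdb.yml - inventory source that selects the custom plugin
plugin: myorg.infra.cmdb_inventory
cmdb_url: https://cmdb.example.internal/api/hosts
```

Because third-party inventory plugins are not auto-enabled, the plugin name must also be listed under `enable_plugins` in the `[inventory]` section of ansible.cfg.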
Vault Encryption#
Ansible Vault encrypts sensitive data (passwords, API keys, certificates) so it can be stored in version control alongside playbooks.
Encrypting individual variables vs entire files#
Individual variables (encrypt_string) embed encrypted values inline in YAML files:
```shell
ansible-vault encrypt_string 'SuperSecretPassword' --name 'db_password'
```

This produces:

```yaml
db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  62313365396662343061393464336163383764316462...
```

Entire-file encryption encrypts the full file:

```shell
ansible-vault encrypt group_vars/production/secrets.yml
```

When to use each approach#
Use encrypt_string when you have a few secrets mixed with non-secret variables in the same file and you want non-secret values to remain readable in code review. Use entire file encryption when a file contains mostly secrets, you want a clear separation between secret and non-secret files, or you use a vault password file per environment.
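A common convention that combines the two approaches (a layering pattern, not an Ansible requirement) keeps a plaintext vars.yml next to an encrypted vault.yml, so variable names stay greppable while values stay secret:

```yaml
# group_vars/production/vars.yml (plaintext, readable in review)
db_user: app
db_password: "{{ vault_db_password }}"

# group_vars/production/vault.yml (encrypted with ansible-vault encrypt)
vault_db_password: SuperSecretPassword
```

Searching for `db_password` then finds the plaintext reference even though the value itself is encrypted.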
Multi-vault strategy#
For environments with different access levels, use multiple vault IDs:
```shell
# Encrypt with environment-specific vault IDs
ansible-vault encrypt --vault-id dev@prompt group_vars/dev/secrets.yml
ansible-vault encrypt --vault-id prod@/path/to/prod-vault-pass group_vars/prod/secrets.yml

# Run playbook with multiple vault passwords
ansible-playbook site.yml --vault-id dev@prompt --vault-id prod@/path/to/prod-vault-pass
```

This ensures that developers who know the dev vault password cannot decrypt production secrets. In CI/CD, each environment's vault password is stored as a separate pipeline secret.
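In a pipeline, that usually means materializing the secret into a temporary password file before invoking ansible-playbook. A hedged sketch, where the `VAULT_PASSWORD_PROD` variable name is an assumption about how the pipeline exposes the secret:

```shell
# Hypothetical CI step: write the vault password from a pipeline secret
# (exposed here as the VAULT_PASSWORD_PROD env var) to a file ansible can read.
set -eu
# In CI this comes from a pipeline secret; default only for local demonstration.
: "${VAULT_PASSWORD_PROD:=example-password}"
VAULT_PASS_FILE="$(mktemp)"
printf '%s' "${VAULT_PASSWORD_PROD}" > "${VAULT_PASS_FILE}"
chmod 600 "${VAULT_PASS_FILE}"
# ansible-playbook site.yml --vault-id "prod@${VAULT_PASS_FILE}"
```

Using `printf '%s'` rather than `echo` avoids appending a trailing newline, which would otherwise become part of the vault password.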
Decision summary for vault strategy#
| Scenario | Approach |
|---|---|
| Small team, one environment | Single vault password, encrypt_string |
| Multiple environments | Vault ID per environment, file encryption |
| CI/CD integration | Vault password files from pipeline secrets |
| Rotation required | Use external secrets manager (HashiCorp Vault, AWS Secrets Manager) with lookup plugins instead of Ansible Vault |
Callback Plugins#
Callback plugins customize Ansible’s output and reporting. They hook into events like task start, task completion, play start, and play end.
Built-in callbacks worth enabling#
```ini
# ansible.cfg
[defaults]
callbacks_enabled = ansible.posix.profile_tasks, ansible.posix.timer

[callback_profile_tasks]
task_output_limit = 20
sort_order = descending
```

profile_tasks shows execution time per task, making it easy to identify slow tasks. timer shows the total playbook execution time. Both are essential for optimizing playbook performance.
Custom callback for notifications#
```python
# plugins/callback/slack_notify.py
import os

import requests

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'slack_notify'

    def __init__(self):
        super().__init__()
        # Read the webhook URL from the environment so it stays out of the repo
        self.webhook_url = os.environ.get('SLACK_WEBHOOK_URL')

    def v2_playbook_on_stats(self, stats):
        hosts = sorted(stats.processed.keys())
        failures = any(stats.summarize(h).get('failures', 0) > 0 for h in hosts)
        message = f"Playbook {'FAILED' if failures else 'completed'}: {len(hosts)} hosts"
        if self.webhook_url:
            requests.post(self.webhook_url, json={"text": message}, timeout=10)
```

When to write custom callbacks#
Write custom callbacks when you need to integrate with external notification systems (Slack, PagerDuty, custom dashboards), you want structured logging in a specific format (JSON for log aggregation), or you need to track playbook execution metrics (duration, success rate, change rate) for compliance or auditing.
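For a local callback such as the slack_notify example above to load, ansible.cfg must point at its directory and enable it by name; notification-type callbacks are never loaded implicitly. The path below is an assumption matching the plugin layout shown earlier:

```ini
# ansible.cfg
[defaults]
# Directory containing slack_notify.py
callback_plugins = ./plugins/callback
# Notification callbacks must be enabled explicitly
callbacks_enabled = slack_notify
```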
Custom Modules#
Custom modules extend Ansible’s capabilities when no existing module handles your use case.
When to write a custom module#
Write a custom module when you interact with an internal API that has no community module, you need idempotent management of a resource that the command or shell modules cannot provide, or an existing module does not expose the specific parameters you need.
```python
# plugins/modules/custom_deploy.py
from ansible.module_utils.basic import AnsibleModule
import requests


def main():
    module = AnsibleModule(
        argument_spec=dict(
            app_name=dict(type='str', required=True),
            version=dict(type='str', required=True),
            api_url=dict(type='str', required=True),
            api_token=dict(type='str', required=True, no_log=True),
        ),
        supports_check_mode=True,
    )

    headers = {"Authorization": f"Bearer {module.params['api_token']}"}
    base_url = f"{module.params['api_url']}/apps/{module.params['app_name']}"

    # Query current state so the module can report changed accurately
    current = requests.get(base_url, headers=headers, timeout=30).json()

    if current.get('version') == module.params['version']:
        module.exit_json(changed=False, msg="Already at target version")

    if module.check_mode:
        module.exit_json(changed=True, msg="Would deploy new version")

    response = requests.post(
        f"{base_url}/deploy",
        headers=headers,
        json={"version": module.params['version']},
        timeout=30,
    )
    if response.status_code != 200:
        module.fail_json(msg=f"Deploy failed: {response.text}")

    module.exit_json(changed=True, version=module.params['version'])


if __name__ == '__main__':
    main()
```

Key requirements for custom modules: support check mode so --check works, use no_log=True for sensitive parameters, report changed=True/False accurately for idempotency, and use module.fail_json() for errors.
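The idempotency logic at the heart of such a module can be isolated into a pure function for unit testing without an Ansible runtime. A sketch (`deploy_decision` is a hypothetical helper, not part of the module above):

```python
def deploy_decision(current_version, target_version, check_mode):
    """Return (changed, should_deploy) for the module's three outcomes."""
    if current_version == target_version:
        return (False, False)  # already converged: report no change, do nothing
    if check_mode:
        return (True, False)   # would change, but never act under --check
    return (True, True)        # report change and actually deploy


print(deploy_decision("1.0", "1.0", False))  # (False, False)
print(deploy_decision("1.0", "2.0", True))   # (True, False)
print(deploy_decision("1.0", "2.0", False))  # (True, True)
```

Keeping the decision separate from the HTTP calls makes the changed/check-mode contract trivially testable.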
Testing with Molecule#
Molecule is the standard testing framework for Ansible roles. It creates ephemeral instances (Docker containers, cloud VMs), applies the role, and runs verification tests.
When to invest in Molecule testing#
Invest in Molecule when roles are shared across teams or projects, roles manage critical infrastructure (databases, security configurations), you have a CI/CD pipeline for Ansible content, or you are building collections for distribution.
Skip Molecule when roles are simple and used by one team, the cost of maintaining test infrastructure exceeds the cost of occasional bugs, or roles are throwaway (one-time migration playbooks).
Basic Molecule setup#
```yaml
# molecule/default/molecule.yml
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: ubuntu-test
    image: ubuntu:22.04
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
  - name: rocky-test
    image: rockylinux:9
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
provisioner:
  name: ansible
verifier:
  name: ansible
```

```yaml
# molecule/default/converge.yml
- name: Converge
  hosts: all
  roles:
    - role: webserver
      vars:
        webserver_port: 8080
```

```yaml
# molecule/default/verify.yml
- name: Verify
  hosts: all
  tasks:
    - name: Check nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present
      check_mode: true
      register: nginx_installed
      failed_when: nginx_installed.changed

    - name: Check nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
      check_mode: true
      register: nginx_running
      failed_when: nginx_running.changed

    - name: Check nginx is listening
      ansible.builtin.wait_for:
        port: 8080
        timeout: 5
```

Run the test lifecycle:
```shell
molecule create     # Create test instances
molecule converge   # Apply the role
molecule verify     # Run verification tests
molecule destroy    # Clean up
molecule test       # Run the full lifecycle (create, converge, verify, destroy)
```

CI Integration#
When to add CI for Ansible#
Add CI when multiple people contribute to the Ansible codebase, changes are deployed to production environments, or compliance requires an audit trail of configuration changes.
Pipeline structure#
```yaml
# .github/workflows/ansible-ci.yml
name: Ansible CI

on:
  pull_request:
    paths: ['roles/**', 'playbooks/**', 'collections/**']
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run ansible-lint
        uses: ansible/ansible-lint@main

  molecule:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        role: [webserver, database, monitoring]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install ansible molecule "molecule-plugins[docker]"
      - name: Run Molecule tests
        run: molecule test
        working-directory: roles/${{ matrix.role }}

  deploy-staging:
    runs-on: ubuntu-latest
    needs: molecule
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Run playbook against staging
        run: |
          printf '%s' "$VAULT_PASSWORD" > .vault-pass
          ansible-playbook -i inventory/staging site.yml --diff --check --vault-password-file .vault-pass
          ansible-playbook -i inventory/staging site.yml --diff --vault-password-file .vault-pass
        env:
          VAULT_PASSWORD: ${{ secrets.VAULT_PASSWORD_STAGING }}
```

The pipeline follows a progression: lint runs first (fast, catches syntax and style issues), then Molecule tests (slower, catch functional issues), then the staging deployment (only on merge to main, which is why the workflow also triggers on push to main). Production deployment should require manual approval via a separate workflow or deployment tool.
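One way to express that manual gate in the same workflow is a GitHub Actions environment whose required reviewers are configured in the repository settings; a sketch of a hypothetical job:

```yaml
  deploy-production:
    runs-on: ubuntu-latest
    needs: molecule
    # The job pauses until reviewers configured on the
    # "production" environment approve the deployment.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Run playbook against production
        run: ansible-playbook -i inventory/production site.yml --diff
```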
Common Mistakes#
Using collections too early. Creating a collection for three roles used by one team adds build, versioning, and distribution overhead without benefit. Start with roles in a project directory. Move to collections when the sharing and versioning needs justify it.
Dynamic inventory without caching. Every ansible-playbook run queries the cloud API for inventory. For large inventories, this adds seconds or minutes to every run. Enable inventory caching:
```ini
[inventory]
cache = true
cache_plugin = ansible.builtin.jsonfile
cache_connection = /tmp/ansible-inventory-cache
cache_timeout = 300
```

Testing the Ansible module, not the result. Molecule verify tasks that check "did Ansible run this task" are tautological. Instead, verify the observable outcome: is the port open, does the service respond, is the file present with the correct content.
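For example, a verify task that probes behavior rather than re-running the install (the URL and port are placeholders):

```yaml
- name: Service responds on its health endpoint
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
```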
Vault password in the repository. The vault password file must never be committed. Add it to .gitignore and distribute it through a secrets manager or CI/CD pipeline secrets. The encrypted files themselves are safe to commit – that is the point of vault encryption.
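A minimal guard against committing it (the filenames here are assumptions; match them to whatever your vault password files are actually called):

```text
# .gitignore
.vault-pass
.vault-pass-*
*-vault-password
```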