Advanced Ansible Patterns#
As infrastructure grows from a handful of servers to hundreds or thousands, Ansible patterns that worked at small scale become bottlenecks. Playbooks that were simple and readable at 10 hosts become tangled at 100. Roles that were self-contained become duplicated across teams. This framework helps you decide which advanced patterns to adopt and when.
Roles vs Collections#
Roles and collections both organize Ansible content, but they serve different purposes and operate at different scales.
Roles are the basic unit of reusable Ansible content. A role encapsulates tasks, handlers, templates, files, and variables into a directory structure. Roles live inside a project or are shared via Ansible Galaxy.
Collections are the distribution unit for Ansible content. A collection packages multiple roles, modules, plugins, and playbooks under a namespace. Collections are versioned, installable via ansible-galaxy, and can declare dependencies on other collections.
When to use roles alone#
Use roles when your infrastructure is managed by a single team, roles are used within one or two projects, you have fewer than 20 roles, and you do not write custom modules or plugins. At this scale, the overhead of creating and versioning collections adds complexity without proportional benefit.
```text
# Simple project structure with roles
site.yml
inventory/
  production/
    hosts.yml
    group_vars/
  staging/
    hosts.yml
    group_vars/
roles/
  webserver/
  database/
  monitoring/
  common/
```

When to move to collections#
Move to collections when roles are shared across multiple projects or teams, you write custom modules, plugins, or filters that need distribution, you need semantic versioning and dependency management for your Ansible content, or your organization has more than 50 roles and needs namespace organization.
```text
# Collection structure
namespace/
  infra_platform/
    galaxy.yml
    roles/
      webserver/
      database/
      monitoring/
    plugins/
      modules/
        custom_deploy.py
      filter/
        network_utils.py
      callback/
        custom_logger.py
    playbooks/
      site.yml
```

The galaxy.yml file defines the collection metadata:
```yaml
# galaxy.yml
namespace: myorg
name: infra_platform
version: 2.1.0
description: Infrastructure platform collection
dependencies:
  community.general: ">=6.0.0"
  ansible.posix: ">=1.5.0"
```

Build and install the collection:
```shell
ansible-galaxy collection build
ansible-galaxy collection install myorg-infra_platform-2.1.0.tar.gz
```

Decision summary#
| Factor | Roles | Collections |
|---|---|---|
| Team count | 1 team | Multiple teams |
| Reuse scope | Within project | Across projects/orgs |
| Custom modules/plugins | No | Yes |
| Versioning needs | Git tags sufficient | Semantic versioning required |
| Distribution | Git clone / Galaxy roles | Galaxy collections / Automation Hub |
| Overhead | Low | Medium |
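Once a collection is installed, its content is consumed by fully qualified collection name (FQCN). A sketch of a consuming playbook, where the host group, app name, URL, and token variable are hypothetical values:

```yaml
# site.yml - consuming the installed collection by FQCN
- hosts: webservers
  roles:
    # Role shipped inside the collection
    - myorg.infra_platform.webserver
  tasks:
    - name: Deploy via the collection's custom module
      # Module shipped under the collection's plugins/modules/
      myorg.infra_platform.custom_deploy:
        app_name: shop
        version: "2.1.0"
        api_url: https://deploy.example.internal
        api_token: "{{ deploy_api_token }}"
```

FQCNs avoid name collisions between collections, which is the point of namespace organization at 50+ roles.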
Dynamic Inventory#
Static inventory files list hosts manually. Dynamic inventory queries an external source (cloud provider API, CMDB, service discovery) to generate the inventory at runtime.
When to use static inventory#
Static inventory works when your infrastructure is stable (hosts rarely change), you manage fewer than 50 hosts, and hosts are provisioned manually or through a slow process. Static inventory is simple, auditable, and version-controlled.
When to switch to dynamic inventory#
Switch to dynamic inventory when hosts are provisioned and destroyed dynamically (auto-scaling groups, cloud VMs, containers), you manage more than 50 hosts and manual updates are error-prone, your source of truth for hosts is already a cloud provider, CMDB, or service discovery system, or you need host grouping based on cloud metadata (tags, regions, instance types).
Cloud provider plugins#
```yaml
# aws_ec2.yml - AWS dynamic inventory
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Environment
    prefix: env
    separator: "_"
  - key: instance_type
    prefix: type
  - key: placement.availability_zone
    prefix: az
filters:
  tag:ManagedBy: ansible
  instance-state-name: running
compose:
  ansible_host: private_ip_address
```

This generates groups such as env_production, env_staging, type_t3_large, and az_us_east_1a automatically from EC2 instance tags and metadata.
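The group names are derived by joining the prefix and the metadata value, then sanitizing characters that are invalid in group names. A rough sketch of that derivation (`keyed_group` is an illustrative helper that approximates, but is not, Ansible's internal sanitizer):

```python
import re


def keyed_group(prefix: str, value: str, separator: str = "_") -> str:
    """Approximate keyed_groups naming: join prefix and value with the
    separator, then replace characters invalid in group names with '_'."""
    raw = f"{prefix}{separator}{value}"
    return re.sub(r"[^A-Za-z0-9_]", "_", raw)


print(keyed_group("env", "production"))  # env_production
print(keyed_group("type", "t3.large"))   # type_t3_large
print(keyed_group("az", "us-east-1a"))   # az_us_east_1a
```

This is why an instance type like `t3.large` shows up as group `type_t3_large`: the dot is not a valid group-name character.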
```yaml
# azure_rm.yml - Azure dynamic inventory
plugin: azure.azcollection.azure_rm
auth_source: auto
keyed_groups:
  - key: tags.environment | default('untagged')
    prefix: env
  - key: location
    prefix: region
  - key: resource_group
    prefix: rg
conditional_groups:
  webservers: "'web' in tags.role"
  databases: "'db' in tags.role"
```

Custom inventory plugins#
When your source of truth is a CMDB, internal API, or custom database, write an inventory plugin:
```python
# plugins/inventory/cmdb_inventory.py
import requests

from ansible.plugins.inventory import BaseInventoryPlugin


class InventoryModule(BaseInventoryPlugin):

    NAME = 'myorg.infra.cmdb_inventory'

    def parse(self, inventory, loader, path, cache=True):
        super().parse(inventory, loader, path, cache)
        config = self._read_config_data(path)
        hosts = self._query_cmdb(config.get('cmdb_url'))
        for host in hosts:
            self.inventory.add_host(host['hostname'])
            self.inventory.set_variable(host['hostname'], 'ansible_host', host['ip'])
            for group in host.get('groups', []):
                self.inventory.add_group(group)
                self.inventory.add_child(group, host['hostname'])

    def _query_cmdb(self, url):
        # Fetch the host list from the CMDB; expects a JSON array of
        # objects with 'hostname', 'ip', and optional 'groups' keys.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
```

Decision summary#
| Factor | Static | Dynamic |
|---|---|---|
| Host count | < 50 | > 50 |
| Host lifecycle | Stable | Dynamic (auto-scaling) |
| Source of truth | Inventory files | Cloud API / CMDB |
| Audit trail | Git history | External system’s audit |
| Setup complexity | None | Medium |
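To activate a custom inventory plugin like the CMDB example above, Ansible needs a YAML inventory source that names it; a sketch, where the filename and URL are placeholders:

```yaml
# inventory/cmdb.yml - inventory source that selects the custom plugin
plugin: myorg.infra.cmdb_inventory
cmdb_url: https://cmdb.example.internal/api/hosts
```

Because third-party inventory plugins are not auto-enabled, the plugin name must also be listed under `enable_plugins` in the `[inventory]` section of ansible.cfg.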
Vault Encryption#
Ansible Vault encrypts sensitive data (passwords, API keys, certificates) so it can be stored in version control alongside playbooks.
Encrypting individual variables vs entire files#
Individual variables (encrypt_string) embed encrypted values inline in YAML files:
```shell
ansible-vault encrypt_string 'SuperSecretPassword' --name 'db_password'
```

This produces:

```yaml
db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  62313365396662343061393464336163383764316462...
```

Entire-file encryption encrypts the full file:

```shell
ansible-vault encrypt group_vars/production/secrets.yml
```

When to use each approach#
Use encrypt_string when you have a few secrets mixed with non-secret variables in the same file and you want non-secret values to remain readable in code review. Use entire file encryption when a file contains mostly secrets, you want a clear separation between secret and non-secret files, or you use a vault password file per environment.
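A common convention that combines the two approaches (a layering pattern, not an Ansible requirement) keeps a plaintext vars.yml next to an encrypted vault.yml, so variable names stay greppable while values stay secret:

```yaml
# group_vars/production/vars.yml (plaintext, readable in review)
db_user: app
db_password: "{{ vault_db_password }}"

# group_vars/production/vault.yml (encrypted with ansible-vault encrypt)
vault_db_password: SuperSecretPassword
```

Searching for `db_password` then finds the plaintext reference even though the value itself is encrypted.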
Multi-vault strategy#
For environments with different access levels, use multiple vault IDs:
```shell
# Encrypt with environment-specific vault IDs
ansible-vault encrypt --vault-id dev@prompt group_vars/dev/secrets.yml
ansible-vault encrypt --vault-id prod@/path/to/prod-vault-pass group_vars/prod/secrets.yml

# Run playbook with multiple vault passwords
ansible-playbook site.yml --vault-id dev@prompt --vault-id prod@/path/to/prod-vault-pass
```

This ensures that developers who know the dev vault password cannot decrypt production secrets. In CI/CD, each environment's vault password is stored as a separate pipeline secret.
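In a pipeline, that usually means materializing the secret into a temporary password file before invoking ansible-playbook. A hedged sketch, where the `VAULT_PASSWORD_PROD` variable name is an assumption about how the pipeline exposes the secret:

```shell
# Hypothetical CI step: write the vault password from a pipeline secret
# (exposed here as the VAULT_PASSWORD_PROD env var) to a file ansible can read.
set -eu
# In CI this comes from a pipeline secret; default only for local demonstration.
: "${VAULT_PASSWORD_PROD:=example-password}"
VAULT_PASS_FILE="$(mktemp)"
printf '%s' "${VAULT_PASSWORD_PROD}" > "${VAULT_PASS_FILE}"
chmod 600 "${VAULT_PASS_FILE}"
# ansible-playbook site.yml --vault-id "prod@${VAULT_PASS_FILE}"
```

Using `printf '%s'` rather than `echo` avoids appending a trailing newline, which would otherwise become part of the vault password.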
Decision summary for vault strategy#
| Scenario | Approach |
|---|---|
| Small team, one environment | Single vault password, encrypt_string |
| Multiple environments | Vault ID per environment, file encryption |
| CI/CD integration | Vault password files from pipeline secrets |
| Rotation required | Use external secrets manager (HashiCorp Vault, AWS Secrets Manager) with lookup plugins instead of Ansible Vault |
Callback Plugins#
Callback plugins customize Ansible’s output and reporting. They hook into events like task start, task completion, play start, and play end.
Built-in callbacks worth enabling#
```ini
# ansible.cfg
[defaults]
callbacks_enabled = ansible.posix.profile_tasks, ansible.posix.timer

[callback_profile_tasks]
task_output_limit = 20
sort_order = descending
```

profile_tasks shows execution time per task, making it easy to identify slow tasks. timer shows the total playbook execution time. Both are essential for optimizing playbook performance.
Custom callback for notifications#
```python
# plugins/callback/slack_notify.py
import os

import requests

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'slack_notify'

    def __init__(self):
        super().__init__()
        # Read the webhook URL from the environment so it stays out of the repo
        self.webhook_url = os.environ.get('SLACK_WEBHOOK_URL')

    def v2_playbook_on_stats(self, stats):
        hosts = sorted(stats.processed.keys())
        failures = any(stats.summarize(h).get('failures', 0) > 0 for h in hosts)
        message = f"Playbook {'FAILED' if failures else 'completed'}: {len(hosts)} hosts"
        if self.webhook_url:
            requests.post(self.webhook_url, json={"text": message}, timeout=10)
```

When to write custom callbacks#
Write custom callbacks when you need to integrate with external notification systems (Slack, PagerDuty, custom dashboards), you want structured logging in a specific format (JSON for log aggregation), or you need to track playbook execution metrics (duration, success rate, change rate) for compliance or auditing.
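For a local callback such as the slack_notify example above to load, ansible.cfg must point at its directory and enable it by name; notification-type callbacks are never loaded implicitly. The path below is an assumption matching the plugin layout shown earlier:

```ini
# ansible.cfg
[defaults]
# Directory containing slack_notify.py
callback_plugins = ./plugins/callback
# Notification callbacks must be enabled explicitly
callbacks_enabled = slack_notify
```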
Custom Modules#
Custom modules extend Ansible’s capabilities when no existing module handles your use case.
When to write a custom module#
Write a custom module when you interact with an internal API that has no community module, you need idempotent management of a resource that the command or shell modules cannot provide, or an existing module does not expose the specific parameters you need.
```python
# plugins/modules/custom_deploy.py
from ansible.module_utils.basic import AnsibleModule
import requests


def main():
    module = AnsibleModule(
        argument_spec=dict(
            app_name=dict(type='str', required=True),
            version=dict(type='str', required=True),
            api_url=dict(type='str', required=True),
            api_token=dict(type='str', required=True, no_log=True),
        ),
        supports_check_mode=True,
    )

    headers = {"Authorization": f"Bearer {module.params['api_token']}"}
    base_url = f"{module.params['api_url']}/apps/{module.params['app_name']}"

    # Query current state so the module can report changed accurately
    current = requests.get(base_url, headers=headers, timeout=30).json()

    if current.get('version') == module.params['version']:
        module.exit_json(changed=False, msg="Already at target version")

    if module.check_mode:
        module.exit_json(changed=True, msg="Would deploy new version")

    response = requests.post(
        f"{base_url}/deploy",
        headers=headers,
        json={"version": module.params['version']},
        timeout=30,
    )
    if response.status_code != 200:
        module.fail_json(msg=f"Deploy failed: {response.text}")

    module.exit_json(changed=True, version=module.params['version'])


if __name__ == '__main__':
    main()
```

Key requirements for custom modules: support check mode so --check works, use no_log=True for sensitive parameters, report changed=True/False accurately for idempotency, and use module.fail_json() for errors.
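The idempotency logic at the heart of such a module can be isolated into a pure function for unit testing without an Ansible runtime. A sketch (`deploy_decision` is a hypothetical helper, not part of the module above):

```python
def deploy_decision(current_version, target_version, check_mode):
    """Return (changed, should_deploy) for the module's three outcomes."""
    if current_version == target_version:
        return (False, False)  # already converged: report no change, do nothing
    if check_mode:
        return (True, False)   # would change, but never act under --check
    return (True, True)        # report change and actually deploy


print(deploy_decision("1.0", "1.0", False))  # (False, False)
print(deploy_decision("1.0", "2.0", True))   # (True, False)
print(deploy_decision("1.0", "2.0", False))  # (True, True)
```

Keeping the decision separate from the HTTP calls makes the changed/check-mode contract trivially testable.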
Testing with Molecule#
Molecule is the standard testing framework for Ansible roles. It creates ephemeral instances (Docker containers, cloud VMs), applies the role, and runs verification tests.
When to invest in Molecule testing#
Invest in Molecule when roles are shared across teams or projects, roles manage critical infrastructure (databases, security configurations), you have a CI/CD pipeline for Ansible content, or you are building collections for distribution.
Skip Molecule when roles are simple and used by one team, the cost of maintaining test infrastructure exceeds the cost of occasional bugs, or roles are throwaway (one-time migration playbooks).
Basic Molecule setup#
```yaml
# molecule/default/molecule.yml
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: ubuntu-test
    image: ubuntu:22.04
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
  - name: rocky-test
    image: rockylinux:9
    pre_build_image: true
    command: /lib/systemd/systemd
    privileged: true
provisioner:
  name: ansible
verifier:
  name: ansible
```

```yaml
# molecule/default/converge.yml
- name: Converge
  hosts: all
  roles:
    - role: webserver
      vars:
        webserver_port: 8080
```

```yaml
# molecule/default/verify.yml
- name: Verify
  hosts: all
  tasks:
    - name: Check nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present
      check_mode: true
      register: nginx_installed
      failed_when: nginx_installed.changed

    - name: Check nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
      check_mode: true
      register: nginx_running
      failed_when: nginx_running.changed

    - name: Check nginx is listening
      ansible.builtin.wait_for:
        port: 8080
        timeout: 5
```

Run the test lifecycle:
```shell
molecule create     # Create test instances
molecule converge   # Apply the role
molecule verify     # Run verification tests
molecule destroy    # Clean up
molecule test       # Run the full lifecycle (create, converge, verify, destroy)
```

CI Integration#
When to add CI for Ansible#
Add CI when multiple people contribute to the Ansible codebase, changes are deployed to production environments, or compliance requires an audit trail of configuration changes.
Pipeline structure#
```yaml
# .github/workflows/ansible-ci.yml
name: Ansible CI

on:
  pull_request:
    paths: ['roles/**', 'playbooks/**', 'collections/**']
  push:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run ansible-lint
        uses: ansible/ansible-lint@main

  molecule:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        role: [webserver, database, monitoring]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install ansible molecule "molecule-plugins[docker]"
      - name: Run Molecule tests
        run: molecule test
        working-directory: roles/${{ matrix.role }}

  deploy-staging:
    runs-on: ubuntu-latest
    needs: molecule
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Run playbook against staging
        run: |
          printf '%s' "$VAULT_PASSWORD" > .vault-pass
          ansible-playbook -i inventory/staging site.yml --diff --check --vault-password-file .vault-pass
          ansible-playbook -i inventory/staging site.yml --diff --vault-password-file .vault-pass
        env:
          VAULT_PASSWORD: ${{ secrets.VAULT_PASSWORD_STAGING }}
```

The pipeline follows a progression: lint runs first (fast, catches syntax and style issues), then Molecule tests (slower, catch functional issues), then the staging deployment (only on merge to main, which is why the workflow also triggers on push to main). Production deployment should require manual approval via a separate workflow or deployment tool.
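One way to express that manual gate in the same workflow is a GitHub Actions environment whose required reviewers are configured in the repository settings; a sketch of a hypothetical job:

```yaml
  deploy-production:
    runs-on: ubuntu-latest
    needs: molecule
    # The job pauses until reviewers configured on the
    # "production" environment approve the deployment.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Run playbook against production
        run: ansible-playbook -i inventory/production site.yml --diff
```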
Common Mistakes#
Using collections too early. Creating a collection for three roles used by one team adds build, versioning, and distribution overhead without benefit. Start with roles in a project directory. Move to collections when the sharing and versioning needs justify it.
Dynamic inventory without caching. Every ansible-playbook run queries the cloud API for inventory. For large inventories, this adds seconds or minutes to every run. Enable inventory caching:
```ini
[inventory]
cache = true
cache_plugin = ansible.builtin.jsonfile
cache_connection = /tmp/ansible-inventory-cache
cache_timeout = 300
```

Testing the Ansible module, not the result. Molecule verify tasks that check "did Ansible run this task" are tautological. Instead, verify the observable outcome: is the port open, does the service respond, is the file present with the correct content.
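For example, a verify task that probes behavior rather than re-running the install (the URL and port are placeholders):

```yaml
- name: Service responds on its health endpoint
  ansible.builtin.uri:
    url: http://localhost:8080/health
    status_code: 200
```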
Vault password in the repository. The vault password file must never be committed. Add it to .gitignore and distribute it through a secrets manager or CI/CD pipeline secrets. The encrypted files themselves are safe to commit – that is the point of vault encryption.
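A minimal guard against committing it (the filenames here are assumptions; match them to whatever your vault password files are actually called):

```text
# .gitignore
.vault-pass
.vault-pass-*
*-vault-password
```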