Data Classification and Handling#

Data classification assigns sensitivity levels to data and maps those levels to specific handling requirements — who can access it, how it is encrypted, where it can be stored, how long it is retained, and how it is disposed of. Without classification, every piece of data gets the same (usually insufficient) protection, or security is applied inconsistently based on individual judgment.

Defining Classification Tiers#

Most organizations need four tiers: fewer leads to overly broad categories, more to confusion about which tier applies.

Tier 1: Public#

Data intended for public consumption. No confidentiality requirement.

Examples:
  - Marketing content, blog posts, public documentation
  - Open-source code
  - Published financial reports
  - Public API responses (non-authenticated)

Handling:
  - Encryption in transit: TLS (standard)
  - Encryption at rest: Optional
  - Access control: None required
  - Retention: Per business need
  - Disposal: No special requirements

Tier 2: Internal#

Data for internal use that would not cause significant harm if disclosed, but is not intended for public access.

Examples:
  - Internal documentation, meeting notes
  - Non-sensitive business metrics
  - Employee directory (name, title, email)
  - Internal tool configurations

Handling:
  - Encryption in transit: TLS 1.2+ (required)
  - Encryption at rest: Recommended (server-side encryption)
  - Access control: Authentication required
  - Retention: 3 years default, per policy
  - Disposal: Standard deletion (no special wiping)

Tier 3: Confidential#

Data that could cause material harm to the organization or individuals if disclosed. Most business data falls here.

Examples:
  - Customer PII (names, emails, addresses, phone numbers)
  - Financial records, invoices, contracts
  - Source code for proprietary products
  - Business strategies, roadmaps, M&A plans
  - Employee HR records, compensation data

Handling:
  - Encryption in transit: TLS 1.2+ (required, mTLS for service-to-service)
  - Encryption at rest: Required (AES-256, KMS-managed keys)
  - Access control: RBAC with need-to-know basis
  - Logging: All access logged and monitored
  - Retention: Per regulation (GDPR, CCPA, HIPAA) or 7 years default
  - Disposal: Cryptographic erasure or secure deletion
  - Location: Approved regions only (per data residency policy)

Tier 4: Restricted#

Data requiring the highest level of protection. Disclosure could cause severe harm, legal liability, or regulatory violation.

Examples:
  - Passwords, API keys, encryption keys, secrets
  - Payment card data (PCI scope)
  - Protected Health Information (PHI / HIPAA scope)
  - Government classified data (CUI, ITAR)
  - Biometric data
  - Social Security Numbers, national ID numbers

Handling:
  - Encryption in transit: TLS 1.3 with mTLS (required)
  - Encryption at rest: Required (AES-256, customer-managed KMS keys)
  - Field-level encryption: Required for specific fields (SSN, card numbers)
  - Access control: Strict RBAC, MFA, just-in-time access
  - Logging: All access logged, monitored, and alerted
  - Retention: Minimum required by regulation, dispose as soon as possible
  - Disposal: Cryptographic erasure with verification
  - Location: Specific approved regions only
  - Separation: Isolated storage, separate databases/encryption keys

Labeling Data#

Classification is useless if data is not labeled. Labels must be machine-readable for automated enforcement.

Database-Level Labeling#

Add classification metadata to schemas:

-- Table-level classification
COMMENT ON TABLE customers IS 'classification:confidential';
COMMENT ON TABLE payment_methods IS 'classification:restricted';

-- Column-level classification
COMMENT ON COLUMN customers.email IS 'classification:confidential,pii:true';
COMMENT ON COLUMN customers.name IS 'classification:confidential,pii:true';
COMMENT ON COLUMN payment_methods.card_last_four IS 'classification:restricted,pci:true';
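
Because these comments live in the catalog, scanners and access-review tooling can discover classified columns automatically. A minimal sketch, assuming PostgreSQL and psycopg2 (the connection details are illustrative):

import psycopg2

# Find every column whose comment carries a classification label (PostgreSQL)
CLASSIFIED_COLUMNS_SQL = """
SELECT c.relname AS table_name,
       a.attname AS column_name,
       col_description(c.oid, a.attnum) AS label
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE a.attnum > 0
  AND NOT a.attisdropped
  AND col_description(c.oid, a.attnum) LIKE 'classification:%';
"""

def list_classified_columns(dsn: str) -> list[tuple]:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(CLASSIFIED_COLUMNS_SQL)
        return cur.fetchall()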

Application-Level Labeling#

Tag data objects in code:

from dataclasses import dataclass, field
from enum import Enum

class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"

@dataclass
class UserProfile:
    user_id: str = field(metadata={"classification": Classification.INTERNAL})
    name: str = field(metadata={"classification": Classification.CONFIDENTIAL, "pii": True})
    email: str = field(metadata={"classification": Classification.CONFIDENTIAL, "pii": True})
    ssn: str = field(metadata={"classification": Classification.RESTRICTED, "pii": True})
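
The metadata is introspectable at runtime, so serializers and log formatters can redact by tier instead of relying on hand-maintained field lists. A rough sketch of that idea (the redact_for_export helper is illustrative, not part of any library):

from dataclasses import fields

def redact_for_export(obj, max_tier: Classification = Classification.INTERNAL) -> dict:
    """Return a dict of the object with fields above max_tier masked."""
    tiers = list(Classification)  # members are declared least to most sensitive
    allowed = tiers.index(max_tier)
    result = {}
    for f in fields(obj):
        tier = f.metadata.get("classification", Classification.RESTRICTED)
        value = getattr(obj, f.name)
        result[f.name] = value if tiers.index(tier) <= allowed else "[REDACTED]"
    return result

profile = UserProfile(user_id="u-123", name="Ada", email="ada@example.com", ssn="123-45-6789")
redact_for_export(profile)
# {'user_id': 'u-123', 'name': '[REDACTED]', 'email': '[REDACTED]', 'ssn': '[REDACTED]'}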

Cloud Resource Labeling#

Tag cloud resources with the classification of data they store:

# Terraform: tag resources with classification
resource "aws_s3_bucket" "customer_data" {
  bucket = "myapp-customer-data"
  tags = {
    DataClassification = "confidential"
    ContainsPII        = "true"
    DataOwner          = "customer-team"
    RetentionPolicy    = "7-years"
  }
}

resource "aws_rds_cluster" "payments" {
  cluster_identifier = "payments-db"
  tags = {
    DataClassification = "restricted"
    ContainsPCI        = "true"
    PCIScope           = "true"
    DataOwner          = "payments-team"
    RetentionPolicy    = "per-pci-dss"
  }
}

Enforce tagging with cloud policies:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireClassificationTag",
      "Effect": "Deny",
      "Action": ["s3:CreateBucket", "rds:CreateDBCluster"],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/DataClassification": "true"
        }
      }
    }
  ]
}

This prevents creating S3 buckets or RDS clusters without a classification tag.
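
For resources that predate the policy, a periodic audit can flag anything missing the tag. A minimal sketch using boto3 (S3 buckets only; other services would need their own checks):

import boto3
from botocore.exceptions import ClientError

def buckets_missing_classification() -> list[str]:
    """Return S3 bucket names that have no DataClassification tag."""
    s3 = boto3.client("s3")
    missing = []
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        try:
            tags = s3.get_bucket_tagging(Bucket=name)["TagSet"]
        except ClientError:
            # NoSuchTagSet: the bucket has no tags at all
            missing.append(name)
            continue
        if not any(t["Key"] == "DataClassification" for t in tags):
            missing.append(name)
    return missing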

Encryption Tiers#

Map classification to encryption requirements:

| Tier | At Rest | In Transit | Key Management | Field-Level |
|---|---|---|---|---|
| Public | Optional | TLS | N/A | No |
| Internal | SSE (provider-managed keys) | TLS 1.2+ | Provider-managed | No |
| Confidential | SSE-KMS (org-managed keys) | TLS 1.2+, mTLS | KMS with rotation | Recommended for PII |
| Restricted | SSE-KMS (customer-managed) | TLS 1.3 + mTLS | HSM-backed, BYOK | Required |

Field-Level Encryption#

For Restricted data, encrypt individual fields before storing them in the database. Even if the database is compromised, the fields remain encrypted:

from cryptography.fernet import Fernet

class FieldEncryptor:
    def __init__(self, key_id: str):
        # Retrieve the key from KMS/Vault at runtime, never hardcode it.
        # vault_client stands in for your secrets-manager client.
        self.key = vault_client.get_key(key_id)
        self.fernet = Fernet(self.key)

    def encrypt(self, plaintext: str) -> str:
        return self.fernet.encrypt(plaintext.encode()).decode()

    def decrypt(self, ciphertext: str) -> str:
        return self.fernet.decrypt(ciphertext.encode()).decode()

# Usage
encryptor = FieldEncryptor("ssn-encryption-key")
encrypted_ssn = encryptor.encrypt("123-45-6789")
# Store encrypted_ssn in the database

Envelope Encryption#

For large data, use envelope encryption: encrypt the data with a data encryption key (DEK), then encrypt the DEK with a key encryption key (KEK) managed by KMS:

Data → Encrypt with DEK → Encrypted Data
DEK  → Encrypt with KEK → Encrypted DEK

Store: Encrypted Data + Encrypted DEK
KMS holds: KEK (never leaves KMS)

AWS KMS, GCP Cloud KMS, and Azure Key Vault all support envelope encryption natively. This limits the amount of data sent to the key service (just the small DEK, never the dataset itself).
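
As a rough illustration of the pattern with AWS KMS and boto3 (the KEK ARN and storage format are placeholders):

import base64
import boto3
from cryptography.fernet import Fernet

kms = boto3.client("kms")

def encrypt_envelope(plaintext: bytes, kek_arn: str) -> dict:
    # Ask KMS for a fresh data key: a plaintext DEK plus the same DEK wrapped by the KEK
    resp = kms.generate_data_key(KeyId=kek_arn, KeySpec="AES_256")
    dek = base64.urlsafe_b64encode(resp["Plaintext"])  # Fernet keys are base64-encoded 32 bytes
    return {
        # Persist both values; the plaintext DEK is discarded when this function returns
        "ciphertext": Fernet(dek).encrypt(plaintext),
        "encrypted_dek": resp["CiphertextBlob"],
    }

def decrypt_envelope(record: dict) -> bytes:
    # KMS unwraps the DEK; the KEK itself never leaves KMS
    plain_dek = kms.decrypt(CiphertextBlob=record["encrypted_dek"])["Plaintext"]
    return Fernet(base64.urlsafe_b64encode(plain_dek)).decrypt(record["ciphertext"])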

Retention and Disposal#

Retention Policy Matrix#

| Data Type | Classification | Retention Period | Basis |
|---|---|---|---|
| Customer PII | Confidential | Account lifetime + 30 days | Business need + GDPR right to erasure |
| Payment records | Restricted | 7 years | PCI-DSS, SOX |
| Health records (PHI) | Restricted | 6 years minimum | HIPAA |
| Audit logs | Confidential | 7 years | SOX, regulatory |
| Session logs | Internal | 90 days | Operational |
| Marketing analytics | Internal | 2 years | Business need |
| Deleted user data | N/A | 0 (immediate disposal) | GDPR right to erasure |

Automated Retention Enforcement#

S3 Lifecycle Rules:

{
  "Rules": [
    {
      "ID": "session-logs-90-days",
      "Filter": { "Prefix": "logs/sessions/" },
      "Status": "Enabled",
      "Expiration": { "Days": 90 }
    },
    {
      "ID": "audit-logs-7-years",
      "Filter": { "Prefix": "logs/audit/" },
      "Status": "Enabled",
      "Transition": [
        { "Days": 365, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
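
These rules can be applied through infrastructure-as-code or directly via the API. A minimal boto3 sketch, assuming the rules above are saved as lifecycle.json and the bucket name is illustrative:

import json
import boto3

s3 = boto3.client("s3")

with open("lifecycle.json") as f:
    config = json.load(f)

# Replaces the bucket's entire lifecycle configuration with the rules above
s3.put_bucket_lifecycle_configuration(
    Bucket="myapp-logs",
    LifecycleConfiguration=config,
)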

Database Retention:

-- Partition by date for efficient retention
CREATE TABLE session_logs (
    id BIGSERIAL,
    user_id UUID,
    event TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE session_logs_2026_01 PARTITION OF session_logs
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

-- Drop old partitions (much faster than DELETE)
DROP TABLE session_logs_2025_11;
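
Dropping partitions still needs something to run it on a schedule. A small cron-driven helper along these lines (psycopg2 assumed; the naming convention matches the monthly partitions above) is one way to do it:

from datetime import date, timedelta
import psycopg2

def drop_expired_partitions(dsn: str, retention_days: int = 90) -> None:
    """Drop monthly session_logs partitions older than the retention window."""
    cutoff = date.today() - timedelta(days=retention_days)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT tablename FROM pg_tables WHERE tablename LIKE 'session_logs_%'"
        )
        for (name,) in cur.fetchall():
            # Partition names follow session_logs_YYYY_MM
            year, month = name.rsplit("_", 2)[-2:]
            if date(int(year), int(month), 1) < cutoff.replace(day=1):
                cur.execute(f"DROP TABLE IF EXISTS {name}")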

Cryptographic Erasure#

For restricted data, standard deletion is insufficient — data may be recoverable from disk, backups, or replicas. Cryptographic erasure destroys the encryption key, making the data permanently unreadable:

1. Data encrypted with DEK-12345
2. DEK-12345 encrypted with KEK in KMS
3. To dispose: schedule KMS key deletion (7-day waiting period on AWS)
4. After key deletion: encrypted data is permanently unrecoverable
5. Clean up: delete the encrypted data files (now meaningless bytes)

This is faster and more reliable than overwriting data across distributed storage systems.
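
On AWS, the disposal step is a single KMS call; a hedged sketch (the key ARN is a placeholder):

import boto3

kms = boto3.client("kms")

# Schedule deletion of the dataset's KEK; 7 days is the minimum waiting period.
# Once the key is deleted, every DEK wrapped by it (and all data encrypted under
# those DEKs) is permanently unreadable.
kms.schedule_key_deletion(
    KeyId="arn:aws:kms:us-east-1:123456789012:key/example-key-id",
    PendingWindowInDays=7,
)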

Data Loss Prevention (DLP)#

DLP detects and prevents sensitive data from leaving authorized boundaries.

Storage-Level DLP#

Scan data stores for sensitive data patterns:

# Example: AWS Macie for S3
# Scans S3 buckets for PII, credentials, and sensitive data
resource "aws_macie2_account" "main" {}

resource "aws_macie2_classification_job" "pii_scan" {
  job_type = "SCHEDULED"
  s3_job_definition {
    bucket_definitions {
      account_id = data.aws_caller_identity.current.account_id
      buckets    = ["myapp-customer-data"]
    }
  }
  schedule_frequency {
    weekly_schedule = "MONDAY"
  }
}

Application-Level DLP#

Scan data at API boundaries before it leaves the system:

import re

SENSITIVE_PATTERNS = {
    "ssn": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "credit_card": re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'),
    "email": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'),
    "api_key": re.compile(r'\b(sk_live_|sk_test_|ghp_|AKIA)[A-Za-z0-9]{20,}\b'),
}

def scan_for_sensitive_data(text: str) -> list[dict]:
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings.append({"type": name, "count": len(matches)})
    return findings

# Use in API middleware (log and error_response are stand-ins for your framework's logger and error helper)
def dlp_middleware(request, response):
    findings = scan_for_sensitive_data(response.body)
    if findings:
        log.warning(f"DLP alert: {findings} in response to {request.path}")
        if any(f["type"] in ["ssn", "credit_card"] for f in findings):
            return error_response(500, "Response blocked by DLP policy")
    return response

Log Sanitization#

Prevent sensitive data from entering logs:

import re

REDACT_PATTERNS = [
    (re.compile(r'(password["\s:=]+)[^\s,}"]+', re.I), r'\1[REDACTED]'),
    (re.compile(r'(authorization["\s:=]+bearer\s+)[^\s,}"]+', re.I), r'\1[REDACTED]'),
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN-REDACTED]'),
    (re.compile(r'(sk_live_|sk_test_|ghp_|AKIA)[A-Za-z0-9]+'), '[KEY-REDACTED]'),
]

def sanitize_log_message(message: str) -> str:
    for pattern, replacement in REDACT_PATTERNS:
        message = pattern.sub(replacement, message)
    return message
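
One way to wire this in is a logging filter, so every record passes through the redaction rules before any handler emits it (a sketch using only the standard library):

import logging

class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message once, redact it, and clear the args so the
        # sanitized string is what every handler sees.
        record.msg = sanitize_log_message(record.getMessage())
        record.args = ()
        return True

logger = logging.getLogger("app")
logger.addFilter(RedactingFilter())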

Git Pre-Commit Hooks#

Prevent secrets and sensitive data from being committed to version control:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

  - repo: https://github.com/trufflesecurity/trufflehog
    rev: v3.63.0
    hooks:
      - id: trufflehog

Common Mistakes#

  1. Classifying data once and never updating. Data classification changes as regulations evolve and business needs shift. Review classifications annually and when introducing new data types.
  2. Over-classifying everything as restricted. If all data is restricted, the controls become so burdensome that people work around them. Accurate classification focuses protection where it matters.
  3. Not classifying derived data. A report aggregating confidential data is still confidential. An anonymized dataset derived from restricted data may be internal — but only if anonymization is verified.
  4. Implementing retention in application code instead of infrastructure. Application-level deletion is unreliable — it misses backups, replicas, caches, and log files. Use infrastructure-level retention (S3 lifecycle, partition dropping, KMS key deletion).
  5. Logging sensitive data while trying to prevent data leakage. The DLP system that logs the sensitive data it finds creates a new copy of that data. Scan and alert, but do not store the actual sensitive values in alerts.