Two-Pass Analysis: Summarize-Then-Correlate#

A 32B model with a 32K context window can process roughly 8-10 source files at once. A real codebase has hundreds. Concatenating everything into one prompt fails — the context overflows, quality degrades, and the model either truncates or hallucinates connections.

The two-pass pattern solves this by splitting analysis into two stages:

  1. Pass 1 (Summarize): A fast 7B model reads each file independently and produces a focused summary.
  2. Pass 2 (Correlate): A capable 32B model reads all summaries (which are much shorter than the original files) and answers the cross-cutting question.

This effectively multiplies your context window by the compression ratio of summarization — typically 10-20x. A 32K context that handles 10 files directly can handle 100-200 files through summaries.

Architecture#

Source Files (100+ files, 500K+ tokens total)
  │
  ├── file1.py ──→ 7B Model ──→ Summary (~200 tokens)
  ├── file2.py ──→ 7B Model ──→ Summary (~200 tokens)
  ├── file3.go ──→ 7B Model ──→ Summary (~200 tokens)
  │   ... (parallel, 3 workers)
  └── fileN.rs ──→ 7B Model ──→ Summary (~200 tokens)
  │
  │  Total summaries: ~20K tokens (fits in 32K context)
  │
  └──→ 32B Model + All Summaries + Question ──→ Analysis

Implementation#

Pass 1: Parallel Summarization#

import ollama
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

SUMMARY_MODEL = "qwen2.5-coder:7b"
MAX_WORKERS = 3  # Ollama single-threads models; 3 workers avoids overwhelming it

PRESETS = {
    "architecture": {
        "focus": "dependencies, imports, data flow, coupling between components",
        "question": "How do the components of this codebase fit together?",
    },
    "security": {
        "focus": "input validation, authentication, secrets handling, error exposure",
        "question": "What security gaps exist in this codebase?",
    },
    "consistency": {
        "focus": "error handling patterns, naming conventions, code style",
        "question": "What inconsistencies exist across this codebase?",
    },
    "review": {
        "focus": "bugs, edge cases, unchecked assumptions, error handling",
        "question": "What bugs and issues exist in this codebase?",
    },
    "onboard": {
        "focus": "purpose, entry points, key abstractions, domain concepts",
        "question": "Explain this codebase to a new developer.",
    },
}

def summarize_file(filepath: str, preset: str) -> dict:
    """Summarize a single file using the 7B model."""
    content = Path(filepath).read_text()
    focus = PRESETS[preset]["focus"]

    prompt = f"""Summarize this source file with focus on: {focus}

Be specific. Reference function names, types, and concrete details.
Keep the summary under 300 words.

File: {filepath}

{content}"""

    response = ollama.chat(
        model=SUMMARY_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0, "num_predict": 512},
    )

    return {
        "file": filepath,
        "summary": response["message"]["content"],
        "tokens": response.get("eval_count", 0),
    }


def summarize_all(files: list[str], preset: str) -> list[dict]:
    """Summarize all files in parallel."""
    summaries = []

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(summarize_file, f, preset): f for f in files}

        for future in as_completed(futures):
            filepath = futures[future]
            try:
                result = future.result()
                summaries.append(result)
                print(f"  Summarized: {filepath} ({result['tokens']} tokens)")
            except Exception as e:
                print(f"  Failed: {filepath}: {e}")

    return sorted(summaries, key=lambda s: s["file"])

Pass 2: Correlation#

CORRELATE_MODEL = "qwen2.5-coder:32b"

def correlate(summaries: list[dict], preset: str) -> str:
    """Correlate all summaries to answer the cross-cutting question."""
    question = PRESETS[preset]["question"]

    summary_text = "\n\n".join(
        f"### {s['file']}\n{s['summary']}" for s in summaries
    )

    prompt = f"""You are analyzing a codebase. Below are summaries of each file.

{summary_text}

Based on these summaries, answer this question:
{question}

Reference specific file names when making observations.
Organize your response by theme, not by file."""

    response = ollama.chat(
        model=CORRELATE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1, "num_predict": 4096},
    )

    return response["message"]["content"]

Full Pipeline#

def analyze_codebase(directory: str, preset: str = "architecture"):
    """Run the full two-pass analysis."""
    # Discover source files
    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java"}
    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and "vendor" not in str(p) and "node_modules" not in str(p)
    ]

    print(f"Found {len(files)} files. Preset: {preset}")

    # Pass 1: Summarize
    print("\n--- Pass 1: Summarizing files ---")
    summaries = summarize_all(files, preset)

    # Pass 2: Correlate
    print("\n--- Pass 2: Correlating summaries ---")
    analysis = correlate(summaries, preset)

    return analysis
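
The shell commands in the preset section below assume a small analyze.py wrapper. A minimal sketch of that entry point (the argparse flag names are an assumption; only --preset is implied by the usage examples):

import argparse

def main():
    # Hypothetical CLI wrapper around analyze_codebase; flag names are assumptions.
    parser = argparse.ArgumentParser(description="Two-pass codebase analysis")
    parser.add_argument("directory", help="Root of the codebase to analyze")
    parser.add_argument("--preset", default="architecture", choices=sorted(PRESETS))
    args = parser.parse_args()

    analysis = analyze_codebase(args.directory, args.preset)
    print(analysis)

if __name__ == "__main__":
    main()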

Caching Summaries#

Summarization is the expensive step (many API calls). Cache summaries and reuse them across different questions:

import hashlib

CACHE_DIR = Path.home() / ".cache" / "codebase-analysis"

def file_hash(filepath: str) -> str:
    """Hash based on path + mtime + size for change detection."""
    stat = Path(filepath).stat()
    key = f"{filepath}:{stat.st_mtime}:{stat.st_size}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

def load_cached_summaries(files: list[str], preset: str) -> tuple[list[dict], list[str]]:
    """Load cached summaries and return list of files needing summarization."""
    cache_file = CACHE_DIR / f"{preset}_summaries.json"
    cached = {}

    if cache_file.exists():
        cached = {s["file"]: s for s in json.loads(cache_file.read_text())}

    hit = []
    miss = []

    for f in files:
        fhash = file_hash(f)
        if f in cached and cached[f].get("hash") == fhash:
            hit.append(cached[f])
        else:
            miss.append(f)

    return hit, miss

def save_summaries(summaries: list[dict], preset: str):
    """Save summaries to cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{preset}_summaries.json"

    # Add file hashes
    for s in summaries:
        s["hash"] = file_hash(s["file"])

    cache_file.write_text(json.dumps(summaries, indent=2))
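
To wire the cache into the pipeline, split files into cache hits and misses before Pass 1. A minimal sketch, reusing the functions above (summarize_with_cache is a hypothetical helper name):

def summarize_with_cache(files: list[str], preset: str) -> list[dict]:
    """Summarize only files that changed since the last run; reuse the rest."""
    cached, to_summarize = load_cached_summaries(files, preset)
    print(f"  Cache hits: {len(cached)}, misses: {len(to_summarize)}")

    fresh = summarize_all(to_summarize, preset) if to_summarize else []
    summaries = sorted(cached + fresh, key=lambda s: s["file"])

    save_summaries(summaries, preset)
    return summaries

In analyze_codebase, calling summarize_with_cache(files, preset) instead of summarize_all(files, preset) gives incremental re-analysis.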

With caching, the first analysis of a 100-file codebase takes 5-10 minutes. Subsequent analyses of the same files reuse the cached summaries and only run the correlation step — a single 32B call that takes 30-60 seconds. Note that the cache above is keyed per preset, so switching presets still re-runs summarization; the cheap reuse applies when you re-ask questions against the same preset's summaries.

Presets as Reusable Workflows#

Presets let you analyze the same codebase from different angles without rewriting prompts:

# Architecture overview
python analyze.py ~/projects/my-app --preset architecture

# Security review
python analyze.py ~/projects/my-app --preset security

# Onboarding guide
python analyze.py ~/projects/my-app --preset onboard

Each preset changes the summarization focus (what the 7B model looks for in each file) and the correlation question (what the 32B model synthesizes from the summaries).

Adding a new preset is just another entry in the PRESETS dictionary — define the focus and the question. The two-pass infrastructure handles the rest.
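
As an illustration, a hypothetical performance preset might look like this (the focus and question wording are made up for the example, not part of the presets above):

PRESETS["performance"] = {
    "focus": "hot loops, I/O inside loops, repeated allocations, caching, algorithmic complexity",
    "question": "Where are the likely performance bottlenecks in this codebase?",
}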

When Two-Pass Breaks Down#

The pattern has limits:

  • Summarization is lossy. The 7B model may miss subtle details that matter for the correlation question. If you get suspicious results, spot-check a few summaries against the original files.
  • Cross-file dependencies at the token level. If two files share a specific variable name or magic constant that only matters in combination, the summarizer may not preserve that detail. Targeted extraction (asking for specific fields) helps.
  • Very large files. A single file that exceeds the 7B model’s context window needs to be chunked before summarization. Split at function or class boundaries; a chunking sketch follows this list.
  • Real-time analysis. The parallel summarization step takes minutes for large codebases. This is a batch pattern, not an interactive one.
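
For the large-file case, one possible chunker splits at top-level def/class lines. This is a naive line-based heuristic for Python sources, not a parser:

def chunk_source(content: str, max_chars: int = 12_000) -> list[str]:
    """Split a file into chunks, starting a new chunk at the first top-level
    def/class line after the size limit is reached (naive heuristic)."""
    chunks, current, size = [], [], 0
    for line in content.splitlines(keepends=True):
        at_boundary = line.startswith("def ") or line.startswith("class ")
        if current and at_boundary and size > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

Each chunk is summarized separately with the 7B model, and the chunk summaries are merged into a single per-file summary before Pass 2.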

For these cases, consider RAG (semantic search over the codebase) or targeted extraction (pulling specific structured data from each file instead of free-form summaries).
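
Targeted extraction can reuse the same scaffolding: instead of free-form prose, ask the summarizer for specific fields. A sketch assuming Ollama's JSON mode; the field names are illustrative:

def extract_facts(filepath: str) -> dict:
    """Ask the 7B model for structured fields instead of a free-form summary."""
    content = Path(filepath).read_text()
    prompt = f"""Extract facts from this source file as JSON with keys
"imports", "public_functions", "external_calls", "config_or_secrets".

File: {filepath}

{content}"""

    response = ollama.chat(
        model=SUMMARY_MODEL,
        messages=[{"role": "user", "content": prompt}],
        format="json",  # constrain output to valid JSON
        options={"temperature": 0.0, "num_predict": 512},
    )
    return {"file": filepath, "facts": json.loads(response["message"]["content"])}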

Common Mistakes#

  1. Using too many parallel workers. Ollama runs one inference at a time per model. More than 3 workers creates a queue that does not improve throughput but increases memory pressure. Measure actual parallelism before increasing workers.
  2. Not caching summaries. Re-summarizing 100 files every time you change the correlation question wastes 90% of the work. Cache summaries and invalidate only when files change.
  3. Summarizing with the same model used for correlation. The point of two passes is using a fast, cheap model for the N-file summarization and a capable model for the single correlation. Running the 32B model over every file makes Pass 1 several times slower without meaningfully improving the summaries.
  4. Asking the summarizer to answer the question. The summarizer should capture relevant facts, not draw conclusions. Conclusions from a 7B model analyzing a single file are unreliable. Let the 32B model draw conclusions from the full picture.
  5. Not validating summaries on a sample. Before trusting a 100-file analysis, read 3-5 summaries and compare them to the original files. If the summaries miss important details, adjust the preset focus or switch to a more capable summarization model. A spot-check helper is sketched below.
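
A minimal spot-check helper for that last point (hypothetical; it prints a random sample of summaries next to the start of each original file so you can compare them by eye):

import random

def spot_check(summaries: list[dict], n: int = 3):
    """Print a few random summaries alongside the first lines of the original file."""
    for s in random.sample(summaries, min(n, len(summaries))):
        original = Path(s["file"]).read_text()
        print(f"=== {s['file']} ===")
        print("--- original (first 40 lines) ---")
        print("\n".join(original.splitlines()[:40]))
        print("--- summary ---")
        print(s["summary"])
        print()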