Two-Pass Analysis: Summarize-Then-Correlate#
A 32B model with a 32K context window can process roughly 8-10 source files at once. A real codebase has hundreds. Concatenating everything into one prompt fails — the context overflows, quality degrades, and the model either truncates or hallucinates connections.
The two-pass pattern solves this by splitting analysis into two stages:
- Pass 1 (Summarize): A fast 7B model reads each file independently and produces a focused summary.
- Pass 2 (Correlate): A capable 32B model reads all summaries (which are much shorter than the original files) and answers the cross-cutting question.
This effectively multiplies your context window by the compression ratio of summarization — typically 10-20x. A 32K context that handles 10 files directly can handle 100-200 files through summaries.
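A quick back-of-envelope check of that multiplier, with illustrative numbers (the average file and summary sizes below are assumptions, not measurements):
# Rough context-budget arithmetic for the two-pass pattern.
CONTEXT = 32_000         # tokens available to the correlation model
AVG_FILE = 3_000         # assumed average source file, in tokens
AVG_SUMMARY = 200        # assumed summary size, in tokens
OVERHEAD = 2_000         # question, instructions, and answer headroom

direct = (CONTEXT - OVERHEAD) // AVG_FILE       # ~10 files fit raw
two_pass = (CONTEXT - OVERHEAD) // AVG_SUMMARY  # ~150 files via summaries
print(direct, two_pass)  # 10 150 -> roughly a 15x multiplier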
Architecture#
Source Files (100+ files, 500K+ tokens total)
│
├── file1.py ──→ 7B Model ──→ Summary (~200 tokens)
├── file2.py ──→ 7B Model ──→ Summary (~200 tokens)
├── file3.go ──→ 7B Model ──→ Summary (~200 tokens)
│ ... (parallel, 3 workers)
└── fileN.rs ──→ 7B Model ──→ Summary (~200 tokens)
│
│ Total summaries: ~20K tokens (fits in 32K context)
│
└──→ 32B Model + All Summaries + Question ──→ Analysis
Implementation#
Pass 1: Parallel Summarization#
import ollama
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
SUMMARY_MODEL = "qwen2.5-coder:7b"
MAX_WORKERS = 3 # Ollama single-threads models; 3 workers avoids overwhelming it
PRESETS = {
    "architecture": {
        "focus": "dependencies, imports, data flow, coupling between components",
        "question": "How do the components of this codebase fit together?",
    },
    "security": {
        "focus": "input validation, authentication, secrets handling, error exposure",
        "question": "What security gaps exist in this codebase?",
    },
    "consistency": {
        "focus": "error handling patterns, naming conventions, code style",
        "question": "What inconsistencies exist across this codebase?",
    },
    "review": {
        "focus": "bugs, edge cases, unchecked assumptions, error handling",
        "question": "What bugs and issues exist in this codebase?",
    },
    "onboard": {
        "focus": "purpose, entry points, key abstractions, domain concepts",
        "question": "Explain this codebase to a new developer.",
    },
}
def summarize_file(filepath: str, preset: str) -> dict:
    """Summarize a single file using the 7B model."""
    content = Path(filepath).read_text()
    focus = PRESETS[preset]["focus"]
    prompt = f"""Summarize this source file with focus on: {focus}
Be specific. Reference function names, types, and concrete details.
Keep the summary under 300 words.
File: {filepath}

{content}"""
    response = ollama.chat(
        model=SUMMARY_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.0, "num_predict": 512},
    )
    return {
        "file": filepath,
        "summary": response["message"]["content"],
        "tokens": response.get("eval_count", 0),
    }
def summarize_all(files: list[str], preset: str) -> list[dict]:
    """Summarize all files in parallel."""
    summaries = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(summarize_file, f, preset): f for f in files}
        for future in as_completed(futures):
            filepath = futures[future]
            try:
                result = future.result()
                summaries.append(result)
                print(f" Summarized: {filepath} ({result['tokens']} tokens)")
            except Exception as e:
                print(f" Failed: {filepath}: {e}")
    return sorted(summaries, key=lambda s: s["file"])

Pass 2: Correlation#
CORRELATE_MODEL = "qwen2.5-coder:32b"
def correlate(summaries: list[dict], preset: str) -> str:
    """Correlate all summaries to answer the cross-cutting question."""
    question = PRESETS[preset]["question"]
    summary_text = "\n\n".join(
        f"### {s['file']}\n{s['summary']}" for s in summaries
    )
    prompt = f"""You are analyzing a codebase. Below are summaries of each file.

{summary_text}

Based on these summaries, answer this question:
{question}

Reference specific file names when making observations.
Organize your response by theme, not by file."""
    response = ollama.chat(
        model=CORRELATE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1, "num_predict": 4096},
    )
    return response["message"]["content"]

Full Pipeline#
def analyze_codebase(directory: str, preset: str = "architecture"):
    """Run the full two-pass analysis."""
    # Discover source files
    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java"}
    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and "vendor" not in str(p) and "node_modules" not in str(p)
    ]
    print(f"Found {len(files)} files. Preset: {preset}")

    # Pass 1: Summarize
    print("\n--- Pass 1: Summarizing files ---")
    summaries = summarize_all(files, preset)

    # Pass 2: Correlate
    print("\n--- Pass 2: Correlating summaries ---")
    analysis = correlate(summaries, preset)
    return analysis

Caching Summaries#
Summarization is the expensive step (many API calls). Cache summaries and reuse them across different questions:
import hashlib
CACHE_DIR = Path.home() / ".cache" / "codebase-analysis"
def file_hash(filepath: str) -> str:
    """Hash based on path + mtime + size for change detection."""
    stat = Path(filepath).stat()
    key = f"{filepath}:{stat.st_mtime}:{stat.st_size}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

def load_cached_summaries(files: list[str], preset: str) -> tuple[list[dict], list[str]]:
    """Load cached summaries and return list of files needing summarization."""
    cache_file = CACHE_DIR / f"{preset}_summaries.json"
    cached = {}
    if cache_file.exists():
        cached = {s["file"]: s for s in json.loads(cache_file.read_text())}
    hit = []
    miss = []
    for f in files:
        fhash = file_hash(f)
        if f in cached and cached[f].get("hash") == fhash:
            hit.append(cached[f])
        else:
            miss.append(f)
    return hit, miss

def save_summaries(summaries: list[dict], preset: str):
    """Save summaries to cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{preset}_summaries.json"
    # Add file hashes
    for s in summaries:
        s["hash"] = file_hash(s["file"])
    cache_file.write_text(json.dumps(summaries, indent=2))

With caching, the first analysis of a 100-file codebase takes 5-10 minutes. Subsequent analyses with different questions (but the same files) reuse the cached summaries and only run the correlation step — a single 32B call that takes 30-60 seconds.
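The cache helpers above are not yet wired into analyze_codebase; one way to combine them (summarize_with_cache is a name introduced here, not part of the pipeline above):
def summarize_with_cache(files: list[str], preset: str) -> list[dict]:
    """Pass 1 with caching: only summarize files that changed."""
    hit, miss = load_cached_summaries(files, preset)
    print(f"Cache: {len(hit)} hits, {len(miss)} to summarize")
    fresh = summarize_all(miss, preset) if miss else []
    summaries = sorted(hit + fresh, key=lambda s: s["file"])
    save_summaries(summaries, preset)  # re-save so new hashes are recorded
    return summaries
Swapping this in for the summarize_all call inside analyze_codebase means unchanged files skip Pass 1 entirely on every rerun.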
Presets as Reusable Workflows#
Presets let you analyze the same codebase from different angles without rewriting prompts:
# Architecture overview
python analyze.py ~/projects/my-app --preset architecture
# Security review
python analyze.py ~/projects/my-app --preset security
# Onboarding guide
python analyze.py ~/projects/my-app --preset onboard

Each preset changes the summarization focus (what the 7B model looks for in each file) and the correlation question (what the 32B model synthesizes from the summaries).
Adding a new preset is a one-line change — define the focus and question. The two-pass infrastructure handles the rest.
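The analyze.py commands above assume a small CLI entry point; a minimal sketch using argparse:
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Two-pass codebase analysis")
    parser.add_argument("directory", help="root of the codebase to analyze")
    parser.add_argument("--preset", default="architecture", choices=list(PRESETS))
    args = parser.parse_args()
    print(analyze_codebase(args.directory, args.preset))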
When Two-Pass Breaks Down#
The pattern has limits:
- Summarization is lossy. The 7B model may miss subtle details that matter for the correlation question. If you get suspicious results, spot-check a few summaries against the original files.
- Cross-file dependencies at the token level. If two files share a specific variable name or magic constant that only matters in combination, the summarizer may not preserve that detail. Targeted extraction (asking for specific fields) helps.
- Very large files. A single file that exceeds the 7B model’s context window needs to be chunked before summarization. Split at function or class boundaries (a sketch follows this list).
- Real-time analysis. The parallel summarization step takes minutes for large codebases. This is a batch pattern, not an interactive one.
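For the large-file case, a minimal chunker sketch. It is Python-source-specific, chunk_source is an illustrative name, and the default threshold assumes roughly 4 characters per token:
def chunk_source(content: str, max_chars: int = 12_000) -> list[str]:
    """Split source at top-level def/class boundaries, never mid-function."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in content.splitlines(keepends=True):
        # Flush the current chunk once it is large enough and we are at
        # a safe boundary (a new top-level definition).
        if current and size >= max_chars and line.startswith(("def ", "class ")):
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
Each chunk can then be summarized separately and the per-chunk summaries merged under the file's heading before correlation.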
For these cases, consider RAG (semantic search over the codebase) or targeted extraction (pulling specific structured data from each file instead of free-form summaries).
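For the token-level dependency case, targeted extraction swaps the free-form summary prompt for a fixed schema. A sketch of what that Pass 1 call could look like (the field list here is illustrative, not prescriptive):
EXTRACT_PROMPT = """Extract from this source file, as JSON:
{{"exports": [...], "imports": [...], "constants": [...], "entry_points": [...]}}
Return only the JSON object.

File: {filepath}

{content}"""

def extract_file(filepath: str) -> dict:
    content = Path(filepath).read_text()
    response = ollama.chat(
        model=SUMMARY_MODEL,
        messages=[{"role": "user", "content": EXTRACT_PROMPT.format(
            filepath=filepath, content=content)}],
        options={"temperature": 0.0, "num_predict": 512},
        format="json",  # constrain output to valid JSON
    )
    return json.loads(response["message"]["content"])
Because every file yields the same fields, Pass 2 can compare constants and imports across files directly instead of hoping two prose summaries mention the same token.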
Common Mistakes#
- Using too many parallel workers. Ollama runs one inference at a time per model. More than 3 workers creates a queue that does not improve throughput but increases memory pressure. Measure actual parallelism before increasing workers (see the benchmark sketch at the end of this section).
- Not caching summaries. Re-summarizing 100 files every time you change the correlation question wastes 90% of the work. Cache summaries and invalidate only when files change.
- Summarizing with the same model used for correlation. The point of two passes is using a fast, cheap model for the N-file summarization and a capable model for the single correlation. Using the 32B model for both slows every one of the N summarization calls by the 7B-to-32B speed gap, with no benefit at the correlation step.
- Asking the summarizer to answer the question. The summarizer should capture relevant facts, not draw conclusions. Conclusions from a 7B model analyzing a single file are unreliable. Let the 32B model draw conclusions from the full picture.
- Not validating summaries on a sample. Before trusting a 100-file analysis, read 3-5 summaries and compare them to the original files. If the summaries miss important details, adjust the preset focus or switch to a more capable summarization model.
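A quick way to run the measurement the first bullet asks for (a rough benchmark sketch, not part of the tooling above; it assumes the code lives in the same module as MAX_WORKERS):
import time

def time_pass1(sample: list[str], preset: str, workers: int) -> float:
    """Time summarization of a small sample at a given worker count."""
    global MAX_WORKERS
    MAX_WORKERS = workers  # summarize_all reads this module-level setting
    start = time.perf_counter()
    summarize_all(sample, preset)
    return time.perf_counter() - start

# If 6 workers is not meaningfully faster than 3, the extra workers are
# just queuing behind Ollama's single in-flight inference.
sample = [str(p) for p in Path(".").rglob("*.py")][:6]
for n in (1, 3, 6):
    print(f"{n} workers: {time_pass1(sample, 'review', n):.1f}s")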