RAG for Codebases Without Cloud APIs#
When a codebase has hundreds of files, neither direct concatenation nor summarize-then-correlate is ideal for targeted questions like “where is authentication handled?” or “what calls the payment API?” RAG (Retrieval-Augmented Generation) indexes the codebase into a vector database and retrieves only the relevant chunks for each query.
The key advantage: query latency stays essentially flat as the codebase grows. Whether the codebase has 50 files or 5,000, only the top-K relevant chunks are retrieved and sent to the model, so the expensive step, generation over a fixed amount of context, costs the same per query.
Architecture#
Indexing (one-time, incremental updates):
```
Source files → Chunk by function/class boundaries
             → Embed chunks with local embedding model (nomic-embed-text)
             → Store vectors + metadata in ChromaDB
```

Querying (per question):

```
User question → Embed question
              → Search ChromaDB for top-K similar chunks
              → Send chunks + question to 32B model
              → Return answer with file/line references
```

Setting Up#
```bash
# Pull the embedding model
ollama pull nomic-embed-text

# Install Python dependencies
pip install chromadb ollama
```

Chunking Source Files#
Chunking strategy matters more than embedding model choice. Bad chunks produce bad retrieval.
Why Character-Count Chunking Fails for Code#
Splitting at every 500 characters breaks functions in half, separates a function signature from its body, and destroys the context that makes code meaningful. A chunk containing the second half of a function and the first half of the next is useless for retrieval.
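As a throwaway illustration (the sample function and the 80-character chunk size are made up for this example; a 500-character split fails the same way on real files):

```python
sample = '''def authenticate(user, password):
    """Check credentials against the user store."""
    record = store.lookup(user)
    if record is None:
        return False
    return verify_hash(password, record.password_hash)
'''

# Fixed-size chunking, ignoring code structure entirely.
chunks = [sample[i:i + 80] for i in range(0, len(sample), 80)]
for i, piece in enumerate(chunks):
    print(f"--- chunk {i} ---\n{piece}")
# Both split points land mid-line: no chunk is a self-contained unit of code.
```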
Language-Aware Chunking#
The ideal approach splits at function, class, and method boundaries. A simpler but effective approximation: split at blank lines or significant indentation changes, preserving at least complete blocks.
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Chunk:
    filepath: str
    start_line: int
    end_line: int
    content: str


def chunk_file(filepath: str, max_lines: int = 60, overlap: int = 5) -> list[Chunk]:
    """Chunk a file by line groups with overlap."""
    lines = Path(filepath).read_text().splitlines()

    if len(lines) <= max_lines:
        return [Chunk(
            filepath=filepath,
            start_line=1,
            end_line=len(lines),
            content="\n".join(lines),
        )]

    chunks = []
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        # Try to break at a blank line (natural boundary)
        if end < len(lines):
            for i in range(end, max(start + max_lines // 2, start), -1):
                if lines[i].strip() == "":
                    end = i
                    break
        chunk_lines = lines[start:end]
        chunks.append(Chunk(
            filepath=filepath,
            start_line=start + 1,
            end_line=end,
            content="\n".join(chunk_lines),
        ))
        # Stop after the final chunk; stepping back by the overlap at end-of-file
        # would re-emit the tail of the file forever.
        if end >= len(lines):
            break
        start = end - overlap  # Overlap preserves context at boundaries
    return chunks
```

The overlap ensures that if a function straddles a chunk boundary, the next chunk includes the end of the previous one. This prevents losing context at split points.
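For Python sources specifically, the standard-library ast module can recover the true function and class boundaries described as the ideal above. A minimal sketch that reuses the Chunk dataclass (top-level definitions only; decorators and module-level code between definitions are not captured):

```python
import ast
from pathlib import Path


def chunk_python_file(filepath: str) -> list[Chunk]:
    """Chunk a Python file at top-level function/class boundaries (sketch)."""
    source = Path(filepath).read_text()
    lines = source.splitlines()
    tree = ast.parse(source)

    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno  # 1-based, inclusive (Python 3.8+)
            chunks.append(Chunk(
                filepath=filepath,
                start_line=start,
                end_line=end,
                content="\n".join(lines[start - 1:end]),
            ))
    return chunks
```

A production version would also collect the stray lines between definitions into their own chunks, and fall back to the line-based chunker for files that fail to parse or for other languages.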
Embedding and Indexing#
```python
import chromadb
import ollama
import hashlib
import json
from pathlib import Path

EMBED_MODEL = "nomic-embed-text"
COLLECTION_NAME = "codebase"


def get_collection(persist_dir: str) -> chromadb.Collection:
    """Get or create a ChromaDB collection."""
    client = chromadb.PersistentClient(path=persist_dir)
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )


def embed_text(text: str) -> list[float]:
    """Generate embedding using local model."""
    response = ollama.embed(model=EMBED_MODEL, input=text)
    return response["embeddings"][0]


def index_file(collection: chromadb.Collection, filepath: str):
    """Chunk and index a single file."""
    chunks = chunk_file(filepath)
    ids = []
    documents = []
    embeddings = []
    metadatas = []
    for chunk in chunks:
        chunk_id = hashlib.sha256(
            f"{chunk.filepath}:{chunk.start_line}:{chunk.end_line}".encode()
        ).hexdigest()[:16]
        # Prepend filepath for better embedding context
        doc = f"File: {chunk.filepath} (lines {chunk.start_line}-{chunk.end_line})\n\n{chunk.content}"
        ids.append(chunk_id)
        documents.append(doc)
        embeddings.append(embed_text(doc))
        metadatas.append({
            "filepath": chunk.filepath,
            "start_line": chunk.start_line,
            "end_line": chunk.end_line,
        })
    collection.upsert(ids=ids, documents=documents, embeddings=embeddings, metadatas=metadatas)
    return len(chunks)


def index_codebase(directory: str, persist_dir: str):
    """Index all source files in a directory."""
    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java", ".rb", ".sh"}
    exclude_dirs = {"vendor", "node_modules", ".git", "__pycache__", "dist", "build"}
    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and not any(d in p.parts for d in exclude_dirs)
    ]

    collection = get_collection(persist_dir)
    total_chunks = 0
    for filepath in files:
        n = index_file(collection, filepath)
        total_chunks += n
        print(f" Indexed: {filepath} ({n} chunks)")

    print(f"\nTotal: {len(files)} files, {total_chunks} chunks")
```

Incremental Indexing#
Re-indexing the entire codebase on every change is wasteful. Track file modification times to only re-embed changed files:
```python
METADATA_FILE = "index_metadata.json"


def load_metadata(persist_dir: str) -> dict:
    meta_path = Path(persist_dir) / METADATA_FILE
    if meta_path.exists():
        return json.loads(meta_path.read_text())
    return {}


def save_metadata(persist_dir: str, metadata: dict):
    meta_path = Path(persist_dir) / METADATA_FILE
    meta_path.write_text(json.dumps(metadata, indent=2))


def file_fingerprint(filepath: str) -> str:
    stat = Path(filepath).stat()
    return f"{stat.st_mtime}:{stat.st_size}"


def index_incremental(directory: str, persist_dir: str):
    """Only re-index files that changed since last indexing."""
    metadata = load_metadata(persist_dir)
    collection = get_collection(persist_dir)

    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java"}
    exclude_dirs = {"vendor", "node_modules", ".git", "__pycache__"}
    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and not any(d in p.parts for d in exclude_dirs)
    ]

    changed = 0
    for filepath in files:
        fp = file_fingerprint(filepath)
        if metadata.get(filepath) != fp:
            index_file(collection, filepath)
            metadata[filepath] = fp
            changed += 1
            print(f" Re-indexed: {filepath}")

    save_metadata(persist_dir, metadata)
    print(f"\n{changed} files re-indexed out of {len(files)} total")
```

On a 200-file codebase, initial indexing takes 2-5 minutes. Subsequent runs with 5 changed files take 10-15 seconds.
Querying#
```python
QUERY_MODEL = "qwen2.5-coder:32b"
TOP_K = 10


def query_codebase(question: str, persist_dir: str) -> str:
    """Search the indexed codebase and answer a question."""
    collection = get_collection(persist_dir)

    # Embed the question
    question_embedding = embed_text(question)

    # Retrieve top-K relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=TOP_K,
    )

    # Build context from retrieved chunks (each document already carries its file/line header)
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with the 32B model
    prompt = f"""Answer the following question about a codebase using ONLY the code snippets provided below.
Reference specific file names and line numbers in your answer.
If the snippets do not contain enough information, say so.

Code snippets:

{context}

Question: {question}"""

    response = ollama.chat(
        model=QUERY_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1, "num_predict": 2048},
    )
    return response["message"]["content"]
```

Example Queries#
```python
# Find where authentication is implemented
answer = query_codebase("How is user authentication implemented? What middleware is used?", persist_dir)

# Trace data flow
answer = query_codebase("What happens when a payment is processed? Trace the flow from API to database.", persist_dir)

# Find error handling patterns
answer = query_codebase("How are errors handled and propagated in this codebase?", persist_dir)
```

Embedding Model Choice#
| Model | Dimensions | Context | Memory | Speed | Use Case |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | 8192 tokens | 274 MB | Very fast | Best all-around for code and text |
| mxbai-embed-large | 1024 | 512 tokens | 670 MB | Fast | Higher quality but shorter context |
| all-minilm | 384 | 256 tokens | 46 MB | Fastest | Minimal memory, adequate for short snippets |
nomic-embed-text is the recommended default: large enough context window for code chunks (8192 tokens covers most functions), good semantic similarity for code, and tiny memory footprint that does not compete with your generation models.
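If you want to compare candidates before committing, checking output dimensions on a representative snippet makes the table concrete (this assumes all three models have been pulled with ollama pull; it is an illustration, not part of the pipeline). Keep in mind that switching embedding models later means re-indexing from scratch, because vectors from different models are not comparable:

```python
import ollama

snippet = "def authenticate(user, password): ..."
for model in ("nomic-embed-text", "mxbai-embed-large", "all-minilm"):
    vec = ollama.embed(model=model, input=snippet)["embeddings"][0]
    print(f"{model}: {len(vec)} dimensions")
```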
RAG vs Two-Pass: When to Use Which#
| Factor | RAG | Two-Pass (Summarize-Correlate) |
|---|---|---|
| Best for | Targeted questions about specific code | Cross-cutting questions about architecture |
| Query time | Constant (seconds) | Linear with file count (minutes) |
| Setup time | Indexing required (one-time + incremental) | No setup (runs on demand) |
| Question specificity | High (finds the exact function) | Low (synthesizes across all files) |
| Context coverage | Partial (top-K chunks only) | Complete (all files summarized) |
| Storage | ChromaDB on disk | Summary cache files |
Use RAG when: “Where is X implemented?” “What calls function Y?” “Show me how errors are handled in the API layer.”
Use two-pass when: “What are the architectural patterns in this codebase?” “What inconsistencies exist across all services?” “Explain this codebase to a new developer.”
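If you want a single entry point, a rough dispatcher along these lines can route questions; the keyword list is an ad-hoc heuristic and summarize_then_correlate stands in for whatever two-pass implementation you use; neither is part of the pipeline above:

```python
ARCHITECTURE_HINTS = ("architecture", "architectural", "overall", "whole codebase",
                      "across all", "explain this codebase", "inconsistencies")


def answer_question(question: str, directory: str, persist_dir: str) -> str:
    """Route big-picture questions to two-pass, targeted ones to RAG (rough heuristic)."""
    if any(hint in question.lower() for hint in ARCHITECTURE_HINTS):
        return summarize_then_correlate(directory, question)  # hypothetical two-pass entry point
    return query_codebase(question, persist_dir)
```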
Common Mistakes#
- Chunking by character count instead of code boundaries. A chunk that splits a function in half retrieves poorly. Split at blank lines, function boundaries, or class boundaries.
- Not including the file path in the embedded text. The embedding model needs to know what file a chunk comes from. Prepending "File: path/to/file.py (lines 10-50)" improves retrieval relevance.
- Setting TOP_K too low. The answer might span multiple files. Start with TOP_K=10 and increase if answers are incomplete; the retrieval-inspection sketch after this list shows what is actually being returned. The cost of sending extra chunks to the 32B model is low compared to missing relevant context.
- Re-indexing the entire codebase on every query. Use incremental indexing based on file modification times. Only re-embed files that changed.
- Using RAG for architectural questions. RAG retrieves fragments. Architectural understanding requires seeing the whole picture. Use two-pass summarize-then-correlate for big-picture questions.
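When answers seem to be missing context, it helps to inspect what retrieval actually returned before changing anything. A small diagnostic sketch built on the functions above (ChromaDB includes distances in query results by default; with the cosine space configured earlier, lower means more similar):

```python
def inspect_retrieval(question: str, persist_dir: str, k: int = TOP_K):
    """Print which chunks a question retrieves, with their distances (debugging aid)."""
    collection = get_collection(persist_dir)
    results = collection.query(query_embeddings=[embed_text(question)], n_results=k)
    for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
        print(f"{dist:.3f}  {meta['filepath']}:{meta['start_line']}-{meta['end_line']}")
```

If the files you expect never appear in this list, the fix is usually better chunking or re-indexing, not a larger TOP_K.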