RAG for Codebases Without Cloud APIs#

When a codebase has hundreds of files, neither direct concatenation nor summarize-then-correlate is ideal for targeted questions like “where is authentication handled?” or “what calls the payment API?” RAG (Retrieval-Augmented Generation) indexes the codebase into a vector database and retrieves only the relevant chunks for each query.

The key advantage: query time stays essentially constant regardless of codebase size. Whether the codebase has 50 files or 5,000, only the top-K relevant chunks are retrieved and sent to the model, so the generation step costs the same; the vector search itself scales sub-linearly with index size.

Architecture#

Indexing (one-time, incremental updates):
  Source files → Chunk by function/class boundaries
    → Embed chunks with local embedding model (nomic-embed-text)
    → Store vectors + metadata in ChromaDB

Querying (per question):
  User question → Embed question
    → Search ChromaDB for top-K similar chunks
    → Send chunks + question to 32B model
    → Return answer with file/line references

Setting Up#

# Pull the embedding model
ollama pull nomic-embed-text

# Install Python dependencies
pip install chromadb ollama
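
Before indexing anything, it is worth confirming the embedding model responds. A quick sanity check, assuming the Ollama server is running locally with the model pulled:

import ollama

# nomic-embed-text returns 768-dimensional vectors
resp = ollama.embed(model="nomic-embed-text", input="def hello(): pass")
print(len(resp["embeddings"][0]))  # expect 768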

Chunking Source Files#

Chunking strategy matters more than embedding model choice. Bad chunks produce bad retrieval.

Why Character-Count Chunking Fails for Code#

Splitting at every 500 characters breaks functions in half, separates a function signature from its body, and destroys the context that makes code meaningful. A chunk containing the second half of a function and the first half of the next is useless for retrieval.
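
To make the failure concrete, here is a toy sketch (the snippet and the 80-character chunk size are made up for illustration) that splits a small function at a fixed character count:

sample = '''def charge_card(user, amount):
    """Charge the user's card and record the transaction."""
    tx = create_transaction(user, amount)
    return gateway.submit(tx)
'''

CHUNK_SIZE = 80  # deliberately small to show the failure mode
pieces = [sample[i:i + CHUNK_SIZE] for i in range(0, len(sample), CHUNK_SIZE)]
for n, piece in enumerate(pieces, 1):
    print(f"--- chunk {n} ---\n{piece}")
# Chunk 1 ends mid-docstring and chunk 2 is a headless function body;
# neither embeds to anything a query like "where are cards charged?" matches well.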

Language-Aware Chunking#

The ideal approach splits at function, class, and method boundaries. A simpler but effective approximation: split at blank lines or significant indentation changes, which at least keeps complete blocks intact.

from dataclasses import dataclass
from pathlib import Path

@dataclass
class Chunk:
    filepath: str
    start_line: int
    end_line: int
    content: str

def chunk_file(filepath: str, max_lines: int = 60, overlap: int = 5) -> list[Chunk]:
    """Chunk a file by line groups with overlap."""
    lines = Path(filepath).read_text().splitlines()

    if len(lines) <= max_lines:
        return [Chunk(
            filepath=filepath,
            start_line=1,
            end_line=len(lines),
            content="\n".join(lines),
        )]

    chunks = []
    start = 0

    while start < len(lines):
        end = min(start + max_lines, len(lines))

        # Try to break at a blank line (natural boundary)
        if end < len(lines):
            for i in range(end, max(start + max_lines // 2, start), -1):
                if lines[i].strip() == "":
                    end = i
                    break

        chunk_lines = lines[start:end]
        chunks.append(Chunk(
            filepath=filepath,
            start_line=start + 1,
            end_line=end,
            content="\n".join(chunk_lines),
        ))

        if end >= len(lines):
            break  # Final chunk reached; without this, the tail would be re-emitted forever

        start = end - overlap  # Overlap preserves context at boundaries

    return chunks

The overlap ensures that if a function straddles a chunk boundary, the next chunk includes the end of the previous one. This prevents losing context at split points.
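
A quick way to eyeball the boundaries this produces, using a placeholder path:

for c in chunk_file("src/payments/api.py"):
    span = c.end_line - c.start_line + 1
    print(f"{c.filepath}:{c.start_line}-{c.end_line} ({span} lines)")

If most chunks end mid-function, lower max_lines or tighten the blank-line heuristic before indexing.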

Embedding and Indexing#

import chromadb
import ollama
import hashlib
import json
from pathlib import Path

EMBED_MODEL = "nomic-embed-text"
COLLECTION_NAME = "codebase"

def get_collection(persist_dir: str) -> chromadb.Collection:
    """Get or create a ChromaDB collection."""
    client = chromadb.PersistentClient(path=persist_dir)
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )

def embed_text(text: str) -> list[float]:
    """Generate embedding using local model."""
    response = ollama.embed(model=EMBED_MODEL, input=text)
    return response["embeddings"][0]

def index_file(collection: chromadb.Collection, filepath: str):
    """Chunk and index a single file."""
    chunks = chunk_file(filepath)

    ids = []
    documents = []
    embeddings = []
    metadatas = []

    for chunk in chunks:
        chunk_id = hashlib.sha256(
            f"{chunk.filepath}:{chunk.start_line}:{chunk.end_line}".encode()
        ).hexdigest()[:16]

        # Prepend filepath for better embedding context
        doc = f"File: {chunk.filepath} (lines {chunk.start_line}-{chunk.end_line})\n\n{chunk.content}"

        ids.append(chunk_id)
        documents.append(doc)
        embeddings.append(embed_text(doc))
        metadatas.append({
            "filepath": chunk.filepath,
            "start_line": chunk.start_line,
            "end_line": chunk.end_line,
        })

    collection.upsert(ids=ids, documents=documents, embeddings=embeddings, metadatas=metadatas)
    return len(chunks)

def index_codebase(directory: str, persist_dir: str):
    """Index all source files in a directory."""
    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java", ".rb", ".sh"}
    exclude_dirs = {"vendor", "node_modules", ".git", "__pycache__", "dist", "build"}

    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and not any(d in p.parts for d in exclude_dirs)
    ]

    collection = get_collection(persist_dir)
    total_chunks = 0

    for filepath in files:
        n = index_file(collection, filepath)
        total_chunks += n
        print(f"  Indexed: {filepath} ({n} chunks)")

    print(f"\nTotal: {len(files)} files, {total_chunks} chunks")

Incremental Indexing#

Re-indexing the entire codebase on every change is wasteful. Track file modification times to only re-embed changed files:

METADATA_FILE = "index_metadata.json"

def load_metadata(persist_dir: str) -> dict:
    meta_path = Path(persist_dir) / METADATA_FILE
    if meta_path.exists():
        return json.loads(meta_path.read_text())
    return {}

def save_metadata(persist_dir: str, metadata: dict):
    meta_path = Path(persist_dir) / METADATA_FILE
    meta_path.write_text(json.dumps(metadata, indent=2))

def file_fingerprint(filepath: str) -> str:
    stat = Path(filepath).stat()
    return f"{stat.st_mtime}:{stat.st_size}"

def index_incremental(directory: str, persist_dir: str):
    """Only re-index files that changed since last indexing."""
    metadata = load_metadata(persist_dir)
    collection = get_collection(persist_dir)

    extensions = {".py", ".go", ".rs", ".ts", ".js", ".java"}
    exclude_dirs = {"vendor", "node_modules", ".git", "__pycache__"}

    files = [
        str(p) for p in Path(directory).rglob("*")
        if p.suffix in extensions and not any(d in p.parts for d in exclude_dirs)
    ]

    changed = 0
    for filepath in files:
        fp = file_fingerprint(filepath)
        if metadata.get(filepath) != fp:
            index_file(collection, filepath)
            metadata[filepath] = fp
            changed += 1
            print(f"  Re-indexed: {filepath}")

    save_metadata(persist_dir, metadata)
    print(f"\n{changed} files re-indexed out of {len(files)} total")

On a 200-file codebase, initial indexing takes 2-5 minutes. Subsequent runs with 5 changed files take 10-15 seconds.

Querying#

QUERY_MODEL = "qwen2.5-coder:32b"
TOP_K = 10

def query_codebase(question: str, persist_dir: str) -> str:
    """Search the indexed codebase and answer a question."""
    collection = get_collection(persist_dir)

    # Embed the question
    question_embedding = embed_text(question)

    # Retrieve top-K relevant chunks
    results = collection.query(
        query_embeddings=[question_embedding],
        n_results=TOP_K,
    )

    # Build context from retrieved chunks (file/line info is already prepended to each document)
    context = "\n\n---\n\n".join(results["documents"][0])

    # Generate answer with the 32B model
    prompt = f"""Answer the following question about a codebase using ONLY the code snippets provided below.
Reference specific file names and line numbers in your answer.
If the snippets do not contain enough information, say so.

Code snippets:
{context}

Question: {question}"""

    response = ollama.chat(
        model=QUERY_MODEL,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.1, "num_predict": 2048},
    )

    return response["message"]["content"]

Example Queries#

# Find where authentication is implemented
answer = query_codebase("How is user authentication implemented? What middleware is used?", persist_dir)

# Trace data flow
answer = query_codebase("What happens when a payment is processed? Trace the flow from API to database.", persist_dir)

# Find error handling patterns
answer = query_codebase("How are errors handled and propagated in this codebase?", persist_dir)

Embedding Model Choice#

| Model | Dimensions | Context | Memory | Speed | Use Case |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | 8192 tokens | 274 MB | Very fast | Best all-around for code and text |
| mxbai-embed-large | 1024 | 512 tokens | 670 MB | Fast | Higher quality but shorter context |
| all-minilm | 384 | 256 tokens | 46 MB | Fastest | Minimal memory, adequate for short snippets |

nomic-embed-text is the recommended default: large enough context window for code chunks (8192 tokens covers most functions), good semantic similarity for code, and tiny memory footprint that does not compete with your generation models.
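
One caveat when switching models: embedding dimensionality differs (768, 1024, and 384 in the table above), and a ChromaDB collection is tied to the dimensionality of the vectors first added to it. A minimal way to handle this, sketched below, is one collection per embedding model, with a full re-index after switching:

EMBED_MODEL = "mxbai-embed-large"
COLLECTION_NAME = f"codebase_{EMBED_MODEL.replace('-', '_').replace(':', '_')}"
# Rebuild with index_codebase() after changing these constants;
# vectors from different embedding models are not comparable and must not be mixed.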

RAG vs Two-Pass: When to Use Which#

| Factor | RAG | Two-Pass (Summarize-Correlate) |
|---|---|---|
| Best for | Targeted questions about specific code | Cross-cutting questions about architecture |
| Query time | Constant (seconds) | Linear with file count (minutes) |
| Setup time | Indexing required (one-time + incremental) | No setup (runs on demand) |
| Question specificity | High (finds the exact function) | Low (synthesizes across all files) |
| Context coverage | Partial (top-K chunks only) | Complete (all files summarized) |
| Storage | ChromaDB on disk | Summary cache files |

Use RAG when: “Where is X implemented?” “What calls function Y?” “Show me how errors are handled in the API layer.”

Use two-pass when: “What are the architectural patterns in this codebase?” “What inconsistencies exist across all services?” “Explain this codebase to a new developer.”
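
If you want the choice automated, a rough keyword heuristic can route questions between the two approaches. This is only a sketch: summarize_then_correlate is a hypothetical stand-in for the two-pass pipeline described earlier.

ARCHITECTURE_HINTS = ("architecture", "overall", "pattern", "across", "inconsistenc", "new developer")

def answer(question: str, persist_dir: str) -> str:
    """Route big-picture questions to two-pass, targeted ones to RAG."""
    q = question.lower()
    if any(hint in q for hint in ARCHITECTURE_HINTS):
        return summarize_then_correlate(question)  # hypothetical two-pass entry point
    return query_codebase(question, persist_dir)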

Common Mistakes#

  1. Chunking by character count instead of code boundaries. A chunk that splits a function in half retrieves poorly. Split at blank lines, function boundaries, or class boundaries.
  2. Not including the file path in the embedded text. The embedding model needs to know what file a chunk comes from. Prepending File: path/to/file.py (lines 10-50) improves retrieval relevance.
  3. Setting TOP_K too low. The answer might span multiple files. Start with TOP_K=10 and increase if answers are incomplete (see the sketch after this list). The cost of sending extra chunks to the 32B model is low compared to missing relevant context.
  4. Re-indexing the entire codebase on every query. Use incremental indexing based on file modification times. Only re-embed files that changed.
  5. Using RAG for architectural questions. RAG retrieves fragments. Architectural understanding requires seeing the whole picture. Use two-pass summarize-then-correlate for big-picture questions.
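
For point 3, one option (a sketch reusing the helpers defined above) is to expose the retrieval width as a parameter so that an incomplete answer can trigger a wider retry:

def retrieve_chunks(question: str, persist_dir: str, k: int = TOP_K) -> list[str]:
    """Return the top-k chunk documents for a question."""
    collection = get_collection(persist_dir)
    results = collection.query(
        query_embeddings=[embed_text(question)],
        n_results=k,
    )
    return results["documents"][0]

# Start with the default k=10; if the model reports insufficient context,
# retry with k=20 or k=30 before concluding the code is not indexed.
chunks = retrieve_chunks("What calls the payment API?", persist_dir)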