Agent Memory and Retrieval#
An agent without memory repeats mistakes, forgets context, and relearns the same facts every session. An agent with too much memory wastes context window tokens on irrelevant history and retrieves noise instead of signal. Effective memory sits between these extremes – storing what matters, retrieving what is relevant, and forgetting what is stale.
This reference covers the concrete patterns for building agent memory systems, from simple file-based approaches to production-grade retrieval pipelines.
Memory Tiers#
Agent memory operates on three tiers, each with different latency, capacity, and persistence characteristics.
Tier 1: Working Memory (Context Window). The current conversation. Zero retrieval latency because it is already in the prompt. Capacity is limited by the model’s context window (8K to 200K tokens depending on model). Everything here is immediately accessible but expensive in tokens. This is where the current task description, recent tool results, and active reasoning live.
Tier 2: Session Memory (Short-Term Store). Facts from the current session that have been summarized or compressed to save context window space. Stored in memory or a temporary file. Survives context window truncation but not session termination. Retrieval requires a lookup but is fast (milliseconds).
Tier 3: Persistent Memory (Long-Term Store). Facts that survive across sessions. Stored in databases, files, or vector stores. Retrieval requires search and may take tens to hundreds of milliseconds. This is where project knowledge, learned preferences, and episodic records live.
Short-Term Context Management#
The context window is the most constrained resource. Every token of memory is a token not available for reasoning. Manage it aggressively.
Sliding window with summarization. When the conversation grows long, summarize older turns instead of dropping them entirely. The agent retains the gist without the full token cost.
class SlidingContextManager:
    def __init__(self, llm, max_tokens: int = 100000, summary_threshold: int = 80000):
        self.llm = llm  # Client used for summarization calls
        self.messages = []
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold

    def estimate_tokens(self) -> int:
        # Rough heuristic: ~4 characters per token; swap in a real tokenizer if available
        return sum(len(str(m.get("content", ""))) for m in self.messages) // 4

    async def add_message(self, message: dict):
        self.messages.append(message)
        if self.estimate_tokens() > self.summary_threshold:
            await self.compress_old_messages()

    async def compress_old_messages(self):
        # Keep the system prompt and the last N messages intact
        keep_recent = 10
        to_summarize = self.messages[1:-keep_recent]  # Skip system prompt
        if not to_summarize:
            return
        summary = await self.llm.summarize(
            "Summarize the key facts, decisions, and findings from this conversation "
            "segment. Preserve specific values, file paths, and error messages.",
            to_summarize
        )
        # Replace old messages with a single summary message
        self.messages = (
            [self.messages[0]]  # System prompt
            + [{"role": "system", "content": f"[Session summary]: {summary}"}]
            + self.messages[-keep_recent:]
        )

Priority-based inclusion. Not all context is equally important. Rank items by relevance to the current task and include the highest-priority items first.
Priority order (highest to lowest):
- System instructions and safety constraints
- The current user request
- Tool results from the current reasoning chain
- Retrieved long-term memories matching the current task
- Earlier conversation turns from this session
- General project context
Token budgeting. Allocate fixed token budgets per category. System instructions get 2K tokens. Retrieved memories get 4K. Current conversation gets 20K. This prevents any single category from consuming the entire window.
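A minimal sketch of that allocation, filling each category up to its budget in priority order. The count_tokens helper and the category names here are illustrative placeholders, not a fixed API; the budget numbers are the examples above.

def count_tokens(text: str) -> int:
    # Crude approximation: ~4 characters per token; use your provider's tokenizer in practice
    return len(text) // 4

def build_context(sections: dict[str, list[str]], budgets: dict[str, int]) -> list[str]:
    """Include items category by category until each category's budget is spent."""
    included = []
    for category, budget in budgets.items():  # dict order doubles as priority order
        used = 0
        for item in sections.get(category, []):
            cost = count_tokens(item)
            if used + cost > budget:
                break  # Budget for this category exhausted; move to the next
            included.append(item)
            used += cost
    return included

budgets = {
    "system_instructions": 2000,
    "current_request": 2000,
    "tool_results": 8000,
    "retrieved_memories": 4000,
    "conversation": 20000,
}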
Long-Term Memory Storage#
Structured Key-Value Storage#
For facts with known keys, a key-value store is the simplest and fastest retrieval mechanism. The agent knows what it is looking for and queries by key.
import sqlite3

class StructuredMemory:
    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS memory (
                key TEXT PRIMARY KEY,
                value TEXT NOT NULL,
                category TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                last_accessed TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                access_count INTEGER DEFAULT 0
            )
        """)

    def store(self, key: str, value: str, category: str = "general"):
        self.conn.execute(
            "INSERT OR REPLACE INTO memory (key, value, category) VALUES (?, ?, ?)",
            (key, value, category)
        )
        self.conn.commit()

    def retrieve(self, key: str) -> str | None:
        # Track access time and count for eviction and relevance scoring
        self.conn.execute(
            "UPDATE memory SET last_accessed = CURRENT_TIMESTAMP, "
            "access_count = access_count + 1 WHERE key = ?", (key,)
        )
        self.conn.commit()
        row = self.conn.execute("SELECT value FROM memory WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def retrieve_by_category(self, category: str) -> list[tuple[str, str]]:
        rows = self.conn.execute(
            "SELECT key, value FROM memory WHERE category = ? ORDER BY last_accessed DESC",
            (category,)
        ).fetchall()
        return rows

Best for: Project metadata (framework, database type, deploy target), user preferences, configuration values, tool locations. Anything where you know the key at query time.
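A brief usage sketch; the keys and values are illustrative.

# Illustrative keys and values, not prescribed conventions
memory = StructuredMemory("agent_memory.db")
memory.store("project.database", "PostgreSQL 15 on RDS", category="project")
memory.store("project.framework", "FastAPI with SQLAlchemy", category="project")

db = memory.retrieve("project.database")            # -> "PostgreSQL 15 on RDS"
project_facts = memory.retrieve_by_category("project")  # -> [(key, value), ...]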
Vector Database Storage#
When the agent does not know exactly what it needs – “find anything relevant to this database timeout error” – vector search retrieves memories by semantic similarity.
import chromadb

class VectorMemory:
    def __init__(self, persist_dir: str):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(
            name="agent_memory",
            metadata={"hnsw:space": "cosine"}
        )

    def store(self, memory_id: str, text: str, metadata: dict | None = None):
        self.collection.add(
            ids=[memory_id],
            documents=[text],
            metadatas=[metadata or {}]
        )

    def retrieve(self, query: str, n_results: int = 5,
                 where: dict | None = None) -> list[dict]:
        results = self.collection.query(
            query_texts=[query],
            n_results=n_results,
            where=where  # Filter by metadata: {"category": "debugging"}
        )
        return [
            {"id": mem_id, "text": doc, "metadata": meta, "distance": dist}
            for mem_id, doc, meta, dist in zip(
                results["ids"][0], results["documents"][0],
                results["metadatas"][0], results["distances"][0]
            )
        ]

Best for: Episodic memories (past debugging sessions, incident resolutions), large knowledge bases, situations where the query is a natural language description of the problem.
Hybrid Storage#
In practice, use both. Structured storage for known-key lookups. Vector storage for semantic search. A unified interface queries both and merges results.
class HybridMemory:
    def __init__(self, structured: StructuredMemory, vector: VectorMemory):
        self.structured = structured
        self.vector = vector

    def store(self, key: str, text: str, category: str, enable_search: bool = True):
        self.structured.store(key, text, category)
        if enable_search:
            self.vector.store(key, text, {"category": category, "key": key})

    def retrieve(self, query: str, category: str | None = None) -> list[dict]:
        results = []
        # Try exact key match first
        exact = self.structured.retrieve(query)
        if exact:
            results.append({"source": "exact", "text": exact, "score": 1.0})
        # Then semantic search
        where = {"category": category} if category else None
        similar = self.vector.retrieve(query, n_results=5, where=where)
        for item in similar:
            results.append({
                "source": "vector", "text": item["text"],
                "score": 1.0 - item["distance"]  # Convert distance to similarity
            })
        # Deduplicate and sort by score
        seen = set()
        unique = []
        for r in sorted(results, key=lambda x: x["score"], reverse=True):
            if r["text"] not in seen:
                seen.add(r["text"])
                unique.append(r)
        return unique

RAG Patterns for Agent Memory#
Retrieval-Augmented Generation applied to agent memory follows a specific pipeline: index memories at storage time, retrieve relevant memories at query time, inject them into the prompt.
Indexing strategy matters. Do not embed raw conversation turns. Extract structured facts first, then embed those. “The user said they want to fix the timeout” is noise. “Production database connection pool exhausted at pool_size=5” is a retrievable fact.
import hashlib
import json

async def extract_and_store(conversation_turn: str, llm_client, memory: HybridMemory):
    """Extract memorable facts from a conversation turn and store them."""
    facts = await llm_client.generate(
        "Extract specific, factual statements from this conversation turn. "
        "Return as a JSON array of strings. Include specific values, paths, "
        "configurations, and decisions. Exclude chitchat and filler.",
        conversation_turn
    )
    for fact in json.loads(facts):
        # Hash the fact text for a stable, deduplicating memory ID
        memory_id = hashlib.sha256(fact.encode()).hexdigest()[:16]
        memory.store(memory_id, fact, category="extracted_fact")

Retrieval at task start. When a new task begins, retrieve relevant memories before the agent starts reasoning. This front-loads context that the agent would otherwise need to rediscover.
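A sketch of that step, using the HybridMemory interface defined earlier; the prompt layout is illustrative.

def start_task(task_description: str, memory: HybridMemory) -> str:
    """Retrieve relevant memories and prepend them to the task prompt."""
    retrieved = memory.retrieve(task_description)
    if not retrieved:
        return task_description
    context_lines = "\n".join(f"- {r['text']}" for r in retrieved[:10])
    return (
        "[Retrieved context for current task]\n"
        f"{context_lines}\n\n"
        f"[Task]\n{task_description}"
    )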
Re-ranking after retrieval. Vector similarity is a rough filter. After retrieving the top 20 candidates, re-rank them using the LLM itself or a cross-encoder model to push the most relevant results to the top.
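A sketch using a cross-encoder from the sentence-transformers library; the model name is one common choice, and an LLM scoring prompt works equally well.

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly, which is more
# accurate than comparing independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Re-rank retrieved memories and keep the top_k most relevant."""
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]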
Memory Eviction Policies#
Memory stores grow without bound unless you evict old or irrelevant entries. Four eviction strategies, each suited to different memory types.
TTL (Time-to-Live). Memories expire after a fixed period. Good for episodic memories where recency is a strong proxy for relevance. A debugging session from six months ago is rarely useful.
LRU (Least Recently Used). Evict memories that have not been retrieved recently. If a memory is never accessed, it is not useful. Track access timestamps and prune the bottom percentile periodically.
LFU (Least Frequently Used). Evict memories with the lowest access count. Similar to LRU but favors memories that were historically useful even if not accessed recently. Better for project facts that are accessed in bursts.
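Against the StructuredMemory schema above, TTL, LRU, and LFU all reduce to simple DELETE statements. A sketch, with illustrative thresholds:

def evict(conn, ttl_days: int = 180, max_rows: int = 10000):
    """Apply a TTL, then trim to max_rows keeping the most recently and frequently used entries."""
    # TTL: drop anything older than ttl_days
    conn.execute(
        "DELETE FROM memory WHERE created_at < datetime('now', ?)",
        (f"-{ttl_days} days",)
    )
    # LRU with LFU as tiebreaker: keep only the max_rows best-ranked entries
    conn.execute(
        """
        DELETE FROM memory WHERE key NOT IN (
            SELECT key FROM memory
            ORDER BY last_accessed DESC, access_count DESC
            LIMIT ?
        )
        """,
        (max_rows,)
    )
    conn.commit()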
Relevance decay. Combine recency and access frequency into a single score. Memories decay over time but each access refreshes their score. This mirrors how human memory works – frequently accessed recent memories are strongest.
import math
import time

def relevance_score(created_at: float, last_accessed: float, access_count: int,
                    decay_rate: float = 0.01) -> float:
    """Score combining recency, frequency, and time decay."""
    age_days = (time.time() - created_at) / 86400
    days_since_access = (time.time() - last_accessed) / 86400
    recency_score = math.exp(-decay_rate * days_since_access)
    frequency_score = math.log1p(access_count)
    age_penalty = math.exp(-decay_rate * age_days * 0.1)
    return recency_score * frequency_score * age_penalty

Context Window Optimization#
When injecting retrieved memories into the prompt, minimize token waste.
Deduplicate. If two memories convey the same fact in different words, include only the more specific one. “The database is PostgreSQL” and “We use PostgreSQL 15 on RDS” are redundant – keep the second.
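One rough approach: embed the candidate memories and, when two are near-duplicates, keep the longer one, which usually carries more specifics. A sketch assuming sentence-transformers for embeddings; the 0.9 similarity threshold is an arbitrary starting point.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def deduplicate(memories: list[str], threshold: float = 0.9) -> list[str]:
    """Drop memories that are near-duplicates of a longer (more specific) one."""
    if not memories:
        return []
    ordered = sorted(memories, key=len, reverse=True)  # Longer texts first
    embeddings = embedder.encode(ordered)
    kept_idx: list[int] = []
    for i in range(len(ordered)):
        is_duplicate = any(
            util.cos_sim(embeddings[i], embeddings[j]).item() > threshold
            for j in kept_idx
        )
        if not is_duplicate:
            kept_idx.append(i)
    return [ordered[i] for i in kept_idx]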
Summarize clusters. If 10 retrieved memories all relate to the same topic, summarize them into a single paragraph instead of including all 10 verbatim.
Format for the model. Present memories in a structured format the model can scan quickly. Labeled sections outperform raw text dumps.
[Retrieved context for current task]
- Project uses PostgreSQL 15 on RDS with connection pooling via PgBouncer
- Last timeout incident (Feb 15): pool_size was 5, increased to 20
- Connection string format: postgresql+asyncpg://user:pass@host:5432/dbname
- SQLAlchemy pool_pre_ping=True prevents stale connections after maintenance

Measure and iterate. Track whether retrieved memories actually influence the agent’s output. If the agent never references a category of retrieved memory, stop retrieving it. Every token of irrelevant context is a token of reasoning capacity wasted.
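One rough signal, assuming you log which memories were injected alongside each task's output: check whether distinctive tokens from each memory appear in the final response, and aggregate hit rates per category.

from collections import defaultdict

def memory_hit_rates(logs: list[dict]) -> dict[str, float]:
    """Fraction of injected memories per category whose content shows up in the output.

    Each log entry is assumed to look like:
    {"output": str, "memories": [{"category": str, "text": str}, ...]}
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for entry in logs:
        output = entry["output"].lower()
        for mem in entry["memories"]:
            totals[mem["category"]] += 1
            # Distinctive tokens: longer words, e.g. paths, identifiers, specific values
            tokens = [w for w in mem["text"].lower().split() if len(w) > 5]
            if tokens and any(t in output for t in tokens):
                hits[mem["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}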