
MemGPT: Virtual Context Management for LLM Agents

· 6 min read
Mike Thrift
Marketing Manager

The constraint that limits most LLM agents is not intelligence — it's memory. I've been thinking about this concretely in the context of Beancount ledgers that span years of transactions: no matter how capable the underlying model, once the ledger history exceeds the context window, the agent starts forgetting. MemGPT (Packer et al., UC Berkeley, 2023) attacks this problem directly by borrowing a solution that operating systems solved decades ago.

The paper


"MemGPT: Towards LLMs as Operating Systems" (Packer, Wooders, Lin, Fang, Patil, Stoica, Gonzalez; arXiv:2310.08560) proposes virtual context management — a deliberate analogy to how OSes create the illusion of large virtual memory by paging between fast RAM and slow disk. The LLM's context window plays the role of RAM: scarce, fast, directly accessible. Two external stores play the role of disk: a recall store (recent message history) and an archival store (a searchable long-term database for arbitrary text). The agent itself decides what to read in from external storage and what to evict from context, using explicit function calls — tools that move data between tiers. The system triggers an eviction warning at 70% context capacity and forces a flush at 100%, generating a recursive summary of evicted messages to avoid total information loss.
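The eviction policy described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the 8,192-token budget, and the pluggable `count_tokens`/`summarize` callbacks are all assumptions for the sake of the example; only the 70%/100% thresholds and the recursive-summary behavior come from the paper.

```python
# Sketch of MemGPT-style context management (hypothetical names).
# A warning fires at 70% capacity to nudge the agent toward archiving;
# at 100%, the oldest messages are flushed into a recursive summary.

WARNING_FRACTION = 0.7   # paper: eviction warning at 70% capacity
CONTEXT_LIMIT = 8192     # assumed token budget for illustration

def manage_context(messages, summary, count_tokens, summarize):
    """Return (messages, summary, warning_issued) after applying the policy."""
    used = sum(count_tokens(m) for m in messages)
    warning = used >= WARNING_FRACTION * CONTEXT_LIMIT
    while used >= CONTEXT_LIMIT and messages:
        evicted = messages.pop(0)              # evict oldest first
        summary = summarize(summary, evicted)  # fold into recursive summary
        used -= count_tokens(evicted)
    return messages, summary, warning
```

The key design point is that eviction is lossy but not total: the recursive summary keeps a compressed trace of what left the window, and the full originals remain queryable from recall or archival storage.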

The paper evaluates MemGPT on two domains: multi-session conversational agents (the Multi-Session Chat dataset) and document analysis over large corpora that exceed the model's native context window.

Key ideas

  • Three memory tiers: in-context working memory (fast, limited), recall storage (recent messages, searchable), and archival storage (long-term, indexed). The agent writes to all three via tool calls.
  • Deep Memory Retrieval (DMR): the evaluation task that requires consistent recall across multiple past sessions. With GPT-4, the standard fixed-context baseline achieves 32.1% accuracy; MemGPT jumps it to 92.5%. GPT-4 Turbo baseline: 35.3% → 93.4%.
  • Nested key-value retrieval: the document-analysis stress test. Standard GPT-4 hits 0% accuracy at three levels of nesting; MemGPT with GPT-4 sustains performance by making iterative archival lookups.
  • Control flow via interrupts: the agent signals when it needs more time (to issue memory operations) before responding, analogous to an OS interrupt. This keeps the system responsive without forcing everything into a single inference pass.
  • The eviction problem: when context is full, content is summarized and flushed. Recursive summarization preserves the gist but inevitably loses detail — a tradeoff the paper acknowledges but does not fully quantify.
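The nested key-value result is easiest to see in code. In this task a retrieved value may itself be a key, so resolving the final answer requires chained lookups. A fixed-context model has to spot every hop in a single pass; a tool-mediated agent just loops. The sketch below is hypothetical (the paper's agent issues these lookups via LLM-generated function calls, not a hand-written loop):

```python
# Nested key-value retrieval, sketched as an explicit loop. Each
# archival lookup may return another key, so the agent keeps querying
# until it reaches a terminal value.

def resolve_nested_key(key, archival_lookup, max_hops=10):
    """Follow key -> value chains through archival storage."""
    value = archival_lookup(key)
    for _ in range(max_hops):
        next_value = archival_lookup(value)
        if next_value is None:   # value is terminal, not another key
            return value
        value = next_value
    raise RuntimeError("nesting deeper than max_hops")
```

This is why the baseline's 0% at three nesting levels is so telling: the failure is not about capacity but about the inability to iterate.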

What holds up — and what doesn't

The DMR numbers are striking: a 60-point accuracy gap between MemGPT and a standard GPT-4 baseline on the Multi-Session Chat dataset is not noise. The nested KV result — baselines failing at 0% while MemGPT continues to work — demonstrates something real about the value of iterative, tool-mediated retrieval versus passive long-context exposure. This connects to Liu et al.'s "Lost in the Middle" finding (arXiv:2307.03172): even when information physically fits in the context window, models degrade for content buried in the middle. MemGPT sidesteps this by retrieving only what is immediately needed.

That said, the evaluation has real holes. The Multi-Session Chat dataset is narrow — human-generated persona chats with tightly controlled formats. How the approach scales to messier real-world conversations or domain-specific corpora (financial filings, regulatory correspondence) is untested. The archival storage in the experiments is a simple vector database; whether retrieval quality remains high as the archive grows into the millions of documents is left open. More fundamentally: the agent's retrieval strategy is only as good as its queries. If the agent doesn't know what it doesn't know — a common failure mode in long-horizon tasks — it will never issue the right archival lookup, and the whole architecture collapses gracefully into the same fixed-context failure mode.

There is also a latency cost that the paper treats lightly. Every archival lookup is an additional LLM inference call (to generate the query) plus a vector search. For a Beancount agent handling a routine reconciliation over years of data, this could multiply into many round-trips per response. The paper does not report wall-clock latency comparisons.
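A back-of-envelope model makes the cost concrete. All numbers below are illustrative assumptions, not measurements from the paper:

```python
# Illustrative latency model: each archival lookup costs one LLM call
# (to form the query) plus one vector search, and the agent then makes
# a final call to compose its answer. Timings are assumed, not measured.

def response_latency(n_lookups, llm_call_s=2.0, vector_search_s=0.05,
                     final_answer_s=2.0):
    return n_lookups * (llm_call_s + vector_search_s) + final_answer_s

# Under these assumptions, 5 lookups yields 12.25 s,
# versus 2.0 s for a single-pass answer.
```

Even with generous assumptions, lookup-heavy tasks pay a multiplicative latency tax over single-pass responses, which matters for interactive reconciliation workflows.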

Subsequent work has sharpened these critiques. A-MEM (arXiv:2502.12110) claims at least 2× better performance than MemGPT on multi-hop tasks, arguing that MemGPT's rigid tier structure underperforms more dynamic memory curation. Mem0 benchmarks (2024-2025) show competing approaches outperforming MemGPT on accuracy and speed in some settings. The original authors have since evolved the project into Letta (September 2024), an open-source agent framework with asynchronous "sleep-time compute" for memory consolidation — an implicit acknowledgment that the synchronous, single-agent design has scaling limits.

Why this matters for finance AI

A Beancount ledger for a small business accumulates tens of thousands of transactions over a decade. An agent tasked with year-end reconciliation, anomaly investigation, or multi-year trend analysis cannot fit everything in context. MemGPT's three-tier design maps almost directly: working memory holds the current transaction batch under review; recall storage holds recent session context (what we were reconciling last time); archival storage holds the full ledger history, journal entries, and prior anomaly reports. The function-call interface for memory operations is essentially the same interface the agent already needs for write-back operations — this is not a new capability class, just a new application of the same tool-calling machinery.
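A minimal sketch of that mapping, with the memory tiers exposed through a single tool-call dispatcher. All class and tool names here are hypothetical illustrations, not MemGPT's actual function set:

```python
# Hypothetical three-tier memory for a Beancount agent, driven by the
# same tool-call dispatch the agent already uses for ledger operations.

class LedgerMemory:
    def __init__(self):
        self.working = []   # current transaction batch under review
        self.recall = []    # recent session context
        self.archive = {}   # full ledger history, keyed by doc id

    def handle_tool_call(self, name, **kwargs):
        if name == "archival_insert":
            self.archive[kwargs["doc_id"]] = kwargs["text"]
        elif name == "archival_search":
            query = kwargs["query"]
            return [t for t in self.archive.values() if query in t]
        elif name == "recall_append":
            self.recall.append(kwargs["message"])
        else:
            raise ValueError(f"unknown tool: {name}")
```

In practice the archival tier would be a vector index over journal entries rather than a substring match, but the interface shape, named tools that read and write across tiers, is the point.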

The deeper relevance is the framing shift: instead of asking "can we fit more in context?", MemGPT asks "can the agent manage its own attention?" For finance, that is the right question. A tax audit may surface a question about a transaction from three years ago. A competent human accountant retrieves the original invoice, cross-checks it against the ledger, and recalls the policy context from that year. That retrieval-on-demand behavior is exactly what MemGPT trains us to design for.

The honest caveat: MemGPT was not evaluated on financial data, and financial documents are structurally different from persona chats. Retrieval quality over dense numerical data, multi-currency transactions, and double-entry accounting schemas will need its own benchmark.

Further reading

  • Lost in the Middle: How Language Models Use Long Contexts (Liu et al., arXiv:2307.03172) — the empirical foundation for why longer context windows alone don't solve the problem; models fail to attend to middle-document content, which motivates retrieval-based approaches like MemGPT.
  • A-MEM: Agentic Memory for LLM Agents (arXiv:2502.12110) — a 2025 follow-up claiming superior multi-hop memory performance by replacing MemGPT's rigid tier structure with dynamic memory curation; a necessary comparison point.
  • Gorilla: Large Language Model Connected with Massive APIs (arXiv:2305.15334) — next on this reading list; the retrieval-augmented tool-calling design there complements MemGPT's memory management by addressing how agents select the right tool from a large API surface.