The Agent Memory Benchmark Wars: 5 Benchmarks That Exposed AI's Amnesia Problem in 12 Months
In May 2026, Microsoft released STATE-Bench, an open-source benchmark for AI agent memory. The headline finding: GPT-5.1 without memory completes fewer than half of enterprise tasks reliably. Only 30% of travel-domain tasks succeed on all five runs.
Four weeks earlier, another Microsoft team published GroupMemBench, showing that the best agent memory system reaches only 46.0% accuracy in multi-party conversations, and a simple BM25 baseline matches or beats most purpose-built memory systems.
These aren't isolated data points. Between February and May 2026, the research community published at least five distinct benchmarks for agent memory — all reaching the same uncomfortable conclusion: current memory systems don't work where it counts.
This article maps the landscape: who measures what, what they found, and why the results point toward graph-structured memory as the path forward.
The Five Benchmarks
| Benchmark | Date | Publisher | Venue | Focus | Task Count | Key Finding |
|---|---|---|---|---|---|---|
| AMA-Bench | Feb 2026 | UCSD / Meta | ICML 2026 | Long-horizon trajectory memory | 3,696 QA pairs | Best system (AMA-Agent) reaches 57% accuracy; causality graphs are critical |
| MemoryArena | Feb 2026 | Stanford / UCSD | ICML 2026 | Multi-session interdependent tasks | 766 tasks | Agents saturated on LoCoMo collapse on multi-session tasks |
| EvoMemBench | May 2026 | Multiple | arXiv | Self-evolving across episodes | 6 datasets, 15 methods | No single memory form works across all settings |
| GroupMemBench | May 2026 | Microsoft Research | arXiv | Multi-party conversation memory | 6 query categories | Best system: 46%; BM25 is competitive |
| STATE-Bench | May 2026 | Microsoft | Open source | Enterprise procedural task memory | 450 tasks | <50% pass@1 without memory; 30% pass^5 in travel |
The table tells the story: every benchmark measures a different slice of the memory problem, and every slice reveals systematic failure.
What Each Benchmark Actually Probes
AMA-Bench: Causality Is the Missing Ingredient
AMA-Bench (arXiv 2602.22769, accepted ICML 2026) evaluates long-horizon memory for agentic applications. It combines 2,496 real-world agent trajectories with expert-curated QA pairs and 1,200 synthetic trajectories at varying lengths (8K to 128K tokens). The evaluation dimensions span recall, causal inference, state updating, and state abstraction.
The critical contribution is the accompanying AMA-Agent system, which uses a Causality Graph to preserve causal dependencies within interaction histories. Instead of flat vector embeddings, AMA-Agent constructs directed causality edges between state nodes and augments similarity-based retrieval with graph node traversal and keyword search tools.
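The idea of directed causality edges between state nodes can be illustrated with a toy structure. This is a minimal sketch and not AMA-Agent's implementation: the class `CausalityGraph`, its method names, and the example states are all hypothetical, and a plain keyword match stands in for the similarity and keyword retrieval tools.

```python
from collections import defaultdict, deque


class CausalityGraph:
    """Toy directed graph: an edge cause -> effect links two state nodes."""

    def __init__(self):
        self.text = {}                  # node id -> state description
        self.causes = defaultdict(set)  # effect id -> ids of its direct causes

    def add_state(self, node_id, description):
        self.text[node_id] = description

    def add_cause(self, cause_id, effect_id):
        self.causes[effect_id].add(cause_id)

    def keyword_search(self, keyword):
        """Stand-in for retrieval: ids whose description mentions the keyword."""
        kw = keyword.lower()
        return [n for n, t in self.text.items() if kw in t.lower()]

    def causal_ancestors(self, node_id, max_hops=3):
        """Breadth-first walk backwards along causal edges, up to max_hops."""
        seen, order = {node_id}, []
        queue = deque([(node_id, 0)])
        while queue:
            current, depth = queue.popleft()
            if depth == max_hops:
                continue
            for cause in sorted(self.causes[current]):
                if cause not in seen:
                    seen.add(cause)
                    order.append(cause)
                    queue.append((cause, depth + 1))
        return order


# Hypothetical three-step trajectory.
g = CausalityGraph()
g.add_state("s1", "user opened the billing page")
g.add_state("s2", "agent clicked 'update card'")
g.add_state("s3", "payment form returned a validation error")
g.add_cause("s1", "s2")
g.add_cause("s2", "s3")

hits = g.keyword_search("validation error")  # entry point found by retrieval
context = g.causal_ancestors(hits[0])        # states that led to the hit
```

Retrieval finds the entry node, and the backward traversal then pulls in the states that caused it, which a flat similarity lookup over isolated embeddings would not necessarily return.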
Results under Qwen3-32B:
| Method | Recall | Causal Inference | State Updating | State Abstraction | Average |
|---|---|---|---|---|---|
| AMA-Agent | 0.62 | 0.61 | 0.53 | 0.47 | 0.57 |
| w/o Causality Graph | 0.48 (−22.6%) | 0.48 (−21.3%) | 0.36 (−32.1%) | 0.35 (−25.5%) | 0.43 (−24.6%) |
| w/o Tool-Augmented Retrieval | 0.47 (−24.2%) | 0.51 (−16.4%) | 0.42 (−20.8%) | 0.31 (−34.0%) | 0.44 (−22.8%) |
| HippoRAG2 (best RAG baseline) | — | — | — | — | 0.45 |
| MemoRAG (best memory method) | — | — | — | — | 0.46 |
| HiMem (Jan 2026, state-of-the-art) | — | — | — | — | 0.30 |
Removing the causality graph lowers average accuracy from 0.57 to 0.43, a 24.6% relative drop (14 percentage points). This is direct ablation evidence that graph structure, not just similarity search, is a deciding factor for agent memory.
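The percentages in the ablation rows are relative drops, not percentage-point differences. A short check, using only the scores from the table above (the function names are illustrative), makes the distinction explicit:

```python
# Scores copied from the AMA-Bench table (Qwen3-32B).
FULL = {"Recall": 0.62, "Causal Inference": 0.61, "State Updating": 0.53,
        "State Abstraction": 0.47, "Average": 0.57}
NO_GRAPH = {"Recall": 0.48, "Causal Inference": 0.48, "State Updating": 0.36,
            "State Abstraction": 0.35, "Average": 0.43}


def relative_drop(before, after):
    """Drop as a percentage of the original score."""
    return (before - after) / before * 100


def absolute_drop_points(before, after):
    """Drop in percentage points."""
    return (before - after) * 100


for name in FULL:
    rel = relative_drop(FULL[name], NO_GRAPH[name])
    pts = absolute_drop_points(FULL[name], NO_GRAPH[name])
    print(f"{name}: -{rel:.1f}% relative, -{pts:.0f} points")
```

For the average column this gives a 24.6% relative drop, which corresponds to 14 percentage points.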
MemoryArena: Multi-Session Decay
MemoryArena (arXiv 2602.16313, accepted ICML 2026) takes a different approach. Instead of post-hoc QA on trajectories, it evaluates agents in a Memory-Agent-Environment loop across 766 tasks spanning web navigation, preference-constrained planning, progressive information search, and formal reasoning. Each task has interdependent subtasks where later actions depend on information from earlier sessions — an average of 57 action steps per task.
The key metric is SR@k (Success Rate at subtask depth k), measuring how well agents sustain execution as dependencies compound across sessions. Across all methods, performance decays with each additional subtask. The finding: RAG-based systems degrade more slowly than agents with external memory modules, and long-context windows alone are insufficient when traces exceed 122K tokens.
Most tellingly, agents that saturate existing benchmarks like LoCoMo (which tests conversational fact recall) perform poorly on MemoryArena, demonstrating that recall ≠ functional memory.
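To make the decay concrete, here is a minimal sketch of how an SR@k curve could be computed. It assumes one plausible reading of the metric: a task counts as successful at depth k only if its first k subtasks all succeeded. The paper's exact definition may differ, and the outcomes below are invented for illustration.

```python
def sr_at_k(task_outcomes, k):
    """Share of tasks (with at least k subtasks) whose first k subtasks all succeeded."""
    eligible = [t for t in task_outcomes if len(t) >= k]
    if not eligible:
        raise ValueError(f"no task has {k} subtasks")
    return sum(all(t[:k]) for t in eligible) / len(eligible)


# Hypothetical per-subtask outcomes for four tasks (True = subtask succeeded).
runs = [
    [True, True, True, False],
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, True],
]

curve = [sr_at_k(runs, k) for k in range(1, 5)]
print(curve)  # [1.0, 0.75, 0.5, 0.25]
```

Under this definition the curve can only stay flat or fall as k grows, so the interesting quantity is how quickly it falls for a given memory design.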
GroupMemBench: Multi-User Collapse
GroupMemBench (Microsoft Research, May 2026) exposes the sharpest failure mode yet. All existing memory systems assume a single user, but production agents serve teams. The benchmark tests six categories: multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention.
| Method | Multi-Hop | Update | Ambiguity | Implicit | Temporal | Abstention | Average |
|---|---|---|---|---|---|---|---|
| BM25 | 40.1 | 25.2 | 14.2 | 40.8 | 54.9 | 78.0 | 43.2 |
| HippoRAG | 39.6 | 27.1 | 30.2 | 42.9 | 29.6 | 75.2 | 39.7 |
| Hindsight (best memory system) | 42.3 | 17.8 | 37.7 | 40.8 | 54.9 | 77.1 | 46.0 |
| GraphRAG | 12.1 | 14.0 | 19.8 | 14.3 | 5.6 | 67.0 | 20.6 |
A bare keyword-matching baseline (BM25) beats four of five agent memory systems on average. GraphRAG, designed for document-level knowledge graphs, collapses to 20.6% because its community detection flattens the threaded conversation structure that multi-party memory depends on. Knowledge update (tracking who said what, and which information supersedes older claims) peaks at just 27.1% among the systems shown, with HippoRAG.
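For readers who have not seen how little machinery the BM25 baseline needs, the sketch below implements Okapi BM25 scoring in plain Python. It is not the benchmark's code: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three example messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; speaker labels such as "alice:" stay as tokens.
    return text.lower().split()


class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in documents]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        self.term_freqs = [Counter(d) for d in self.docs]
        doc_freq = Counter(term for d in self.docs for term in set(d))
        n = len(self.docs)
        # Smoothed IDF; the "1 +" inside the log keeps every weight non-negative.
        self.idf = {t: math.log(1 + (n - df + 0.5) / (df + 0.5))
                    for t, df in doc_freq.items()}

    def score(self, query, index):
        tf = self.term_freqs[index]
        length_norm = 1 - self.b + self.b * len(self.docs[index]) / self.avg_len
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Document indices, best match first (ties broken by index)."""
        scores = [(self.score(query, i), i) for i in range(len(self.docs))]
        return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]


# Hypothetical multi-party messages.
messages = [
    "alice: the launch date moved to friday",
    "bob: budget review is on tuesday",
    "carol: friday works for the launch rehearsal",
]
bm25 = BM25(messages)
ranking = bm25.rank("launch date")
```

Nothing here is learned or compressed: the index is just term counts over the raw messages, which is the property the benchmark result highlights.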
EvoMemBench: No One-Size-Fits-All
EvoMemBench (arXiv, May 2026) takes a taxonomic approach, organising memory evaluation along two axes: scope (in-episode vs. cross-episode) and content (knowledge-oriented vs. execution-oriented). It evaluates 15 memory methods across 6 datasets.
The systematic result: no single memory architecture dominates. Retrieval-augmented methods (BM25, embedding-based) win on knowledge-intensive tasks. Procedural long-term memory methods (ReasoningBank, AWM) win on execution-oriented tasks like tool use and embodied control. Short-term compression methods (MemAgent, MemoBrain) frequently underperform the no-memory baseline.
The GraphRAG entry in this benchmark is instructive: it achieves 74.2% on in-episode knowledge retrieval but collapses to 10.5% on cross-episode execution tasks. Graph-based community summaries help with fact recall but fail for action guidance.
STATE-Bench: Enterprise Procedural Memory
STATE-Bench (Microsoft, May 2026) returns to the practical question that motivated the whole space: does memory make agents better at their jobs? Three domains (travel, customer support, shopping), 450 tasks with deterministic state assertions, and four metrics — task completion (pass@1, pass^5), cost efficiency, UX score, and per-run consistency.
The baseline measurement with GPT-5.1 (no memory):
| Domain | pass@1 | pass^5 |
|---|---|---|
| Travel | ~45% | ~30% |
| Customer Support | ~50% | ~35% |
| Shopping | ~55% | ~40% |
The gap between pass@1 and pass^5 is the headline: agents are unreliable even on identical tasks. The memory-agnostic design lets researchers plug in any memory architecture and measure whether it closes the consistency gap.
STATE-Bench also introduces an LLM-based user simulator with personality variants — one user is impatient and provides incomplete details, another is cooperative — forcing agents to actively gather information rather than assume.
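The difference between pass@1 and pass^5 discussed above is easy to see on a small example. The sketch below assumes pass@1 is estimated as the mean single-run success rate and pass^k as the share of tasks where all k runs succeed; STATE-Bench's exact estimators may differ, and the run outcomes are invented.

```python
def pass_at_1(results):
    """Mean single-run success rate over every run of every task."""
    runs = [ok for task in results for ok in task]
    return sum(runs) / len(runs)


def pass_hat_k(results, k):
    """Share of tasks whose first k runs all succeeded (written pass^k)."""
    if any(len(task) < k for task in results):
        raise ValueError(f"every task needs at least {k} runs")
    return sum(all(task[:k]) for task in results) / len(results)


# Hypothetical outcomes: 4 tasks, 5 runs each (True = run passed).
results = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [False, False, True, False, False],
    [True, True, True, True, True],
]

print(pass_at_1(results))      # 0.75
print(pass_hat_k(results, 5))  # 0.5
```

A single failed run out of five leaves pass@1 almost untouched but removes the whole task from pass^5, which is why the second metric exposes inconsistency that the first one averages away.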
The Synthesis: Five Benchmarks, One Pattern
Stepping back, a clear pattern emerges:
- Causality matters more than similarity. AMA-Bench's ablation study shows that removing the causality graph costs 24.6% of average accuracy in relative terms (0.57 to 0.43). Vector similarity alone is insufficient for agent memory.
- Structure beats compression. GroupMemBench shows that BM25, which preserves lexical structure, matches or beats learned memory systems that compress and lose information. GraphRAG's community detection, designed for document summarisation, flattens the thread-level structure multi-party memory depends on.
- Recall ≠ functional memory. MemoryArena exposes the gap between recalling a fact (which agents can do) and using it to guide action across interdependent sessions (which they cannot). The correlation between QA accuracy and end-to-end success is high (Pearson 0.96–0.98), but absolute performance on both axes remains low.
- No universal architecture exists. EvoMemBench shows that knowledge tasks favour retrieval-based methods, while execution tasks favour procedural memory. Any production system needs hybrid architectures.
- The bar is lower than you think. When BM25 is competitive with purpose-built memory systems (GroupMemBench), and the best system achieves 57% accuracy (AMA-Bench), the field is clearly in its early stages.
Where Graphs Fit
The graph connection is not incidental. Three of the five benchmarks bear on graph-structured memory, one through direct ablation evidence and two through the shape of the problem they expose:
- AMA-Agent's causality graph is the single best-performing architecture at 57.22% accuracy. Directed edges between state nodes preserve the causal chains that flat embeddings erase.
- GroupMemBench's multi-party structure is inherently a graph: who said what to whom, in which order, across which threads. Current systems flatten this structure during ingestion. Memory systems that preserve conversational thread graphs would likely outperform current approaches.
- STATE-Bench's enterprise procedures have natural graph representations: policy check → eligibility validation → fee calculation → confirmation. Each step depends on previous state. A graph memory that tracks procedural state transitions could improve the consistency (pass^5) that STATE-Bench measures.
The counterpoint is equally important: GraphRAG's community detection, as evaluated in GroupMemBench and EvoMemBench, is the wrong kind of graph structure for agent memory. Document-level knowledge graphs optimised for global summarisation collapse on task-specific procedural memory. The graph needs to be at the right granularity: action-level causality graphs, not corpus-level topic clusters.
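As an illustration of what action-level granularity could look like, the sketch below encodes the procedure chain mentioned above (policy check, eligibility validation, fee calculation, confirmation) as a prerequisite graph and tracks which steps are currently allowed. Everything here is hypothetical: `PROCEDURE`, `ProcedureMemory`, and its methods are illustrative names, not part of STATE-Bench or any evaluated system.

```python
# Each step maps to the steps that must be completed before it.
PROCEDURE = {
    "policy_check": [],
    "eligibility_validation": ["policy_check"],
    "fee_calculation": ["eligibility_validation"],
    "confirmation": ["fee_calculation"],
}


class ProcedureMemory:
    """Tracks completed steps and rejects steps whose prerequisites are unmet."""

    def __init__(self, prerequisites):
        self.prerequisites = prerequisites
        self.completed = []

    def ready_steps(self):
        done = set(self.completed)
        return [step for step, pre in self.prerequisites.items()
                if step not in done and all(p in done for p in pre)]

    def record(self, step):
        if step not in self.ready_steps():
            raise ValueError(f"{step!r} is not allowed yet; ready: {self.ready_steps()}")
        self.completed.append(step)


mem = ProcedureMemory(PROCEDURE)
mem.record("policy_check")
mem.record("eligibility_validation")
print(mem.ready_steps())  # ['fee_calculation']
```

Because the memory stores state transitions rather than topic summaries, an agent consulting it gets the same answer about the next valid step on every run, which is the kind of consistency pass^5 rewards.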
Gaps and Future Directions
| Gap | Evidence | What's Missing |
|---|---|---|
| Cross-domain memory | Each benchmark evaluates memory within a single domain at a time | No benchmark measures whether an agent that learns travel policies can transfer procedural skill to shopping |
| Memory compaction under pressure | EvoMemBench varies context budgets (16K–128K) | No benchmark measures graceful degradation under extreme token pressure or during multi-day operation |
| Deterministic correctness | Most benchmarks use LLM judges or approximate metrics; STATE-Bench's deterministic state assertions are the exception | Production systems need verifiable state assertions like STATE-Bench's, but applied to memory content itself |
| Real-world deployment duration | Max episode length: ~128K tokens (AMA-Bench synthetic) | Production agents run for weeks; no benchmark tests memory drift over hundreds of sessions |
| Hybrid graph + vector memory | AMA-Agent uses both, but no benchmark compares architectures systematically | The graph vs. vector vs. hybrid question remains unanswered at benchmark scale |
A Practical Decision Matrix
For teams building production agents today, the choice of memory architecture should depend on your failure mode:
| Your Primary Failure Mode | Look At | Benchmark Evidence |
|---|---|---|
| Agent forgets facts across long conversations | Retrieval-augmented memory (BM25, embedding) | GroupMemBench: BM25 matches learned systems |
| Agent can't learn from past task execution | Procedural memory (ReasoningBank, AWM) | EvoMemBench: procedural methods win on execution |
| Agent is inconsistent on identical tasks | Deterministic state memory + pass^5 evaluation | STATE-Bench: pass^5 gap is the metric |
| Agent serves multiple users simultaneously | This is still unsolved — watch this space | GroupMemBench: best system at 46% |
| Agent needs causal reasoning across sessions | Causality graph memory (AMA-Agent approach) | AMA-Bench: removing the causality graph cuts average accuracy by 24.6% (relative) |
The Takeaway
The 2026 agent memory benchmark wave has done the field a service: it has replaced intuition with measurement. The uncomfortable findings (that BM25 matches or beats most purpose-built memory systems, that removing the causality graph costs AMA-Agent about a quarter of its accuracy, that multi-party memory is essentially unsolved) are more valuable than any single memory architecture.
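For graph practitioners, the message is clear. Graph structures for agent memory need to be action-level causality graphs, not document-level knowledge graphs. The right granularity, edge semantics, and temporal ordering separate a system that tops its benchmark at 57% (AMA-Agent on AMA-Bench) from one that collapses to about 20% (GraphRAG on GroupMemBench), although those two figures come from different benchmarks and are not directly comparable.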
The field is now waiting for a benchmark that combines all these dimensions: procedural, multi-session, multi-user, cross-domain, and measured over weeks of simulated operation. Until then, the five benchmarks provide the best evidence we have — and the evidence says current memory systems are not good enough.