The Agent Memory Benchmark Wars: 5 Benchmarks That Exposed AI's Amnesia Problem in 12 Months

In May 2026, Microsoft released STATE-Bench, an open-source benchmark for AI agent memory. The headline finding: GPT-5.1 without memory completes fewer than half of enterprise tasks reliably. Only 30% of travel-domain tasks succeed on all five runs.

Four weeks earlier, another Microsoft team published GroupMemBench, showing that the best agent memory system reaches only 46.0% accuracy in multi-party conversations, and a simple BM25 baseline matches or beats most purpose-built memory systems.

These aren't isolated data points. Between February and May 2026, the research community published at least five distinct benchmarks for agent memory — all reaching the same uncomfortable conclusion: current memory systems don't work where it counts.

This article maps the landscape: who measures what, what they found, and why the results point toward graph-structured memory as the path forward.

The Five Benchmarks

Benchmark	Date	Publisher	Venue	Focus	Scale	Key Finding
AMA-Bench	Feb 2026	UCSD / Meta	ICML 2026	Long-horizon trajectory memory	3,696 QA pairs	Best system (AMA-Agent) reaches 57% accuracy; causality graphs are critical
MemoryArena	Feb 2026	Stanford / UCSD	ICML 2026	Multi-session interdependent tasks	766 tasks	Agents saturated on LoCoMo collapse on multi-session tasks
EvoMemBench	May 2026	Multiple	arXiv	Self-evolving across episodes	6 datasets, 15 methods	No single memory form works across all settings
GroupMemBench	May 2026	Microsoft Research	arXiv	Multi-party conversation memory	6 query categories	Best system: 46%; BM25 is competitive
STATE-Bench	May 2026	Microsoft	Open source	Enterprise procedural task memory	450 tasks	Without memory, fewer than half of tasks pass all five runs (pass^5); ~30% in travel

The table tells the story: every benchmark measures a different slice of the memory problem, and every slice reveals systematic failure.

What Each Benchmark Actually Probes

AMA-Bench: Causality Is the Missing Ingredient

AMA-Bench (arXiv 2602.22769, accepted ICML 2026) evaluates long-horizon memory for agentic applications. It combines 2,496 real-world agent trajectories with expert-curated QA pairs and 1,200 synthetic trajectories at varying lengths (8K to 128K tokens). The evaluation dimensions span recall, causal inference, state updating, and state abstraction.

The critical contribution is the accompanying AMA-Agent system, which uses a Causality Graph to preserve causal dependencies within interaction histories. Instead of flat vector embeddings, AMA-Agent constructs directed causality edges between state nodes and augments similarity-based retrieval with graph node traversal and keyword search tools.
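To make the idea concrete, here is a minimal sketch of graph-augmented retrieval. It is not AMA-Agent's implementation: the class name, the keyword-overlap seeding (a stand-in for similarity search), and the hop limit are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class CausalityGraph:
    """Toy directed graph: an edge cause -> effect links two state nodes."""

    def __init__(self):
        self.text = {}                    # node id -> description of the state
        self.causes = defaultdict(list)   # effect id -> ids of its direct causes

    def add_state(self, node_id, text):
        self.text[node_id] = text

    def add_cause(self, cause_id, effect_id):
        self.causes[effect_id].append(cause_id)

    def keyword_seeds(self, query):
        """Stand-in for similarity search: nodes sharing a word with the query."""
        words = set(query.lower().split())
        return [n for n, t in self.text.items() if words & set(t.lower().split())]

    def retrieve(self, query, max_hops=2):
        """Return seed nodes plus their causal ancestors within max_hops."""
        seeds = self.keyword_seeds(query)
        seen = set(seeds)
        queue = deque((s, 0) for s in seeds)
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for cause in self.causes[node]:
                if cause not in seen:
                    seen.add(cause)
                    queue.append((cause, depth + 1))
        # Report in insertion order, which here doubles as time order.
        return [n for n in self.text if n in seen]


graph = CausalityGraph()
graph.add_state("s1", "user changed shipping address")
graph.add_state("s2", "tax recalculated for new region")
graph.add_state("s3", "order total increased")
graph.add_state("s4", "newsletter preference saved")
graph.add_cause("s1", "s2")
graph.add_cause("s2", "s3")

query = "why did order total increase"
print(graph.keyword_seeds(query))   # -> ['s3']
print(graph.retrieve(query))        # -> ['s1', 's2', 's3']
```

Surface matching alone finds only the symptom (`s3`); following the directed edges backwards recovers the address change that caused it, even though that node shares no words with the question.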

Results under Qwen3-32B:

Method	Recall	Causal Inference	State Updating	State Abstraction	Average
AMA-Agent	0.62	0.61	0.53	0.47	0.57
w/o Causality Graph	0.48 (−22.6%)	0.48 (−21.3%)	0.36 (−32.1%)	0.35 (−25.5%)	0.43 (−24.6%)
w/o Tool-Augmented Retrieval	0.47 (−24.2%)	0.51 (−16.4%)	0.42 (−20.8%)	0.31 (−34.0%)	0.44 (−22.8%)
HippoRAG2 (best RAG baseline)	—	—	—	—	0.45
MemoRAG (best memory method)	—	—	—	—	0.46
HiMem (Jan 2026, state-of-the-art)	—	—	—	—	0.30

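Removing the causality graph cuts average accuracy from 0.57 to 0.43: a 14-point absolute drop, or 24.6% relative to the full system (all percentages in the table are relative). This is direct ablation evidence that graph structure, not just similarity search, carries much of the system's accuracy. Removing tool-augmented retrieval costs nearly as much (22.8%), so the two components appear to be complementary.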
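The percentages in the ablation table are relative to the full system, which is easy to misread as percentage points. The short check below recomputes both views from the table's own numbers; the dictionary keys and the function name are illustrative.

```python
# Scores copied from the ablation table (Qwen3-32B).
full_system = {"recall": 0.62, "causal": 0.61, "update": 0.53,
               "abstraction": 0.47, "average": 0.57}
without_graph = {"recall": 0.48, "causal": 0.48, "update": 0.36,
                 "abstraction": 0.35, "average": 0.43}


def drops(full, ablated):
    """Return {dimension: (absolute drop in points, relative drop in %)}."""
    result = {}
    for name, score in full.items():
        delta = score - ablated[name]
        result[name] = (round(delta * 100, 1), round(delta / score * 100, 1))
    return result


for name, (points, relative) in drops(full_system, without_graph).items():
    print(f"{name:12s} -{points} points  (-{relative}% relative)")
```

On the average column this gives a 14.0-point drop, which is the 24.6% relative figure quoted in the table.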

MemoryArena: Multi-Session Decay

MemoryArena (arXiv 2602.16313, accepted ICML 2026) takes a different approach. Instead of post-hoc QA on trajectories, it evaluates agents in a Memory-Agent-Environment loop across 766 tasks spanning web navigation, preference-constrained planning, progressive information search, and formal reasoning. Each task has interdependent subtasks where later actions depend on information from earlier sessions — an average of 57 action steps per task.

The key metric is SR@k (Success Rate at subtask depth k), measuring how well agents sustain execution as dependencies compound across sessions. Across all methods, performance decays with each additional subtask. The finding: RAG-based systems degrade more slowly than agents with external memory modules, and long-context windows alone are insufficient when traces exceed 122K tokens.
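A depth-indexed success rate can be sketched as follows. The exact definition used by MemoryArena may differ; this version assumes a task counts as successful at depth k only if its first k subtasks all succeeded, and that tasks with fewer than k subtasks are left out of the denominator. The function name and sample data are hypothetical.

```python
def sr_at_k(task_results, k):
    """Share of eligible tasks whose first k subtasks all succeeded.

    task_results: one list of booleans per task, in subtask order.
    Assumption: tasks with fewer than k subtasks are excluded.
    """
    eligible = [r for r in task_results if len(r) >= k]
    if not eligible:
        return None
    return sum(all(r[:k]) for r in eligible) / len(eligible)


# Hypothetical outcomes for four tasks (True = subtask succeeded).
results = [
    [True, True, True, False],
    [True, True, False, False],
    [True, False, False],
    [True, True, True, True],
]

curve = [sr_at_k(results, k) for k in range(1, 5)]
print(curve)
```

Under this definition the curve can only fall or stay flat as k grows for a fixed set of tasks, which matches the decay pattern described above.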

Most tellingly, agents that saturate existing benchmarks like LoCoMo (which tests conversational fact recall) perform poorly on MemoryArena, demonstrating that recall ≠ functional memory.

GroupMemBench: Multi-User Collapse

GroupMemBench (Microsoft Research, May 2026) exposes the sharpest failure mode yet. All existing memory systems assume a single user, but production agents serve teams. The benchmark tests six categories: multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention.

Method	Multi-Hop	Update	Ambiguity	Implicit	Temporal	Abstention	Average
BM25	40.1	25.2	14.2	40.8	54.9	78.0	43.2
HippoRAG	39.6	27.1	30.2	42.9	29.6	75.2	39.7
Hindsight (best memory system)	42.3	17.8	37.7	40.8	54.9	77.1	46.0
GraphRAG	12.1	14.0	19.8	14.3	5.6	67.0	20.6

A bare keyword-matching baseline (BM25) beats four of the five agent memory systems evaluated on average accuracy; of the three shown above, only Hindsight edges past it. GraphRAG, designed for document-level knowledge graphs, collapses to 20.6% because its community detection flattens the threaded conversation structure that multi-party memory depends on. Knowledge update, which requires tracking who said what and which information supersedes older claims, tops out at just 27.1% (HippoRAG).
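For readers who have not looked at the baseline closely, the sketch below shows how little machinery BM25 needs. It uses the common Okapi formulation with a non-negative IDF and typical default parameters (`k1=1.5`, `b=0.75`); the tokenisation by whitespace and the sample messages are assumptions, and the benchmark's own BM25 configuration is not described here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalisation: longer-than-average messages are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical group-chat messages; speaker tags are kept as plain tokens.
messages = [
    "alice: the launch moved to friday",
    "bob: lunch orders are due today",
    "carol: friday launch needs a rollback plan",
]
scores = bm25_scores("launch friday", messages)
ranked = sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)
print([messages[i] for i in ranked])
```

Nothing in this scorer compresses or summarises the messages, which is one plausible reason it holds up against systems that do.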

EvoMemBench: No One-Size-Fits-All

EvoMemBench (arXiv, May 2026) takes a taxonomic approach, organising memory evaluation along two axes: scope (in-episode vs. cross-episode) and content (knowledge-oriented vs. execution-oriented). It evaluates 15 memory methods across 6 datasets.

The systematic result: no single memory architecture dominates. Retrieval-augmented methods (BM25, embedding-based) win on knowledge-intensive tasks. Procedural long-term memory methods (ReasoningBank, AWM) win on execution-oriented tasks like tool use and embodied control. Short-term compression methods (MemAgent, MemoBrain) frequently underperform the no-memory baseline.

The GraphRAG entry in this benchmark is instructive: it achieves 74.2% on in-episode knowledge retrieval but collapses to 10.5% on cross-episode execution tasks. Graph-based community summaries help with fact recall but fail for action guidance.

STATE-Bench: Enterprise Procedural Memory

STATE-Bench (Microsoft, May 2026) returns to the practical question that motivated the whole space: does memory make agents better at their jobs? It covers three domains (travel, customer support, shopping) and 450 tasks with deterministic state assertions, and reports four metrics: task completion (pass@1, the single-run success rate, and pass^5, the share of tasks that succeed on all five runs), cost efficiency, UX score, and per-run consistency.

The baseline measurement with GPT-5.1 (no memory):

Domain	pass@1	pass^5
Travel	~45%	~30%
Customer Support	~50%	~35%
Shopping	~55%	~40%

The gap between pass@1 and pass^5 is the headline: agents are unreliable even on identical tasks. The memory-agnostic design lets researchers plug in any memory architecture and measure whether it closes the consistency gap.
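The difference between the two completion metrics is easiest to see on a small example. The sketch below assumes each task is run a fixed number of times and recorded as pass/fail; the function names and sample outcomes are hypothetical and are not taken from the STATE-Bench codebase.

```python
def pass_at_1(runs_per_task):
    """Mean single-run success rate, averaged over tasks."""
    per_task = [sum(runs) / len(runs) for runs in runs_per_task]
    return sum(per_task) / len(per_task)


def pass_hat_k(runs_per_task, k=5):
    """Share of tasks whose first k runs all succeeded.

    Assumption: every task has at least k recorded runs.
    """
    return sum(all(runs[:k]) for runs in runs_per_task) / len(runs_per_task)


# Hypothetical outcomes: four tasks, five runs each (True = run passed).
runs = [
    [True, True, True, True, True],
    [True, False, True, True, False],
    [False, False, True, False, False],
    [True, True, True, True, False],
]

print(round(pass_at_1(runs), 2), round(pass_hat_k(runs), 2))   # -> 0.65 0.25
```

Here a single run passes 65% of the time, yet only one task in four passes every time. A task that fails once in five runs barely moves pass@1 but counts as a full failure under pass^5.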

STATE-Bench also introduces an LLM-based user simulator with personality variants — one user is impatient and provides incomplete details, another is cooperative — forcing agents to actively gather information rather than assume.

The Synthesis: Five Benchmarks, One Pattern

Stepping back, a clear pattern emerges:

Causality matters more than similarity. AMA-Bench's ablation study shows that removing the causality graph cuts average accuracy by 24.6% in relative terms (0.57 to 0.43). Vector similarity alone is insufficient for agent memory.
Structure beats compression. GroupMemBench shows that BM25 — which preserves lexical structure — matches or beats learned memory systems that compress and lose information. GraphRAG's community detection, designed for document summarisation, flattens the thread-level structure multi-party memory depends on.
Recall ≠ functional memory. MemoryArena exposes the gap between recalling a fact (which agents can do) and using it to guide action across interdependent sessions (which they cannot). The correlation between QA accuracy and end-to-end success is high (Pearson 0.96–0.98), but absolute performance on both axes remains low.
No universal architecture exists. EvoMemBench shows that knowledge tasks favour retrieval-based methods, while execution tasks favour procedural memory. Any production system needs hybrid architectures.
The bar is lower than you think. When BM25 is competitive with purpose-built memory systems (GroupMemBench), and the best system achieves 57% accuracy (AMA-Bench), the field is clearly in its early stages.

Where Graphs Fit

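The graph connection is not incidental. Three of the five benchmarks bear on graph-structured memory: AMA-Bench supports it directly through ablation, while GroupMemBench and STATE-Bench describe problems whose structure is naturally a graph: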

AMA-Agent's causality graph is the single best-performing architecture at 57.22% accuracy. Directed edges between state nodes preserve the causal chains that flat embeddings erase.
GroupMemBench's multi-party structure is inherently a graph: who said what to whom, in which order, across which threads. Current systems flatten this structure during ingestion. Memory systems that preserve conversational thread graphs would likely outperform current approaches.
STATE-Bench's enterprise procedures have natural graph representations: policy check → eligibility validation → fee calculation → confirmation. Each step depends on previous state. A graph memory that tracks procedural state transitions could improve the consistency (pass^5) that STATE-Bench measures.
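The procedural case is the simplest of the three to sketch. The example below encodes the four-step chain named above as a small transition graph and refuses steps that skip ahead. The class, step identifiers, and error handling are assumptions for illustration; STATE-Bench itself is memory-agnostic and does not prescribe this design.

```python
# Allowed transitions for one hypothetical procedure.
PROCEDURE = {
    "policy_check": ["eligibility_validation"],
    "eligibility_validation": ["fee_calculation"],
    "fee_calculation": ["confirmation"],
    "confirmation": [],
}


class ProcedureMemory:
    """Tracks which step of a procedure the agent has reached."""

    def __init__(self, procedure, start):
        self.procedure = procedure
        self.current = start
        self.history = [start]

    def allowed_next(self):
        return list(self.procedure[self.current])

    def advance(self, step):
        """Move to `step`, or raise if the graph has no such edge."""
        if step not in self.procedure[self.current]:
            raise ValueError(f"{step!r} cannot follow {self.current!r}")
        self.current = step
        self.history.append(step)


memory = ProcedureMemory(PROCEDURE, "policy_check")
try:
    memory.advance("confirmation")        # skips two required steps
except ValueError as err:
    print("rejected:", err)

for step in ["eligibility_validation", "fee_calculation", "confirmation"]:
    memory.advance(step)
print(memory.history)
```

Because the allowed transitions are explicit data rather than something the model must re-derive each run, the same task follows the same path every time, which is the property pass^5 rewards.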

The counterpoint is equally important: GraphRAG's community detection, as evaluated in GroupMemBench and EvoMemBench, is the wrong kind of graph structure for agent memory. Document-level knowledge graphs optimised for global summarisation collapse on task-specific procedural memory. The graph needs to be at the right granularity: action-level causality graphs, not corpus-level topic clusters.

Gaps and Future Directions

Gap	Evidence	What's Missing
Cross-domain memory	Every benchmark scores tasks within a single domain	No benchmark measures whether an agent that learns travel policies can transfer procedural skill to shopping
Memory compaction under pressure	EvoMemBench varies context budgets (16K–128K)	No benchmark measures graceful degradation under extreme token pressure or during multi-day operation
Deterministic correctness	Most benchmarks rely on LLM judges or approximate metrics; STATE-Bench's deterministic state assertions are the exception	Production systems need verifiable assertions of that kind applied to memory content itself
Real-world deployment duration	Max episode length: ~128K tokens (AMA-Bench synthetic)	Production agents run for weeks; no benchmark tests memory drift over hundreds of sessions
Hybrid graph + vector memory	AMA-Agent uses both, but no benchmark compares architectures systematically	The graph vs. vector vs. hybrid question remains unanswered at benchmark scale

A Practical Decision Matrix

For teams building production agents today, the choice of memory architecture should depend on your failure mode:

Your Primary Failure Mode	Look At	Benchmark Evidence
Agent forgets facts across long conversations	Retrieval-augmented memory (BM25, embedding)	GroupMemBench: BM25 matches learned systems
Agent can't learn from past task execution	Procedural memory (ReasoningBank, AWM)	EvoMemBench: procedural methods win on execution
Agent is inconsistent on identical tasks	Deterministic state memory + pass^5 evaluation	STATE-Bench: pass^5 gap is the metric
Agent serves multiple users simultaneously	This is still unsolved — watch this space	GroupMemBench: best system at 46%
Agent needs causal reasoning across sessions	Causality graph memory (AMA-Agent approach)	AMA-Bench: removing the causality graph costs 24.6% relative accuracy (0.57 → 0.43)

The Takeaway

The 2026 agent memory benchmark wave has done the field a service: it has replaced intuition with measurement. The uncomfortable findings (that BM25 beats most learned systems, that removing the causality graph costs AMA-Agent 14 points of accuracy, about 25% in relative terms, and that multi-party memory is essentially unsolved) are more valuable than any single memory architecture.

For graph practitioners, the message is clear. Graph structures for agent memory need to be action-level causality graphs, not document-level knowledge graphs. Granularity, edge semantics, and temporal ordering are what separate AMA-Agent's 57% on AMA-Bench from GraphRAG's 20.6% on GroupMemBench, although the two figures come from different benchmarks and are not directly comparable.

The field is now waiting for a benchmark that combines all these dimensions: procedural, multi-session, multi-user, cross-domain, and measured over weeks of simulated operation. Until then, the five benchmarks provide the best evidence we have — and the evidence says current memory systems are not good enough.