Tags: Agentic AI, Agent Memory, LLM, Software Architecture, Mem0, Zep, Letta, LangMem, Knowledge Graph, RAG

Agent Memory in 2026: Mem0 vs Zep vs Letta vs LangMem

Architecture deep-dive of Mem0, Zep/Graphiti, Letta, and LangMem - what each optimises for, when to pick which, and which benchmark claims to ignore.

Figure: Architecture comparison diagram of the four agent memory frameworks in 2026 - Mem0 hybrid index, Zep/Graphiti temporal knowledge graph, Letta OS-style memory tiers, and LangMem KV plus vector store with hot path and background manager.

If you built an LLM agent in 2024, “memory” probably meant stuffing the last N turns into the prompt and praying the context window held. In 2026, that is no longer defensible. Agent memory is now a first-class architectural concern, with its own benchmark suite (LoCoMo, LongMemEval, BEAM) and a growing body of arXiv research. Four production frameworks now compete for the layer between your agent and its long-term state.

This post is for software architects evaluating those frameworks. I will walk through the four serious contenders - Mem0, Zep/Graphiti, Letta, and LangMem - and compare their architectures with minimal code. Then I will push back on the benchmark numbers everyone is quoting. By the end you should know which one fits which problem, and which vendor claims to ignore.



Why memory is suddenly an architectural problem

LLMs have a fixed context window. Even at 1M+ tokens, three things break:

  1. Cost - every token in context is paid for on every call.
  2. Latency - self-attention compute grows quadratically with context length, so long contexts mean slow responses.
  3. Recall - empirical “needle in a haystack” results show retrieval quality degrades well before the window fills.
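To make the cost point concrete, here is a back-of-the-envelope sketch. The figures (1,000 turns, 200 tokens per turn, a 1,500-token budget for retrieved memories) are illustrative assumptions, not measurements from any framework, and both function names are made up for this example.

```python
def tokens_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every call resends all prior turns."""
    # Call k carries k turns of history, so the total is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def tokens_with_memory(turns: int, tokens_per_turn: int, memory_budget: int) -> int:
    """Total input tokens when each call sends one turn plus a fixed memory budget."""
    return turns * (tokens_per_turn + memory_budget)


# Illustrative assumptions: 1,000 turns, 200 tokens each, 1,500 tokens of retrieved memory.
full = tokens_full_history(1000, 200)
bounded = tokens_with_memory(1000, 200, 1500)
print(full, bounded, round(full / bounded, 1))
# prints: 100100000 1700000 58.9
```

Resending the full history grows quadratically with the number of turns, while a fixed memory budget grows linearly. Under these assumptions the gap is roughly 59x.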

For a chatbot answering one-shot questions, RAG over a vector store is enough. Agents are different. When the agent interacts with a user across days, weeks, and thousands of turns - and is expected to remember, update, and reason over what it learned - flat document chunks are the wrong shape of state. You need state that:

  • Persists across sessions
  • Updates when facts change (“Kendra used to work at Acme. Now she’s at Stripe.”)
  • Supports retrieval by recency, relevance, and relationship
  • Can be reasoned over, not just retrieved from

This is the gap the memory frameworks fill. Most of them adopt a taxonomy borrowed from cognitive science, popularised by LangChain’s LangMem launch:

  • Semantic memory - facts (“user prefers dark mode”)
  • Episodic memory - past interactions (“last Tuesday the user asked about refunds”)
  • Procedural memory - behavioural rules (“when asked about pricing, always link to the calculator”)
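As a rough illustration of how the three kinds differ in use, the sketch below tags each record with its kind and renders them into different parts of a prompt. `MemoryKind`, `MemoryRecord`, and `MemoryBank` are hypothetical names for this post, not types from any of the four frameworks.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryKind(Enum):
    SEMANTIC = "semantic"      # facts about the user or the world
    EPISODIC = "episodic"      # records of past interactions
    PROCEDURAL = "procedural"  # rules for how the agent should behave


@dataclass
class MemoryRecord:
    kind: MemoryKind
    text: str


@dataclass
class MemoryBank:
    records: list[MemoryRecord] = field(default_factory=list)

    def add(self, kind: MemoryKind, text: str) -> None:
        self.records.append(MemoryRecord(kind, text))

    def of_kind(self, kind: MemoryKind) -> list[str]:
        return [r.text for r in self.records if r.kind is kind]

    def render_prompt(self) -> str:
        """Procedural rules become instructions; the other kinds become context."""
        def bullets(kind: MemoryKind) -> str:
            return "\n".join(f"- {t}" for t in self.of_kind(kind))

        return (
            f"Rules:\n{bullets(MemoryKind.PROCEDURAL)}\n\n"
            f"Known facts:\n{bullets(MemoryKind.SEMANTIC)}\n\n"
            f"Past interactions:\n{bullets(MemoryKind.EPISODIC)}"
        )
```

The point of the split is that procedural memory changes how the agent behaves, while semantic and episodic memory only change what it knows.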

Each framework implements these three differently - and that is where the architectural arguments below actually start. If you have already read my GitNexus deep-dive on knowledge graphs for coding agents, several of the ideas below - particularly Zep’s temporal model - will feel familiar.


Mem0 - the hybrid extractor

Paper: Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (Chhikara et al., April 2025)
Repo: mem0ai/mem0 - 57.2k stars, Apache-2.0
Backing: Y Combinator S24, $24M Series A

Mem0’s core idea is that memory should be extracted, not just stored. When a conversation happens, an LLM-driven pipeline pulls out salient facts (“user has a peanut allergy”), deduplicates them against existing memories, optionally updates the graph of relationships, and indexes the result across three stores: a vector index for semantic search, a graph for entity relationships, and a key-value store for fast structured lookups.
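The extract, deduplicate, index flow can be sketched in a few lines. This is a toy under loud assumptions, not Mem0's implementation: `extract_facts` splits sentences where Mem0 would call an LLM, Jaccard word overlap stands in for embedding similarity, and a single per-user list stands in for the vector, graph, and key-value stores. All names here are hypothetical.

```python
from dataclasses import dataclass, field


def extract_facts(message: str) -> list[str]:
    """Stand-in for the LLM extraction step: one candidate fact per sentence."""
    return [s.strip() for s in message.split(".") if s.strip()]


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, used here in place of embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class ToyMemory:
    facts: dict[str, list[str]] = field(default_factory=dict)  # user_id -> stored facts

    def add(self, message: str, user_id: str, threshold: float = 0.8) -> list[str]:
        """Extract facts, skip near-duplicates, store the rest. Returns what was stored."""
        stored = self.facts.setdefault(user_id, [])
        added = []
        for fact in extract_facts(message):
            if any(similarity(fact, old) >= threshold for old in stored):
                continue  # already known: deduplicate against existing memories
            stored.append(fact)
            added.append(fact)
        return added

    def search(self, query: str, user_id: str, limit: int = 3) -> list[str]:
        """Rank this user's facts by similarity to the query."""
        candidates = self.facts.get(user_id, [])
        return sorted(candidates, key=lambda f: similarity(query, f), reverse=True)[:limit]
```

Even in the toy version the important property is visible: what gets stored is the distilled fact, scoped to a user, not the raw transcript.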

The paper introduces a graph variant (Mem0g) that uses a directed labelled graph to capture relational structure. The graph layer moved from experimental to production-ready in early 2026 with Neo4j, Kuzu, Cassandra, and Valkey backends.

Minimal usage:

from mem0 import Memory

memory = Memory()

# Extract and store memories from a conversation
memory.add(
    [
        {"role": "user", "content": "I'm allergic to peanuts and I prefer Italian food."},
        {"role": "assistant", "content": "Got it - I'll avoid peanut dishes."},
    ],
    user_id="alice",
)

# Retrieve relevant memories for a new query
results = memory.search(query="What food should I order?", user_id="alice", limit=3)
for m in results["results"]:
    print(m["memory"])
# Example output (exact wording depends on the extraction model):
# → "Allergic to peanuts"
# → "Prefers Italian food"

When to choose Mem0

Mem0’s strengths are drop-in simplicity, multi-user/multi-tenant by default, a hosted platform option, and the largest community and integration footprint of the four (LangGraph, CrewAI, Vercel AI SDK, browser extension). If you are building a customer-facing assistant where each user has their own evolving profile, Mem0 will get you to a working prototype faster than the other three.

Trade-offs: The extraction pipeline runs an LLM call on every add. That is a real cost line. For high-volume ingestion you will want async batching, which Mem0 supports but does not make the default. If you are also routing inference through a semantic cache, the extractor calls are a separate cost line that the cache will not help with.
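To show why batching changes the cost line, here is a generic sketch of a buffer that calls the extractor once per batch instead of once per message. It is synchronous for brevity and does not use Mem0's API: `BatchingIngestor` and its `extract` callback are hypothetical, with the callback standing in for one LLM extraction call.

```python
from typing import Callable


class BatchingIngestor:
    """Buffers messages and hands them to an extractor once per batch."""

    def __init__(self, extract: Callable[[list[str]], None], batch_size: int = 10) -> None:
        self.extract = extract
        self.batch_size = batch_size
        self.buffer: list[str] = []
        self.calls = 0  # number of extractor invocations so far

    def add(self, message: str) -> None:
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this at the end of a session."""
        if not self.buffer:
            return
        self.extract(list(self.buffer))  # one (hypothetical) LLM call for the whole batch
        self.calls += 1
        self.buffer.clear()


batches: list[list[str]] = []
ingestor = BatchingIngestor(batches.append, batch_size=10)
for i in range(25):
    ingestor.add(f"message {i}")
ingestor.flush()
print(ingestor.calls)  # prints: 3
```

Twenty-five messages cost three extractor calls instead of twenty-five. The trade is freshness: a fact is not searchable until its batch has been flushed.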


Zep / Graphiti - the temporal knowledge graph

Paper: Zep: A Temporal Knowledge Graph Architecture for Agent Memory (Rasmussen et al., January 2025)
Repo: getzep/graphiti - 26.8k stars, Apache-2.0
Productisation: Zep (managed service) sits on Graphiti (open-source engine)

If Mem0 treats memory as a hybrid index, Zep treats it as a knowledge graph with time as a first-class dimension. Every fact stored in Graphiti has a bi-temporal validity window: when it was true in the world, and when it was true in the graph. When a contradicting fact arrives, the old fact is not deleted - it is invalidated, with the boundary recorded. You can query “what’s true now?” or “what was true on March 5th?”
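A minimal sketch of the invalidate-don't-delete idea follows. It is not Graphiti's code: `Fact` and `TemporalFacts` are hypothetical, the `created_at` / `expired_at` field names are my assumption for the "true in the graph" window, and the logic assumes facts arrive in chronological order with one value per subject and predicate.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # when it became true in the world
    created_at: datetime                   # when the graph learned it
    invalid_at: Optional[datetime] = None  # when it stopped being true in the world
    expired_at: Optional[datetime] = None  # when the graph stopped believing it


class TemporalFacts:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, predicate: str, obj: str,
                    valid_at: datetime, now: datetime) -> None:
        """Record a fact; invalidate (never delete) any contradicting open fact."""
        for old in self.facts:
            same_slot = (old.subject, old.predicate) == (subject, predicate)
            if same_slot and old.invalid_at is None and old.obj != obj:
                old.invalid_at = valid_at  # world-time boundary
                old.expired_at = now       # graph-time boundary
        self.facts.append(Fact(subject, predicate, obj, valid_at, created_at=now))

    def as_of(self, subject: str, predicate: str, when: datetime) -> Optional[str]:
        """Return the value that was true in the world at `when`, if any."""
        for f in self.facts:
            same_slot = (f.subject, f.predicate) == (subject, predicate)
            if same_slot and f.valid_at <= when and (f.invalid_at is None or when < f.invalid_at):
                return f.obj
        return None


kg = TemporalFacts()
kg.assert_fact("Kendra", "works_at", "Acme", datetime(2024, 1, 10), now=datetime(2024, 1, 10))
kg.assert_fact("Kendra", "works_at", "Stripe", datetime(2026, 4, 1), now=datetime(2026, 4, 3))
print(kg.as_of("Kendra", "works_at", datetime(2026, 3, 5)))  # prints: Acme
print(kg.as_of("Kendra", "works_at", datetime(2026, 5, 1)))  # prints: Stripe
```

The Acme fact is still in the store after the update. Only its validity window has been closed, which is what makes the historical query answerable.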

The architecture has three layers:

  1. Episode subgraph - raw events/messages, timestamped, the source of truth
  2. Semantic entity subgraph - entities and relationships extracted from episodes
  3. Community subgraph - higher-level clusters and summaries

Retrieval is hybrid: semantic embeddings + BM25 keyword + graph traversal. Zep - the managed service built on Graphiti - advertises sub-200ms retrieval at scale. Graphiti itself leaves performance to your implementation, and published benchmarks (including Zep’s own corrected numbers) land in the 600–800ms p95 range for the OSS engine under realistic load.
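One common way to merge ranked lists from different retrievers is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, offered as an assumption about how such results could be combined rather than a description of Graphiti's internals. The three input rankings are hard-coded stand-ins for embedding, BM25, and graph-traversal results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest score first; ties broken by item id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


semantic = ["f2", "f1", "f3"]  # stand-in for embedding search results
keyword = ["f1", "f4", "f2"]   # stand-in for BM25 results
graph = ["f1", "f2", "f5"]     # stand-in for graph traversal results

fused = reciprocal_rank_fusion([semantic, keyword, graph])
print(fused)  # prints: ['f1', 'f2', 'f4', 'f3', 'f5']
```

Facts that several retrievers agree on rise to the top, without needing the raw scores from each retriever to be on a comparable scale.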

Minimal usage with Graphiti:

import asyncio
from datetime import datetime, timezone

from graphiti_core import Graphiti


async def main():
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")
    try:
        await graphiti.build_indices_and_constraints()

        # Add an episode - Graphiti extracts entities, relationships, and timestamps
        await graphiti.add_episode(
            name="User preference update",
            episode_body="Alice now works at Stripe; she left Acme last month.",
            source_description="chat log",
            reference_time=datetime.now(timezone.utc),  # timezone-aware timestamp
        )

        # Hybrid search - returns facts with their validity windows
        results = await graphiti.search("Where does Alice work?")
        for edge in results:
            print(edge.fact, edge.valid_at, edge.invalid_at)
    finally:
        await graphiti.close()


# The API is async, so the calls need to run inside an event loop
asyncio.run(main())

When to choose Zep / Graphiti

Anywhere the truth changes over time. Customer support agents tracking case states, sales agents tracking deal stages, healthcare agents tracking patient timelines. The temporal model is the right abstraction when “yesterday’s fact” is materially different from “today’s fact.”

Trade-offs: A knowledge graph is more operational overhead than a vector store. You need Neo4j, FalkorDB, Kuzu, or Neptune running. Ingestion is LLM-heavy - every episode triggers entity extraction and relationship inference, and Graphiti defaults SEMAPHORE_LIMIT=10 to avoid provider rate-limit storms.
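The concurrency cap can be illustrated with a plain `asyncio.Semaphore`. This is a generic sketch of that pattern and not Graphiti's code: `ingest_all` is a hypothetical helper, and `asyncio.sleep(0)` stands in for the LLM extraction call.

```python
import asyncio


async def ingest_all(episodes: list[str], limit: int = 10) -> int:
    """Ingest episodes with at most `limit` in flight; returns the peak concurrency seen."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def ingest(episode: str) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # blocks once `limit` ingestions are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the LLM extraction call
            in_flight -= 1

    await asyncio.gather(*(ingest(e) for e in episodes))
    return peak


peak = asyncio.run(ingest_all([f"episode {i}" for i in range(50)], limit=10))
print(peak <= 10)  # prints: True
```

Raising the limit speeds up bulk ingestion until the LLM provider starts rate-limiting, at which point retries erase the gain.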


Letta (formerly MemGPT) - the OS-inspired memory hierarchy

Paper: MemGPT: Towards LLMs as Operating Systems (Packer et al., UC Berkeley, October 2023)
Repo: letta-ai/letta - 23.1k stars, Apache-2.0

Heads up: several 2026 comparison blogs cite Letta at 13k stars. The current count is 23.1k - verify the GitHub badge yourself before quoting any star count from a write-up older than three months.

Letta’s mental model is OS-level: the context window is RAM, an archival vector store is disk, and an agent has tools to move data between them. The original MemGPT paper formalised this as a three-tier hierarchy:

  1. Core memory - small, always in-context, edited by the agent itself (memory_replace, memory_insert)
  2. Recall memory - searchable conversation history (recent chat log on disk)
  3. Archival memory - long-term vector store (archival_memory_search, archival_memory_insert)

The defining architectural choice is agent-as-memory-manager. The agent decides what to commit to core memory, when to page out, when to query archival. This makes Letta the most flexible of the four for self-improving behaviour - the agent can refine its own persona, learn user preferences, and rewrite its working memory based on what it sees.
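The tiering can be sketched as a small class whose methods play the role of the agent's memory tools. The method names echo the tool names listed above, but the class itself (`TieredMemory`), the character-based budget, and the keyword search are my simplifications, not Letta's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    core_limit: int = 200                               # character budget for in-context memory
    core: dict[str, str] = field(default_factory=dict)  # label -> block text, always in the prompt
    recall: list[str] = field(default_factory=list)     # full conversation history
    archival: list[str] = field(default_factory=list)   # long-term store (a real system would embed these)

    def core_size(self) -> int:
        return sum(len(block) for block in self.core.values())

    def memory_insert(self, label: str, text: str) -> bool:
        """Append to a core block; refuse if that would exceed the in-context budget."""
        if self.core_size() + len(text) + 1 > self.core_limit:
            return False
        self.core[label] = (self.core.get(label, "") + "\n" + text).strip()
        return True

    def memory_replace(self, label: str, old: str, new: str) -> bool:
        """Rewrite part of a core block in place; False if `old` is not present."""
        block = self.core.get(label, "")
        if old not in block:
            return False
        self.core[label] = block.replace(old, new)
        return True

    def archival_memory_insert(self, text: str) -> None:
        self.archival.append(text)

    def archival_memory_search(self, query: str, limit: int = 3) -> list[str]:
        """Naive keyword match standing in for vector search."""
        words = set(query.lower().split())
        return [t for t in self.archival if words & set(t.lower().split())][:limit]
```

When `memory_insert` returns `False`, the agent has to decide what to do: page the new fact out to archival, or rewrite an existing core block to make room. That decision being the agent's, not the framework's, is the whole design.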

Minimal usage:

import os
from letta_client import Letta

client = Letta(api_key=os.getenv("LETTA_API_KEY"))

agent = client.agents.create(
    model="anthropic/claude-sonnet-4-6",
    memory_blocks=[
        {"label": "human", "value": "Name: Alice. Role: software architect."},
        {"label": "persona", "value": "I am a helpful coding assistant."},
    ],
    tools=["web_search", "archival_memory_search"],
)

response = client.agents.messages.create(
    agent_id=agent.id,
    input="Remember that I prefer Go over Python for backend work.",
)
# The agent will call memory_insert or memory_replace internally.

When to choose Letta

Letta is the right pick for stateful agents that need to evolve their own behaviour over time - coding assistants, personalised tutors, long-running research agents. It is the closest thing to “give the agent a self-editable scratchpad and let it figure out what to save.”

Trade-offs: Letta is a platform, not a library. You run it as a service (Docker Compose, Postgres backend, OpenTelemetry traces) and your agents live behind a REST API. That is the right model for production stateful agents, but it is more infrastructure than a pip install library.


LangMem - the LangGraph-native SDK

Repo: langchain-ai/langmem - 1.5k stars, MIT
Launch post: LangMem SDK for agent long-term memory

LangMem makes a different bet from the other three: skip the standalone service, ship primitives that plug into LangGraph’s existing storage layer (BaseStore, with backends for Postgres, MongoDB, Redis, in-memory). It covers the three-type taxonomy conceptually - semantic, episodic, procedural - with semantic and procedural well-supported by opinionated utilities; episodic memory is acknowledged as a future direction in the launch post. It exposes two operating modes:

  • Hot path - the agent calls manage_memory and search_memory tools mid-conversation
  • Background - a separate manager process extracts and consolidates memories after the fact

The unique architectural choice is procedural memory - the agent can update its own system prompt based on user feedback. That is a different concept from semantic or episodic memory and not well-supported by Mem0, Zep, or Letta out of the box.
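To make the idea concrete, here is a deliberately simple sketch of procedural memory with both write paths. In LangMem the prompt update is LLM-driven; here a rule is just appended as text. `ProceduralMemory` and its methods are hypothetical names, not LangMem's API.

```python
from dataclasses import dataclass, field


@dataclass
class ProceduralMemory:
    base_prompt: str
    rules: list[str] = field(default_factory=list)    # learned behavioural rules
    pending: list[str] = field(default_factory=list)  # feedback awaiting background consolidation

    def learn_now(self, rule: str) -> None:
        """Hot path: record a rule mid-conversation so it applies to the next turn."""
        if rule not in self.rules:
            self.rules.append(rule)

    def queue_feedback(self, rule: str) -> None:
        """Background path: defer the write so the user-facing turn stays fast."""
        self.pending.append(rule)

    def consolidate(self) -> int:
        """Run after the conversation: merge queued feedback. Returns how many rules were new."""
        before = len(self.rules)
        for rule in self.pending:
            self.learn_now(rule)
        self.pending.clear()
        return len(self.rules) - before

    def system_prompt(self) -> str:
        """The prompt the agent actually runs with: base text plus learned rules."""
        if not self.rules:
            return self.base_prompt
        learned = "\n".join(f"- {r}" for r in self.rules)
        return f"{self.base_prompt}\n\nLearned rules:\n{learned}"
```

The hot path costs latency on the current turn but takes effect immediately. The background path keeps the turn fast and can deduplicate across a whole conversation, at the price of the rule arriving late.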

Minimal usage:

from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore
from langmem import create_manage_memory_tool, create_search_memory_tool

store = InMemoryStore(index={"dims": 1536, "embed": "openai:text-embedding-3-small"})

agent = create_react_agent(
    "anthropic:claude-sonnet-4-6",
    tools=[
        create_manage_memory_tool(namespace=("memories",)),
        create_search_memory_tool(namespace=("memories",)),
    ],
    store=store,
)

agent.invoke({"messages": [{"role": "user", "content": "I prefer dark mode."}]})

When to choose LangMem

LangMem is the right pick if you are already on LangGraph, you want memory without a new dependency, and you care about procedural memory (agent self-editing its prompt). Treat memory as a feature of your agent runtime, not a separate service.

Trade-offs: LangMem is not a benchmark contender. It does not publish LoCoMo or LongMemEval scores. Its scope is deliberately narrower - primitives, not a memory service. The 1.5k stars reflect that it lives inside the larger LangChain ecosystem, not that it is struggling.


The benchmark wars - and why you should ignore most of them

Every framework cites benchmark numbers. Most of them are misleading. Here is what is actually going on.

Zep’s paper (2501.13956) claims 94.8% on DMR vs MemGPT’s 93.4%, and up to 18.5% accuracy improvement on LongMemEval. DMR was created by the MemGPT team, so beating MemGPT on its own benchmark carries weight - but DMR is also small (500 questions) and prone to ceiling effects.

Mem0’s paper (2504.19413) reports a 26% improvement over OpenAI memory on LoCoMo LLM-as-judge, 91% lower p95 latency, and 90% lower token cost. It reports Zep at 65.99% as the headline number, later corrected down to 58.44% after Mem0 excluded adversarial question categories the benchmark spec disallows.

Zep publicly rebutted with a corrected Zep score of 75.14%, alleging misconfiguration in Mem0’s evaluation.

Independent audits of LoCoMo itself have flagged methodology problems - adversarial category labelling, distribution issues across question types - that affect interpretation of any score on it.

The architect’s takeaway: self-reported memory benchmarks are not load-bearing. Use them to confirm a framework is not grossly broken, not to choose between them. The real selection criteria are:

| Criterion | Mem0 | Zep/Graphiti | Letta | LangMem |
|---|---|---|---|---|
| Storage model | Hybrid (vector + graph + KV) | Temporal knowledge graph | OS-style memory tiers | KV + vector via LangGraph store |
| Self-editing agent | No | No | Yes (core memory tools) | Partial (procedural memory) |
| Temporal queries | Limited (graph layer) | First-class, bi-temporal | No | No |
| Deployment | Library or hosted | Library or managed service | Service (Docker + Postgres) | Library only |
| Best fit for | Per-user profiles, multi-tenant chatbots | Anything where facts change over time | Long-running self-improving agents | Agents already on LangGraph |
| Star count (May 2026) | 57.2k | 26.8k (Graphiti) | 23.1k | 1.5k |

A decision framework

Strip away the marketing and the question becomes: what shape does your state actually take?

  • If state is a flat profile per user that grows over time and you want it to just work - Mem0.
  • If state has time as an axis and facts get superseded - Zep / Graphiti. A customer support agent that needs to know “what is the current status of case X” is the prototypical fit.
  • If the agent itself should learn how to behave by editing its own working memory - Letta. This is the right model for coding agents, research agents, or anything you would describe as “an assistant that gets better at being yours.”
  • If you are on LangGraph and do not want a new service in your stack - LangMem. Especially if procedural memory (agent edits its own system prompt) is a feature you need.

A hybrid is also legitimate. The most ambitious 2026 deployments I have seen use Letta for the agent runtime and Graphiti as the long-term knowledge graph behind it - letting the agent self-manage core memory while pushing structured facts into a temporal graph for cross-session reasoning.
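The rules above can be written down as a small function. The precedence order is my own assumption, since real projects rarely answer "yes" to exactly one question, and `pick_framework` with its parameter names is hypothetical.

```python
def pick_framework(
    *,
    facts_change_over_time: bool = False,
    agent_edits_own_memory: bool = False,
    on_langgraph_no_new_service: bool = False,
) -> str:
    """Map the shape of your state to a starting point. Precedence order is an assumption."""
    if facts_change_over_time and agent_edits_own_memory:
        return "Letta + Graphiti"  # the hybrid: self-managed core memory plus a temporal graph
    if facts_change_over_time:
        return "Zep / Graphiti"
    if agent_edits_own_memory:
        return "Letta"
    if on_langgraph_no_new_service:
        return "LangMem"
    return "Mem0"  # flat per-user profile that should just work


print(pick_framework(facts_change_over_time=True))  # prints: Zep / Graphiti
print(pick_framework())                             # prints: Mem0
```

Treat the output as a first candidate to prototype with, not a verdict.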


Closing

The right architecture question is not “which framework wins LoCoMo?” - it is “what does my agent need to remember, and how does that memory need to change?” Pick the framework whose abstractions match that shape, and treat the benchmark numbers as background noise.

If you want to go deeper, the four papers above are the canonical starting points. Mem0’s State of Agent Memory 2026 and Atlan’s framework comparison are both useful surveys, with the caveat that everyone in this space has an angle.

