Semantic Cache · pgvector · Vector Search · System Design · Backend Engineering · LLM · Performance

Semantic Caching in Production: Beyond the Hype

How I built a 4-layer semantic cache for e-commerce search - two-zone thresholds, adaptive TTL, Kafka invalidation, and six production pitfalls nobody documents.

In my 13 years of building enterprise applications, caching has always been a controversial character. Sometimes it dons the superhero cape and transforms performance; other times, thanks to poor design, it becomes a genuine pain in the backside, serving stale content with full confidence. Semantic caching is no different, except that the problem it solves is subtler and the traps it sets are less obvious.

This post is not a tutorial on how to plug in a library and call it done. It is a detailed account of how I built a production-grade semantic cache for an e-commerce search system, the architectural decisions behind it, and the pitfalls I encountered that most write-ups conveniently skip. The lessons here apply whether you are caching search results, LLM API responses, or any embedding-driven computation that is expensive to repeat.



The Problem with Exact-Match Caching

Traditional caches are the go-to for deterministic systems: the same input string always produces the same result from your backend logic, so instead of overloading your backend server, you cache the response and serve it back. You are happy, your clients are happier.

An e-commerce search query or a FAQ chatbot conversation is different.

A user searching for “wireless headphones under 100”, another typing “budget bluetooth headphones”, and a third asking for “affordable over-ear wireless” are expressing the same intent. An exact-match cache treats them as three completely separate requests. Three trips to the database. Three embedding computations. Three result sets assembled from scratch, likely returning near-identical results.

At low traffic, this is wasteful. At scale, it is expensive. In LLM-backed systems where each query triggers an API call to a model like GPT-5.4 or Claude Opus 4.7, the cost compounds fast and your finance team may ask for your head when they see the bill. Research and production data suggest that 20–40% of queries in typical applications are semantically equivalent to something already answered, even when the surface text differs.

What is the answer to this? Clever engineering: capture the intent of the user rather than the literal text. Semantic caching does exactly that by operating on meaning rather than identity. Instead of asking “is this string identical to something I’ve seen?”, it asks “is this semantically similar enough to a cached result that the answer would be the same?”

The shift sounds simple. The implementation is not.


What Semantic Cache Actually Is

At its core, a semantic cache is a key-value store where keys are embedding vectors and lookups are performed via approximate nearest-neighbour (ANN) search rather than hash equality.

The workflow is:

  1. Incoming query is converted to an embedding vector
  2. The cache is searched for stored vectors within a cosine similarity threshold
  3. If a sufficiently similar vector is found, the cached result is returned
  4. If not, the full computation runs, its result is stored with the query’s embedding, and future similar queries hit the cache instead

In an LLM context, step 4 - the computation a cache hit lets you skip - is an API call, measured in latency, cost, and token consumption. In a search context like mine, it is an embedding computation plus a full vector similarity scan over a product catalogue - still significant when you are processing thousands of queries per minute.
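
Reduced to code, the whole loop fits in a screenful. The sketch below is deliberately naive - a brute-force cosine scan over an in-memory list stands in for a real ANN index, and embed() and run_full_search() are whatever your stack provides:

import numpy as np

SIMILARITY_THRESHOLD = 0.88   # placeholder - calibrate against your own workload
_cache: list[tuple[np.ndarray, object]] = []   # (embedding, result) pairs

def cached_search(query: str, embed, run_full_search):
    query_vec = embed(query)                                   # step 1: embed

    # step 2: nearest stored vector by cosine similarity (brute force here)
    best_score, best_result = -1.0, None
    for vec, result in _cache:
        score = float(np.dot(query_vec, vec) /
                      (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if score > best_score:
            best_score, best_result = score, result

    if best_score >= SIMILARITY_THRESHOLD:                     # step 3: cache hit
        return best_result

    result = run_full_search(query)                            # step 4: miss path
    _cache.append((query_vec, result))
    return result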

One rule worth burning into memory for LLM-backed systems: if your cache lookup (embedding generation + vector scan) cannot complete in under ~50ms, you are eroding the latency benefit you were trying to gain. For a search pipeline like mine, the target is simpler - the cache hit must be cheaper than your full search execution. Measure your baseline first, then set your budget.

The next few sections cover how I implemented this on my e-commerce search service. If you are building an agent and want to cut LLM API costs, the same principles apply - the only difference is what sits behind the cache miss.


The Architecture: A 4-Layer Cache Pipeline

The design I landed on for my e-commerce search service treats caching as a tiered pipeline, not a single lookup. Each layer catches a different category of request, ordered from cheapest to most expensive to evaluate.

Incoming Search Query


┌───────────────────┐
│   L1: In-Process  │  ← Hot queries, monotonic TTL (120s hard wall)
│   (Python dict)   │    Sub-millisecond lookup
└────────┬──────────┘
         │ miss

┌───────────────────┐
│ L2: PostgreSQL    │  ← Exact query_text match, adaptive TTL
│  (query_cache)    │    Single UPDATE…RETURNING, hit_count tracked
└────────┬──────────┘
         │ miss

┌───────────────────┐
│  Phase 2: Lexical │  ← pg_trgm similarity ≥ 0.76
│  Near-Hit         │    Catches typos and minor variations
└────────┬──────────┘
         │ miss

┌───────────────────┐
│ Phase 3: Semantic │  ← pgvector HNSW cosine similarity
│  Scoring          │    Two-zone threshold decision
└────────┬──────────┘
         │ miss

┌───────────────────┐
│  Full Search      │  ← Embedding + ANN scan + result assembly
│  Execution        │    Result stored back into L2
└───────────────────┘

Each layer has a specific role:

L1 (In-Process) is a Python dictionary held in application memory. It handles the absolute hottest queries - the ones that repeat within the same process within a two-minute window. Lookup is sub-millisecond. It has both a monotonic TTL (120 seconds) and a hard wall-clock expiry, so stale entries cannot survive indefinitely inside long-running processes. On a PostgreSQL NOTIFY event triggered by a product update, affected entries are evicted immediately via an O(n) scan.

L2 (PostgreSQL exact match) uses a query_cache table in PostgreSQL. When a query arrives that exactly matches a previously cached query text, a single UPDATE…RETURNING increments the hit count and returns the result. Crucially, the TTL for this entry adapts based on hit frequency - more on that shortly.

Phase 2 (Lexical near-hit) uses PostgreSQL’s pg_trgm extension for trigram similarity. This catches surface-level variations: typos, spacing differences, minor word reorderings. A threshold of 0.76 trigram similarity is conservative enough to avoid false positives while still catching the obvious near-misses.

Phase 3 (Semantic scoring) is where pgvector enters. This is the full semantic cache lookup - embedding the query, searching the HNSW index with cosine distance, and making a confidence-zoned decision about whether to serve a cached result.

Why the phased approach? You do not need to build everything in one go. Start with a simple L1 and L2 setup, analyse the results, and understand your query patterns. Move to the next phase only when the data says you need it. And when you do add semantic caching, run it in “shadow mode” first: capture the metrics your semantic layer would have produced, but do not serve the cached data to your users yet. A skeleton of the full tiered lookup is sketched below.
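
Here is that skeleton. Each layer is a function that returns None on a miss; the function names are illustrative, not the actual service code:

def lookup(query: str):
    # L1: in-process dict - sub-millisecond, 120 s monotonic TTL
    result = l1_get(query)
    if result is not None:
        return result

    # L2, Phase 2, Phase 3 - cheapest to most expensive; each returns None on a miss
    for layer in (l2_exact_match, lexical_near_hit, semantic_hit):
        result = layer(query)
        if result is not None:
            l1_put(query, result)      # promote the hit into the in-process layer
            return result

    # Missed everywhere: run the full search and store the result back into L2
    result = full_search(query)
    l2_store(query, result)
    return result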


The Threshold Dilemma - and Why a Single Number Is Never Enough

Cosine similarity thresholds are where most semantic cache implementations either over-engineer or under-think. The naive approach is to pick a single number - say, 0.90 - and apply it everywhere. This fails in practice for a simple reason: different query types have fundamentally different tolerances for “close enough.”

Consider the difference between:

  • A user searching “red running shoes” vs “red trail shoes” - these look semantically close (cosine similarity might be ~0.87), but the intent and result set diverge meaningfully
  • A user searching “Sony WH-1000XM5” vs “Sony WH1000XM5” (missing hyphen) - trivially different, yet identical intent

A single global threshold cannot satisfy both cases simultaneously. Set it tight and the typo case misses; set it loose and the shoe case returns the wrong results.

My implementation uses a two-zone acceptance model:

SEMANTIC_STRONG_ACCEPT = 0.88   # High confidence - serve cache directly
SEMANTIC_BORDERLINE_LOW = 0.84  # Borderline - apply additional intent checks
# Below 0.84: hard reject, full search executes

In the strong accept zone (≥ 0.88), similarity is high enough that the result is served without further validation. Independent research on semantic caching for product search workloads corroborates this range - a threshold of 0.88 optimises for recall while maintaining acceptable precision for e-commerce queries.

In the borderline zone (0.84–0.88), similarity is meaningful but not conclusive. Before serving the cache, three additional intent conflict checks run:

  • _has_price_intent_conflict() - rejects if one query signals budget intent and the cached query signals premium (e.g., “cheap” vs “luxury”)
  • _has_brand_conflict() - rejects if the queries reference mutually exclusive brands
  • _has_category_conflict() - rejects if the queries belong to different product categories

Only if all three checks pass is the cached result served. If any conflict is detected, the full search pipeline executes. This adds a little latency, but it ensures you do not serve your users the wrong results.

Below 0.84 is a hard reject. The semantic distance is wide enough that serving cached results is not worth the risk of returning irrelevant or misleading output.
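
Put together, the two zones and the conflict checks make the decision a short, readable function. The helper predicates are the ones named above; the shape of the match object is an assumption:

SEMANTIC_STRONG_ACCEPT = 0.88
SEMANTIC_BORDERLINE_LOW = 0.84

def accept_semantic_hit(query: str, match) -> bool:
    # match carries .similarity and .cached_query from the HNSW lookup
    if match.similarity >= SEMANTIC_STRONG_ACCEPT:
        return True                               # strong accept: serve directly

    if match.similarity >= SEMANTIC_BORDERLINE_LOW:
        # Borderline zone: all three intent checks must pass before serving
        return not (
            _has_price_intent_conflict(query, match.cached_query)
            or _has_brand_conflict(query, match.cached_query)
            or _has_category_conflict(query, match.cached_query)
        )

    return False                                  # hard reject below 0.84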

The practical implication for anyone building this: do not pick a threshold from a blog post (including this one) and deploy it directly. The right approach is to run in shadow mode (an approach so good I had to mention it twice) first - compute similarity scores and log would-be cache hits without serving them. After roughly two weeks of real traffic, sample the logged pairs and manually evaluate whether the hits are genuinely equivalent. Let your workload calibrate the threshold, not a default.

A useful production signal once you go live: track your corrected cache rate - the fraction of cache hits that were subsequently identified as wrong (through user behaviour signals, explicit corrections, or downstream error rates). If this exceeds 1%, your threshold is too aggressive.


The Six Pitfalls Nobody Puts in the README

1. The Hit Rate Illusion

Vendor benchmarks frequently cite semantic cache hit rates of 80–95%. These numbers are almost always measuring match accuracy - how often the cache correctly identifies a similar query when one exists - not frequency - how often real production traffic actually finds a cache hit.

In practice, production hit rates vary dramatically by workload. FAQ-style queries can see 40–60%. Complex or long-tail product searches drop to 5–25%. Agentic or multi-step LLM workflows are lower still. Do not plan infrastructure or cost models around vendor benchmarks. Run shadow mode, measure your own workload, and set realistic expectations.

2. Cache Poisoning

Cache poisoning is the failure mode that bit me hardest in testing - and it is the one I see teams dismiss until it happens to them in production. If a bad result gets cached - whether because the search pipeline returned a poor result, or because an LLM hallucinated - every semantically similar future query will be served that bad result with full confidence.

In an LLM context, a hallucinated response gets cached and then propagated to every user who asks a related question. In a search context, a miscategorised product result or a stale inventory item gets served repeatedly.

Mitigations:

  • Never cache results below a confidence floor. If the underlying search returned low-confidence or no results, do not store the empty or low-quality result. My implementation skips caching for queries below MIN_CACHEABLE_QUERY_CHARS = 3 and for result sets with low relevance scores.
  • Implement feedback-driven invalidation. If downstream signals (bounce rate, click-through, explicit negative feedback) indicate a result was poor, invalidate the cache entry immediately rather than waiting for TTL expiry.
  • Log what you cache. This seems obvious, but many implementations cache silently. You cannot diagnose poisoning you cannot observe.
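
The first mitigation can be as simple as a guard in the cache write path. A minimal sketch - MIN_CACHEABLE_QUERY_CHARS comes from my implementation, but the relevance floor and function shape are illustrative:

MIN_CACHEABLE_QUERY_CHARS = 3
MIN_CACHEABLE_RELEVANCE = 0.5     # illustrative floor - tune to your own scoring scale

def should_cache(query: str, results: list, top_relevance: float) -> bool:
    if len(query.strip()) < MIN_CACHEABLE_QUERY_CHARS:
        return False                  # too short to carry reliable intent
    if not results:
        return False                  # never cache an empty result set
    if top_relevance < MIN_CACHEABLE_RELEVANCE:
        return False                  # low-confidence results poison future hits
    return True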

3. The Stale Data Problem

A Time-To-Live (TTL) expiry is the standard approach to cache freshness, and it works adequately for static content. It fails for data that changes on business timescales rather than clock timescales - product pricing, inventory status, promotional eligibility.

A cached result for “Nike Air Max size 10” becomes stale the moment that product goes out of stock, its price changes, or it enters a flash sale. A fixed TTL of one hour means users may see up to one hour of incorrect results. For an e-commerce platform, that can directly translate to failed purchases and customer complaints.

The solution in my pipeline is to decouple TTL from time entirely for high-velocity data, using event-driven invalidation instead.

4. Model Lock-In

When you store an embedding in your cache, you are storing a vector in the vector space of the model that generated it. If you upgrade your embedding model - which you will, because models improve - those stored vectors are no longer valid. The new model generates vectors in a different semantic space, so cosine similarity comparisons between old embeddings (generated by model A) and new query embeddings (generated by model B) are meaningless.

This is not a hypothetical concern. My own upgrade from all-MiniLM-L6-v2 (384 dimensions) to BAAI/bge-base-en-v1.5 (768 dimensions, 109M parameters, BERT-base architecture) required a full cache invalidation and rebuild. Every stored embedding had to be discarded and recomputed. On a large catalogue, this is a significant operational event.

Mitigations:

  • Version your embeddings. Store the model identifier alongside each embedding in the cache. This enables you to selectively invalidate only the entries generated by the old model rather than purging everything indiscriminately.
  • Plan for rebuilds. Cache warming after a model upgrade requires re-embedding every stored query. For a large production system, do this incrementally and in the background - do not wait for users to repopulate the cache through natural traffic.
  • Evaluate before you migrate. Run both models in parallel on a sample of your query traffic and compare result quality before committing to a migration.

5. Streaming and Multi-Turn Context

Two scenarios where semantic caching reliably breaks UX and correctness:

Streaming responses (LLM context): Most modern LLM interfaces stream tokens as they are generated. Semantic caches store complete responses and return them instantly - which means a cached hit will display the full result immediately rather than streaming it character by character. Depending on your UI, this either looks broken or feels jarring. Either handle this at the presentation layer by simulating a stream from the cached response, or disable semantic caching for streaming-first interfaces.

Multi-turn conversations: Semantic similarity is computed on the current query in isolation. In a multi-turn conversation, the same query text can mean entirely different things depending on context. “What about the red one?” is meaningless without knowing what was discussed three turns ago. A semantic cache that does not account for conversation context will happily serve cached responses to questions that have completely different answers depending on conversational state. The safest approach is to disable semantic caching for multi-turn flows, or incorporate a conversation context hash into the cache key.
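
If you do want semantic caching in a multi-turn flow, the second option - folding conversation context into the key - can look like this. The window size and hash scheme are assumptions; the point is that “What about the red one?” only ever matches a cache entry produced under the same context:

import hashlib

CONTEXT_WINDOW = 3   # how many previous turns participate in the cache key

def context_cache_key(query: str, history: list[str]) -> tuple[str, str]:
    # Similarity search is then scoped to entries sharing the same context hash,
    # so identical queries under different conversational states never collide.
    recent = "\n".join(history[-CONTEXT_WINDOW:])
    context_hash = hashlib.sha256(recent.encode("utf-8")).hexdigest()[:16]
    return context_hash, query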

6. Infrastructure Complexity vs. Actual Savings

This is the honest one. Instrumentation, monitoring, threshold tuning, model-upgrade handling, invalidation logic, and a vector index managed at scale all cost engineering time. Before committing to any of it, ask honestly: does my LLM or compute bill actually justify this investment?

For an application spending tens of thousands of dollars monthly on LLM API calls, semantic caching with even a 15% hit rate can justify the overhead quickly. For an application processing a few hundred queries per day, the engineering time to build and maintain this correctly likely costs more than the compute you save.

Run the numbers before you build.


Freshness Without Fear: Adaptive TTL and Event-Driven Invalidation

Standard TTL-based expiry treats all cache entries equally - every entry expires after the same fixed interval, regardless of how frequently it is accessed or how volatile its underlying data is. This is a blunt instrument.

My implementation uses a three-state freshness model:

  • ACTIVE - now < soft_expires_at - served normally
  • SOFT_EXPIRED - soft_expires_at ≤ now < hard_expires_at - served immediately, but a background refresh is triggered asynchronously to recompute and update the cached result
  • HARD_EXPIRED - now ≥ hard_expires_at - never served; full search execution is forced

The default soft TTL is 600 seconds (10 minutes). The hard TTL is 3600 seconds (60 minutes). These are not fixed - the soft TTL extends adaptively based on how frequently an entry is accessed:

from math import log10

def _adaptive_soft_ttl(hit_count: int) -> int:
    # 1 hit -> 1.0x, 10 hits -> 2.0x, 1,000 hits -> 4.0x the 600 s base soft TTL
    multiplier = 1.0 + log10(max(hit_count, 1))
    return int(min(L2_SOFT_TTL * multiplier, ADAPTIVE_TTL_CEILING))

A cold entry with one hit stays at 600 seconds. A hot entry with 1,000 hits sees its soft window extend to approximately 2,400 seconds. This means high-traffic queries stay fresh longer and incur fewer background refreshes, while low-traffic queries cycle out quickly and do not occupy cache space indefinitely.
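
Combining the adaptive soft window with the fixed hard wall, the freshness check at lookup time reduces to a small pure function. The field names are illustrative; the state strings match the three-state model above:

from datetime import datetime, timedelta, timezone

L2_HARD_TTL = 3600   # seconds

def freshness_state(cached_at: datetime, hit_count: int, now: datetime | None = None) -> str:
    now = now or datetime.now(timezone.utc)
    soft_deadline = cached_at + timedelta(seconds=_adaptive_soft_ttl(hit_count))
    hard_deadline = cached_at + timedelta(seconds=L2_HARD_TTL)

    if now < soft_deadline:
        return "ACTIVE"          # serve normally
    if now < hard_deadline:
        return "SOFT_EXPIRED"    # serve, but trigger an async background refresh
    return "HARD_EXPIRED"        # never serve; force full search execution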

For product data, time-based TTL alone is insufficient. A price update at 2:47 PM should not wait until 3:47 PM to be reflected in search results. The solution is event-driven invalidation through Kafka.

The ingestion worker consumes a product-updates Kafka topic and applies targeted invalidation based on event type:

Event Type                   Action          Rationale
PRICE_UPDATE                 SOFT_EXPIRED    Serve existing result briefly while refresh runs
INVENTORY_UPDATE             SOFT_EXPIRED    Same - short staleness acceptable
PROMOTION_UPDATE             SOFT_EXPIRED    Promotional relevance changes quickly but not instantly
FULL_UPDATE (core fields)    HARD_EXPIRED    Title, category, or brand changed - never serve stale

This granularity matters. Marking everything HARD_EXPIRED on any product change would spike latency every time a bulk import runs. The tiered approach ensures that minor data changes (price drift, stock fluctuation) degrade gracefully through the soft expiry path, while structural changes (product renamed, recategorised) force immediate recomputation.

PostgreSQL’s NOTIFY mechanism then propagates the invalidation signal from the database to all running service instances, ensuring L1 (in-process) caches are also cleared for affected entries.
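
On the worker side, the consumer is mostly a mapping from event type to expiry action. A minimal sketch, assuming kafka-python, a JSON payload with event_type and product_id fields, and a placeholder mark_entries_for_product() helper - all of which are illustrative, not the actual ingestion worker:

import json
from kafka import KafkaConsumer

# Map event types to the freshness state they force (see the table above)
SOFT_EXPIRE_EVENTS = {"PRICE_UPDATE", "INVENTORY_UPDATE", "PROMOTION_UPDATE"}
HARD_EXPIRE_EVENTS = {"FULL_UPDATE"}

consumer = KafkaConsumer(
    "product-updates",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw),
)

for message in consumer:
    event = message.value
    event_type = event["event_type"]      # illustrative field name
    product_id = event["product_id"]      # illustrative field name

    if event_type in HARD_EXPIRE_EVENTS:
        mark_entries_for_product(product_id, state="HARD_EXPIRED")
    elif event_type in SOFT_EXPIRE_EVENTS:
        mark_entries_for_product(product_id, state="SOFT_EXPIRED")
    # mark_entries_for_product() stands in for the UPDATE on query_cache plus the
    # pg_notify() call that evicts matching L1 entries in every running instance.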


Observability: What to Actually Measure

Deploying a semantic cache without instrumentation is building in the dark. These are the metrics that actually matter in production:

Cache layer hit distribution

Track hits at each layer separately:

l1_hits            - in-process cache hits
l2_hits            - exact query match hits  
lexical_near_hits  - pg_trgm similarity hits (≥ 0.76)
semantic_hits      - vector similarity hits (≥ 0.84)

The ratio across layers tells you where your cache is doing work and where it is not. If l1_hits is low relative to traffic, your L1 TTL may be too short or your query diversity is too high for in-process caching to be worthwhile. If semantic_hits is negligible, either your threshold is too conservative or your workload is genuinely diverse.
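
Wiring these up with prometheus_client is a few lines; the metric names follow the list above, everything else here is an assumption about how your lookup paths are structured:

from prometheus_client import Counter

L1_HITS = Counter("l1_hits", "In-process cache hits")
L2_HITS = Counter("l2_hits", "Exact query match hits")
LEXICAL_NEAR_HITS = Counter("lexical_near_hits", "pg_trgm similarity hits")
SEMANTIC_HITS = Counter("semantic_hits", "Vector similarity hits")
CACHE_MISSES = Counter("cache_misses", "Full search executions")

def record_outcome(layer: str) -> None:
    # Called once per lookup so every path reports its outcome the same way
    {
        "l1": L1_HITS,
        "l2": L2_HITS,
        "lexical": LEXICAL_NEAR_HITS,
        "semantic": SEMANTIC_HITS,
    }.get(layer, CACHE_MISSES).inc()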

Freshness-aware hit splits

active_cache_hits        - hits served from fully fresh entries
soft_expired_cache_hits  - hits served while background refresh is pending
hard_expired_rejects     - entries present but refused due to hard expiry

A rising soft_expired_cache_hits rate is not a problem - it is the system working as intended. A rising hard_expired_rejects rate indicates either your hard TTL is too short or your event-driven invalidation is marking too many entries as hard-expired on minor data changes.

Corrected cache rate

As discussed: the fraction of cache hits that were subsequently identified as incorrect. Keep this below 1%. If it rises, your similarity threshold needs tightening.

Cache build and invalidation lag

How long does it take from a Kafka product-updates event to full cache invalidation across all service instances? If this lag exceeds your acceptable staleness window, your invalidation pipeline has a bottleneck.

HNSW index health

pgvector 0.8.0+ exposes index build metrics. Monitor recall by periodically comparing HNSW approximate search results against exact sequential scan results on a sample of queries. If recall degrades, your ef_search parameter (default: 40 in pgvector) may need tuning for your data distribution.
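
A periodic recall check can be a small script. The sketch below assumes a psycopg2 connection, query vectors already serialised in pgvector’s “[x, y, ...]” text format, and the query_cache table described earlier; disabling index scans forces PostgreSQL into an exact sequential scan for the ground-truth comparison:

def measure_recall(conn, query_vectors, k=10, ef_search=40):
    # conn is a psycopg2 connection; returns mean recall@k over the sample
    overlaps = []
    with conn.cursor() as cur:
        cur.execute(f"SET hnsw.ef_search = {int(ef_search)}")
        for vec in query_vectors:
            # Approximate: the planner uses the HNSW index
            cur.execute(
                "SELECT id FROM query_cache ORDER BY embedding <=> %s::vector LIMIT %s",
                (vec, k),
            )
            approx = {row[0] for row in cur.fetchall()}

            # Exact: with index scans disabled, the same query runs as a
            # sequential scan and returns true nearest neighbours (slow - sample)
            cur.execute("SET enable_indexscan = off")
            cur.execute(
                "SELECT id FROM query_cache ORDER BY embedding <=> %s::vector LIMIT %s",
                (vec, k),
            )
            exact = {row[0] for row in cur.fetchall()}
            cur.execute("SET enable_indexscan = on")

            overlaps.append(len(approx & exact) / k)
    return sum(overlaps) / len(overlaps)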


Choosing Your Embedding Model

The embedding model is the foundation of your semantic cache. Every decision downstream - threshold tuning, storage cost, inference latency - flows from this choice. Choosing the wrong one early means a full cache invalidation and rebuild when you switch, so it is worth getting right before you go to production.

My own path went from all-MiniLM-L6-v2 to BAAI/bge-base-en-v1.5. The specs tell part of the story:

Model                    Dimensions    Parameters    Architecture
all-MiniLM-L6-v2         384           22.7M         MiniLM
BAAI/bge-base-en-v1.5    768           109M          BERT-base

all-MiniLM-L6-v2 is fast and lightweight - genuinely useful for edge deployments or resource-constrained environments. For production retrieval quality, it is outclassed. Multiple independent benchmarks confirm that bge-base-en-v1.5 significantly outperforms MiniLM on retrieval tasks.

The more important difference for semantic caching specifically is similarity score distribution. all-MiniLM-L6-v2 produces cosine scores that cluster together - the gap between a genuine semantic match and a superficially similar query is narrow. This makes thresholds hard to tune reliably: shifting from 0.84 to 0.88 may barely change your hit rate, but shift below 0.80 and you start serving wrong results. BGE v1.5 was released in September 2023 specifically to address poor similarity distribution in the earlier v1 series, and the difference in practice is real - you get cleaner separation between genuine matches and near-misses, which makes the two-zone threshold model much easier to calibrate.

For production workloads:

  • Self-hosted, moderate scale: BAAI/bge-base-en-v1.5 is a strong default. 768 dimensions, max 512-token input, compatible with sentence-transformers out of the box.
  • Higher accuracy requirements: BAAI/bge-large-en-v1.5 improves retrieval further at the cost of inference speed and memory - benchmark it against your workload before committing.
  • Multilingual requirements: The BGE-M3 family is designed for multilingual retrieval - relevant if your search handles non-English queries.

One operational note: bake your embedding model into your container image at build time. Pulling a model from a remote hub at startup introduces latency, network dependency, and versioning ambiguity. Pin the model version and treat it like any other dependency.
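
For reference, swapping models with sentence-transformers looks like this - a minimal sketch, with example queries of my own choosing; the model string can equally be a local directory path baked into the image:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

queries = [
    "wireless headphones under 100",
    "budget bluetooth headphones",
    "red trail shoes",
]
# BGE models are trained for cosine similarity on normalised embeddings
embeddings = model.encode(queries, normalize_embeddings=True)

# The scores the two-zone thresholds act on
print(util.cos_sim(embeddings[0], embeddings[1]))  # paraphrase pair - high
print(util.cos_sim(embeddings[0], embeddings[2]))  # different intent - lower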


Embedding Model Upgrade: Doing It Without Breaking Production

Model upgrades deserve their own section because the operational risk is underestimated. When you move from one embedding model to another, you are not just swapping a dependency - you are invalidating your entire semantic cache corpus.

The safest migration path:

  1. Version-stamp all stored embeddings. Add a model_version column to your cache table. Every entry stores which model generated its embedding.

  2. Deploy the new model in parallel. New queries begin generating embeddings with the new model. Old cached entries remain queryable using the old model for a transition period. This requires routing logic that selects the correct model for comparison based on the model_version of the cached entry.

  3. Warm the new cache incrementally. Re-embed your most frequently accessed queries (sort by hit_count descending) first. These are the entries where cache coverage matters most. Do this as a background job, not a blocking migration.

  4. Hard-expire old model entries. Once the new model cache reaches acceptable coverage, mark all old-model entries as HARD_EXPIRED. They will be replaced naturally as queries hit them.

  5. Drop the old model. Once old-model entries have cycled out, remove the old model dependency.

This approach avoids the cold start problem - where your cache is completely empty immediately after a migration - and limits the blast radius of any new model performing worse than expected on your specific workload.
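
A sketch of the step 3 warming job, assuming a psycopg2 connection, the query_cache table with the model_version and hit_count columns described above, and an embedding_v2 column added for the migration (needed because the original vector column is fixed at the old model’s dimension); embed_fn returns the new embedding in pgvector’s text format:

OLD_MODEL = "all-MiniLM-L6-v2"
NEW_MODEL = "BAAI/bge-base-en-v1.5"
BATCH_SIZE = 500

def warm_new_model(conn, embed_fn):
    # Re-embed the hottest old-model entries first, one background batch at a time
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, query_text FROM query_cache
            WHERE model_version = %s
            ORDER BY hit_count DESC
            LIMIT %s
            """,
            (OLD_MODEL, BATCH_SIZE),
        )
        rows = cur.fetchall()
        for entry_id, query_text in rows:
            cur.execute(
                "UPDATE query_cache SET embedding_v2 = %s, model_version = %s WHERE id = %s",
                (embed_fn(query_text), NEW_MODEL, entry_id),
            )
    conn.commit()
    return len(rows)   # caller keeps invoking the job until this drops to zero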


When Semantic Caching Is Worth It (and When It Is Not)

Semantic caching is not universally applicable. Here is an honest assessment:

It is worth it when:

  • Your workload contains significant semantic redundancy - a meaningful fraction of queries are paraphrases of previously seen queries
  • The computation being cached is expensive in either cost (LLM API tokens) or latency (ANN search over large corpora)
  • Your data is stable enough that cached results remain valid for minutes to hours, not seconds
  • You have the observability infrastructure to detect and correct threshold misconfiguration
  • You are operating at a scale where the engineering overhead is amortised

It is not worth it when:

  • Every query is unique - agentic workflows, highly personalised queries, or multi-turn conversations where context changes the meaning of every message
  • Your data changes faster than any meaningful TTL window - real-time inventory with per-second updates will race past any soft expiry you set
  • You are pre-launch - shadow mode requires stable traffic to calibrate thresholds, and you cannot tune against traffic you do not yet have
  • Your team cannot sustain the operational overhead - a misconfigured semantic cache that serves wrong results with confidence is worse than no cache at all

The inflection point I have seen in practice is roughly this: if you are making more than a few hundred LLM API calls per hour and your query patterns repeat meaningfully, the numbers work out. Below that volume, you will spend more engineering time tuning thresholds than you will ever recover in API savings. Run the numbers before you build.


Closing Thoughts

Semantic caching is one of those systems that looks deceptively simple in a prototype and reveals its complexity only in production. The concept - match by meaning rather than identity - is straightforward. What catches most teams off guard is not the theory. It is the first time a model upgrade invalidates months of cached embeddings and the hit rate drops to zero overnight, or the first time a poisoned result propagates silently to thousands of users before anyone notices.

The architecture described here - a 4-layer pipeline with L1 in-process caching, L2 exact matching, lexical near-hit detection, and two-zone semantic scoring - is not theoretical. It is grounded in real engineering decisions made against real constraints. The Kafka-driven invalidation model, the adaptive TTL formula, the intent conflict detection, the Prometheus instrumentation - each of these exists because a simpler approach proved insufficient.

If you take one thing from this post, let it be this: run in shadow mode before you serve from cache. Understand your own workload before you commit to a threshold. And instrument everything - because a semantic cache you cannot observe is a liability, not an asset.

The code for the search service discussed in this post is available in the hopnshoppe repository for reference. If you are tuning thresholds for a different domain, start by reading the pgvector documentation on HNSW index tuning and the Portkey guide on semantic caching thresholds - both are worth the hour.