Every conversational AI agent has a fundamental constraint: the context window. It's the maximum amount of text the model can process in a single request—the agent's working memory. Exceed it and your agent loses track of the conversation, hallucinates, or crashes entirely.
GPT-4o supports 128K tokens. Claude 3.5 Sonnet handles 200K. Gemini 1.5 Pro stretches to 2M. These numbers sound massive, but in practice, a 30-minute customer support conversation with tool calls, retrieved documents, and system prompts can fill 128K tokens faster than you'd expect.
This guide covers the architectural patterns for building conversational agents that handle context intelligently—keeping relevant information accessible while staying within token budgets.
The context window includes everything the model processes in one call: the system prompt, the conversation history, retrieved documents, tool definitions and results, the user's current message, and the tokens reserved for the model's response.
A simple formula for available context:
```python
available_for_history = (
    model_context_limit
    - system_prompt_tokens
    - retrieved_context_tokens
    - current_query_tokens
    - max_output_tokens
    - safety_buffer  # ~500 tokens
)
```
For a 128K model with a 1,500-token system prompt, 3,000 tokens of RAG context, and 4,096 max output tokens, you have roughly 119,000 tokens for conversation history. That's about 90,000 words—plenty for most conversations, but multi-day agent sessions with heavy tool use can exhaust this.
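The arithmetic above can be wrapped in a small helper. A sketch (the function name and defaults are illustrative, not from any library):

```python
def available_history_tokens(
    model_context_limit: int,
    system_prompt: int,
    retrieved_context: int,
    current_query: int,
    max_output: int,
    safety_buffer: int = 500,
) -> int:
    """Tokens left for conversation history after all fixed costs."""
    return (
        model_context_limit
        - system_prompt
        - retrieved_context
        - current_query
        - max_output
        - safety_buffer
    )

# The 128K example from above (query tokens not counted there):
budget = available_history_tokens(128_000, 1_500, 3_000, 0, 4_096)
print(budget)  # 118904 — roughly 119K tokens
```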
There are four primary strategies, each with different trade-offs:
| Strategy | Complexity | Information Loss | Best For |
|---|---|---|---|
| Sliding Window | Low | High (drops old messages) | Simple chatbots |
| Summarization | Medium | Medium (compressed history) | Long conversations |
| RAG-backed Memory | High | Low (searchable history) | Knowledge-heavy agents |
| Hybrid (Summary + RAG) | High | Very Low | Production agents |
The simplest approach: keep the last N messages and drop everything older. It's a FIFO queue for conversation history.
```python
class SlidingWindowMemory:
    def __init__(self, max_tokens: int = 8000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        total = sum(self._count_tokens(m["content"]) for m in self.messages)
        while total > self.max_tokens and len(self.messages) > 1:
            removed = self.messages.pop(0)  # drop the oldest message first
            total -= self._count_tokens(removed["content"])

    def get_messages(self):
        return self.messages.copy()

    def _count_tokens(self, text: str) -> int:
        return len(text) // 4  # rough approximation: ~4 characters per token
```
When to use: Quick prototypes, chatbots where conversation history beyond 10 turns doesn't matter, customer support bots with short interactions.
Limitation: The agent literally forgets. If a user mentioned their name in message 1 and you've dropped it, the agent can't recall it in message 20.
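The forgetting behavior is easy to see with a toy window capped by message count instead of tokens (a simplification of the class above):

```python
from collections import deque

# A window that keeps only the 4 most recent messages
window = deque(maxlen=4)
for i in range(6):
    window.append(f"message {i}")

print(list(window))  # ['message 2', 'message 3', 'message 4', 'message 5']
# Messages 0 and 1 — including anything the user said there — are gone.
```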
Instead of dropping old messages, summarize them. The agent maintains a running summary of the conversation that gets updated periodically.
```python
class SummarizingMemory:
    def __init__(self, llm_client, summary_threshold: int = 4000):
        self.llm = llm_client
        self.summary = ""
        self.recent_messages = []
        self.summary_threshold = summary_threshold

    async def add(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        recent_tokens = sum(
            self._count_tokens(m["content"]) for m in self.recent_messages
        )
        if recent_tokens > self.summary_threshold:
            await self._compress()

    async def _compress(self):
        # Keep the last 2 messages verbatim, summarize the rest
        to_summarize = self.recent_messages[:-2]
        self.recent_messages = self.recent_messages[-2:]
        history_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in to_summarize
        )
        prompt = f"""Summarize this conversation segment concisely.
Preserve: user preferences, decisions made, key facts, action items.
Drop: pleasantries, redundant information.

Existing summary: {self.summary}

New messages:
{history_text}

Updated summary:"""
        self.summary = await self.llm.complete(prompt)

    def get_context(self):
        parts = []
        if self.summary:
            parts.append({"role": "system", "content": f"Conversation summary: {self.summary}"})
        parts.extend(self.recent_messages)
        return parts

    def _count_tokens(self, text: str) -> int:
        return len(text) // 4  # rough approximation: ~4 characters per token
```
Trade-off: Summarization costs tokens (you're making an extra LLM call), and some detail is inevitably lost. But the agent retains the gist of the entire conversation indefinitely.
Store every message in a vector database. When the agent needs historical context, it retrieves the most relevant past messages using semantic search.
```python
import time
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

class RAGMemory:
    def __init__(self, qdrant_url: str, embed_fn, vector_size: int = 1536):
        self.client = QdrantClient(url=qdrant_url)
        self.embed = embed_fn
        self.collection = "conversation_memory"
        # Create the collection on first use; vector_size must match the embedding model
        if not self.client.collection_exists(self.collection):
            self.client.create_collection(
                collection_name=self.collection,
                vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
            )

    async def store(self, session_id: str, role: str, content: str, turn: int):
        embedding = await self.embed(content)
        self.client.upsert(
            collection_name=self.collection,
            points=[PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={
                    "session_id": session_id,
                    "role": role,
                    "content": content,
                    "turn": turn,
                    "timestamp": time.time(),
                },
            )],
        )

    async def retrieve(self, session_id: str, query: str, top_k: int = 5):
        embedding = await self.embed(query)
        results = self.client.search(
            collection_name=self.collection,
            query_vector=embedding,
            query_filter=Filter(must=[
                FieldCondition(key="session_id", match=MatchValue(value=session_id))
            ]),
            limit=top_k,
        )
        return [hit.payload for hit in results]
```
Advantage: The agent can recall any detail from the conversation, even from hours ago. Search is semantic, so asking "what did we discuss about pricing?" retrieves pricing-related messages regardless of exact wording.
Limitation: Retrieval adds latency (50-200ms), costs embedding tokens, and might miss context that wasn't in the top-K results.
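Under the hood, retrieval ranks stored messages by vector similarity. A toy sketch with hand-made 3-dimensional vectors (real systems use model-generated embeddings with hundreds of dimensions; the vectors and messages here are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings for stored messages
memory = {
    "we agreed on $49/month pricing": [0.9, 0.1, 0.0],
    "the demo is scheduled for Friday": [0.1, 0.9, 0.1],
    "enterprise tier costs $499": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # embedding of "what did we discuss about pricing?"

# Top-2 messages by similarity — both pricing messages rank above the demo one
top_2 = sorted(memory, key=lambda m: cosine(memory[m], query_vec), reverse=True)[:2]
print(top_2)
```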
Production agents typically combine the three base strategies: a sliding window for the most recent turns, a running summary for older history, and RAG retrieval for precise recall of specific details.
The context sent to the LLM looks like this:
```
[System Prompt]
[Conversation Summary: "User is building a fintech MVP. They chose React Native..."]
[Retrieved Context: relevant past messages based on current query]
[Last 5 messages in full]
[Current user message]
```
This gives the agent the best of all worlds: immediate context from recent messages, broad awareness from the summary, and precise recall from vector search.
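Assembling that layout is mostly list concatenation. A minimal sketch (message shapes follow the common role/content convention; the function name and example values are illustrative):

```python
def build_context(system_prompt, summary, retrieved, recent, user_message):
    """Assemble the hybrid context in the layered order described above."""
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({"role": "system", "content": f"Conversation summary: {summary}"})
    for chunk in retrieved:
        messages.append({"role": "system", "content": f"Relevant earlier exchange: {chunk}"})
    messages.extend(recent)  # last few turns, verbatim
    messages.append({"role": "user", "content": user_message})
    return messages

ctx = build_context(
    "You are a helpful assistant.",
    "User is building a fintech MVP with React Native.",
    ["user asked about Stripe vs Plaid"],
    [{"role": "assistant", "content": "Stripe fits your use case."}],
    "What about compliance?",
)
```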
Users expect agents to remember across sessions. "Last week we discussed X" should work. This requires persistent memory beyond the conversation itself: a durable store keyed by user ID that survives individual sessions.
Allocate your context window like a budget:
| Component | Token Budget | Notes |
|---|---|---|
| System prompt | 1,000-2,000 | Keep tight; every token here costs on every request |
| Conversation summary | 500-1,000 | Compress aggressively |
| Retrieved context (RAG) | 2,000-4,000 | Top 3-5 chunks |
| Recent messages | 3,000-8,000 | Last 5-10 turns |
| Current query | 100-500 | User's message |
| Output reserve | 2,000-4,096 | Model's response |
| Safety buffer | 500 | Tokenizer estimation errors |
Monitor actual usage against these budgets. If retrieved context regularly exceeds its budget, your chunks are too large or you're retrieving too many.
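A simple guard can flag components that blow their budget before the request is sent. A sketch using the table's upper bounds as illustrative defaults:

```python
DEFAULT_BUDGETS = {  # upper bounds from the table above (illustrative)
    "system_prompt": 2_000,
    "summary": 1_000,
    "retrieved_context": 4_000,
    "recent_messages": 8_000,
    "current_query": 500,
}

def over_budget(usage, budgets=DEFAULT_BUDGETS):
    """Return the components whose measured token usage exceeds their budget."""
    return {
        name: (used, budgets[name])
        for name, used in usage.items()
        if name in budgets and used > budgets[name]
    }

violations = over_budget({"system_prompt": 1_800, "retrieved_context": 6_500})
print(violations)  # {'retrieved_context': (6500, 4000)}
```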
At Propelius Technologies, we build conversational AI agents with hybrid memory architectures that balance recall, cost, and latency. Our agents maintain context across sessions while keeping token costs predictable.
Start by measuring your typical conversation length in tokens. Most customer support conversations fit within 8K-16K tokens. If your agent handles complex multi-step tasks with document retrieval, you'll need 32K-128K. Don't default to the maximum—larger contexts cost more and can degrade quality. Use the hybrid memory approach to keep effective context small while maintaining full recall.
Summarization can lose important details, which is why you should give the summarization prompt explicit instructions about what to preserve: user preferences, decisions, action items, and key facts. Pair summarization with RAG-backed memory so specific details can still be retrieved on demand. In practice, a well-tuned summarization prompt preserves 90%+ of actionable information at 10-20% of the original token count.
Store conversation data in a persistent vector database (Qdrant, Pinecone, Weaviate) indexed by user ID. At the start of each session, retrieve relevant past context using the user's first message as a search query. Maintain a structured user profile with key facts that persists across all sessions. This gives the agent continuity without replaying entire conversation histories.
Context management affects cost significantly. With GPT-4o at $2.50/1M input tokens, sending 100K tokens per request costs $0.25 per request. At 1,000 requests per day, that's $250/day just for input tokens. By managing context to average 10K tokens per request, you'd pay $25/day—a 10x reduction. Context management is one of the highest-ROI optimizations for AI agent costs.
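The arithmetic is worth checking in code. A quick sketch using the $2.50-per-million-input-token GPT-4o rate quoted above:

```python
def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     price_per_million: float = 2.50) -> float:
    """Daily input-token spend in dollars."""
    return tokens_per_request * requests_per_day * price_per_million / 1_000_000

print(daily_input_cost(100_000, 1_000))  # 250.0 — $250/day at 100K tokens/request
print(daily_input_cost(10_000, 1_000))   # 25.0  — $25/day with managed context
```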