Building Conversational AI Agents with Context Windows

Feb 23, 2026
9 min read

Every conversational AI agent has a fundamental constraint: the context window. It's the maximum amount of text the model can process in a single request—the agent's working memory. Exceed it and your agent loses track of the conversation, hallucinates, or crashes entirely.

GPT-4o supports 128K tokens. Claude 3.5 Sonnet handles 200K. Gemini 1.5 Pro stretches to 2M. These numbers sound massive, but in practice, a 30-minute customer support conversation with tool calls, retrieved documents, and system prompts can fill 128K tokens faster than you'd expect.

This guide covers the architectural patterns for building conversational agents that handle context intelligently—keeping relevant information accessible while staying within token budgets.

Context Window Fundamentals

The context window includes everything the model processes in one call:

  • System prompt: Instructions, persona, guardrails (typically 500-2,000 tokens)
  • Conversation history: All previous messages (grows with each turn)
  • Retrieved context: Documents from RAG, tool outputs, database results
  • Current query: The user's latest message
  • Output budget: Space reserved for the model's response
Figure: Context window composition

A simple formula for available context:

available_for_history = (
    model_context_limit
    - system_prompt_tokens
    - retrieved_context_tokens
    - current_query_tokens
    - max_output_tokens
    - safety_buffer  # ~500 tokens
)

For a 128K model with a 1,500-token system prompt, 3,000 tokens of RAG context, and 4,096 max output tokens, you have roughly 119,000 tokens for conversation history. That's about 90,000 words—plenty for most conversations, but multi-day agent sessions with heavy tool use can exhaust this.
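The formula above can be wrapped in a small helper. This is a sketch using the worked numbers from this section (the example omits the current query's tokens, which are negligible here):

```python
def available_history_tokens(
    model_context_limit: int,
    system_prompt_tokens: int,
    retrieved_context_tokens: int,
    current_query_tokens: int,
    max_output_tokens: int,
    safety_buffer: int = 500,
) -> int:
    """Tokens left for conversation history after all fixed costs."""
    return (
        model_context_limit
        - system_prompt_tokens
        - retrieved_context_tokens
        - current_query_tokens
        - max_output_tokens
        - safety_buffer
    )

# Worked example from above: 128K model, 1,500-token system prompt,
# 3,000 tokens of RAG context, 4,096 max output tokens.
print(available_history_tokens(128_000, 1_500, 3_000, 0, 4_096))  # 118904
```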

Context Management Strategies

There are four primary strategies, each with different trade-offs:

| Strategy | Complexity | Information Loss | Best For |
|---|---|---|---|
| Sliding Window | Low | High (drops old messages) | Simple chatbots |
| Summarization | Medium | Medium (compressed history) | Long conversations |
| RAG-backed Memory | High | Low (searchable history) | Knowledge-heavy agents |
| Hybrid (Summary + RAG) | High | Very Low | Production agents |

Strategy 1: Sliding Window

The simplest approach: keep the last N messages and drop everything older. It's a FIFO queue for conversation history.

class SlidingWindowMemory:
    def __init__(self, max_tokens: int = 8000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        total = sum(self._count_tokens(m["content"]) for m in self.messages)
        while total > self.max_tokens and len(self.messages) > 1:
            removed = self.messages.pop(0)
            total -= self._count_tokens(removed["content"])

    def get_messages(self):
        return self.messages.copy()

    def _count_tokens(self, text: str) -> int:
        return len(text) // 4  # Rough approximation

When to use: Quick prototypes, chatbots where conversation history beyond 10 turns doesn't matter, customer support bots with short interactions.

Limitation: The agent literally forgets. If a user mentioned their name in message 1 and you've dropped it, the agent can't recall it in message 20.

Strategy 2: Progressive Summarization

Instead of dropping old messages, summarize them. The agent maintains a running summary of the conversation that gets updated periodically.

class SummarizingMemory:
    def __init__(self, llm_client, summary_threshold: int = 4000):
        self.llm = llm_client
        self.summary = ""
        self.recent_messages = []
        self.summary_threshold = summary_threshold

    async def add(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        
        recent_tokens = sum(
            self._count_tokens(m["content"]) for m in self.recent_messages
        )
        if recent_tokens > self.summary_threshold:
            await self._compress()

    async def _compress(self):
        # Keep last 2 messages, summarize the rest
        to_summarize = self.recent_messages[:-2]
        self.recent_messages = self.recent_messages[-2:]
        
        history_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in to_summarize
        )
        
        prompt = f"""Summarize this conversation segment concisely.
Preserve: user preferences, decisions made, key facts, action items.
Drop: pleasantries, redundant information.

Existing summary: {self.summary}

New messages:
{history_text}

Updated summary:"""
        
        self.summary = await self.llm.complete(prompt)

    def get_context(self):
        parts = []
        if self.summary:
            parts.append({"role": "system", "content": f"Conversation summary: {self.summary}"})
        parts.extend(self.recent_messages)
        return parts

    def _count_tokens(self, text: str) -> int:
        return len(text) // 4  # Rough approximation
Figure: Progressive summarization architecture

Trade-off: Summarization costs tokens (you're making an extra LLM call), and some detail is inevitably lost. But the agent retains the gist of the entire conversation indefinitely.

Strategy 3: RAG-Backed Conversation Memory

Store every message in a vector database. When the agent needs historical context, it retrieves the most relevant past messages using semantic search.

import time
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams
)

class RAGMemory:
    def __init__(self, qdrant_url: str, embed_fn, vector_size: int = 1536):
        self.client = QdrantClient(url=qdrant_url)
        self.embed = embed_fn
        self.collection = "conversation_memory"
        # Create the collection on first use
        if not self.client.collection_exists(self.collection):
            self.client.create_collection(
                collection_name=self.collection,
                vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE),
            )

    async def store(self, session_id: str, role: str, content: str, turn: int):
        embedding = await self.embed(content)
        self.client.upsert(
            collection_name=self.collection,
            points=[PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding,
                payload={
                    "session_id": session_id,
                    "role": role,
                    "content": content,
                    "turn": turn,
                    "timestamp": time.time(),
                },
            )],
        )

    async def retrieve(self, session_id: str, query: str, top_k: int = 5):
        embedding = await self.embed(query)
        results = self.client.search(
            collection_name=self.collection,
            query_vector=embedding,
            query_filter=Filter(must=[
                FieldCondition(key="session_id", match=MatchValue(value=session_id))
            ]),
            limit=top_k,
        )
        return [hit.payload for hit in results]

Advantage: The agent can recall any detail from the conversation, even from hours ago. Search is semantic, so asking "what did we discuss about pricing?" retrieves pricing-related messages regardless of exact wording.

Limitation: Retrieval adds latency (50-200ms), costs embedding tokens, and might miss context that wasn't in the top-K results.
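One practical detail: retrieved hits come back ranked by similarity, not conversation order, which reads oddly when re-injected into the prompt. A small helper (a sketch, assuming payloads shaped like the ones stored above) restores chronological order:

```python
def format_retrieved(payloads: list[dict]) -> str:
    """Sort retrieved messages by turn and render them for the prompt."""
    ordered = sorted(payloads, key=lambda p: p["turn"])
    return "\n".join(
        f"[turn {p['turn']}] {p['role']}: {p['content']}" for p in ordered
    )

hits = [
    {"turn": 7, "role": "assistant", "content": "Pro plan is $49/month."},
    {"turn": 6, "role": "user", "content": "What does the Pro plan cost?"},
]
print(format_retrieved(hits))
```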

Strategy 4: Hybrid Memory Architecture

Production agents typically combine all three strategies:

  1. Short-term: Last 5-10 messages in full (sliding window)
  2. Medium-term: Running summary of the full conversation (summarization)
  3. Long-term: All messages stored in vector DB, retrieved on demand (RAG)

The context sent to the LLM looks like this:

[System Prompt]
[Conversation Summary: "User is building a fintech MVP. They chose React Native..."]
[Retrieved Context: relevant past messages based on current query]
[Last 5 messages in full]
[Current user message]

This gives the agent the best of all worlds: immediate context from recent messages, broad awareness from the summary, and precise recall from vector search.
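Assembling that layout is mostly list plumbing. A minimal sketch (the summary, retrieved snippets, and recent messages are stand-ins for the outputs of the classes above):

```python
def build_context(
    system_prompt: str,
    summary: str,
    retrieved: list[str],
    recent: list[dict],
    user_message: str,
) -> list[dict]:
    """Assemble the hybrid context in the order shown above."""
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({"role": "system", "content": f"Conversation summary: {summary}"})
    if retrieved:
        joined = "\n".join(retrieved)
        messages.append({"role": "system", "content": f"Relevant past messages:\n{joined}"})
    messages.extend(recent)  # last 5-10 messages in full
    messages.append({"role": "user", "content": user_message})
    return messages

ctx = build_context(
    "You are a helpful assistant.",
    "User is building a fintech MVP with React Native.",
    ["user: What does the Pro plan cost?"],
    [{"role": "assistant", "content": "Happy to help with pricing."}],
    "Remind me what we decided on pricing?",
)
print(len(ctx))  # 5
```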

Multi-Session Memory

Users expect agents to remember across sessions. "Last week we discussed X" should work. This requires persistent memory beyond the conversation:

  • User profile store: Key facts about the user (name, preferences, company, past decisions) stored in a structured database.
  • Cross-session RAG: All conversations indexed in the same vector collection, searchable across sessions.
  • Entity memory: Extract and store entities (people, companies, projects) mentioned in conversations.
Figure: Multi-session memory layers
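The user profile store can start as something very simple. A sketch backed by a JSON file (the file path, class name, and schema are illustrative, not a fixed API):

```python
import json
import tempfile
from pathlib import Path

class UserProfileStore:
    """Persist key facts about a user across sessions."""

    def __init__(self, path="profiles.json"):
        self.path = Path(path)
        self.profiles = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def remember(self, user_id: str, key: str, value: str):
        self.profiles.setdefault(user_id, {})[key] = value
        self.path.write_text(json.dumps(self.profiles, indent=2))

    def recall(self, user_id: str) -> dict:
        return self.profiles.get(user_id, {})

store = UserProfileStore(Path(tempfile.mkdtemp()) / "profiles.json")
store.remember("u123", "preferred_stack", "React Native")
print(store.recall("u123"))  # {'preferred_stack': 'React Native'}
```

In production this would live in Postgres or Redis keyed by user ID, but the interface stays the same: write facts as they surface, read them at session start.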

Token Budgeting in Practice

Allocate your context window like a budget:

| Component | Token Budget | Notes |
|---|---|---|
| System prompt | 1,000-2,000 | Keep tight; every token here costs on every request |
| Conversation summary | 500-1,000 | Compress aggressively |
| Retrieved context (RAG) | 2,000-4,000 | Top 3-5 chunks |
| Recent messages | 3,000-8,000 | Last 5-10 turns |
| Current query | 100-500 | User's message |
| Output reserve | 2,000-4,096 | Model's response |
| Safety buffer | 500 | Tokenizer estimation errors |

Monitor actual usage against these budgets. If retrieved context regularly exceeds its budget, your chunks are too large or you're retrieving too many.
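A lightweight budget check makes that monitoring concrete. This sketch mirrors the upper bounds from the table above (the dict keys and thresholds are illustrative):

```python
BUDGETS = {
    "system_prompt": 2_000,
    "summary": 1_000,
    "retrieved_context": 4_000,
    "recent_messages": 8_000,
    "current_query": 500,
}

def check_budgets(usage: dict[str, int]) -> list[str]:
    """Return a warning for each component that exceeds its token budget."""
    return [
        f"{name}: {tokens} tokens exceeds budget of {BUDGETS[name]}"
        for name, tokens in usage.items()
        if tokens > BUDGETS.get(name, float("inf"))
    ]

warnings = check_budgets({"system_prompt": 1_800, "retrieved_context": 6_200})
print(warnings)  # ['retrieved_context: 6200 tokens exceeds budget of 4000']
```

Wire the output into your logging or metrics pipeline so overruns show up before users notice degraded answers.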

Common Pitfalls

  • Stuffing the context window: More context isn't always better. Studies show LLM performance degrades with very long contexts (the "lost in the middle" problem). Keep context relevant and concise.
  • Ignoring token costs: A 128K context window filled on every request costs 10-50x more than a well-managed 8K window. Budget matters at scale.
  • No conversation reset: Long-running agent sessions accumulate stale context. Provide a way to start fresh or archive old threads.
  • Forgetting tool outputs: Tool call results (database queries, API responses) can be massive. Summarize them before adding to context.
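For the last pitfall, even naive truncation beats dumping a raw API response into context. A sketch that keeps the head and tail of oversized output (the 2,000-token cap and ~4-chars-per-token heuristic are assumptions to tune):

```python
def truncate_tool_output(output: str, max_tokens: int = 2_000) -> str:
    """Clip oversized tool output, keeping the head and tail."""
    max_chars = max_tokens * 4  # rough chars-per-token heuristic
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    return output[:half] + "\n...[tool output truncated]...\n" + output[-half:]

short = truncate_tool_output("row1\nrow2")
clipped = truncate_tool_output("x" * 50_000, max_tokens=100)
print(len(clipped) < 50_000)  # True
```

An LLM-generated summary of the tool result preserves more meaning than truncation, at the cost of an extra call; truncation is the cheap floor.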

At Propelius Technologies, we build conversational AI agents with hybrid memory architectures that balance recall, cost, and latency. Our agents maintain context across sessions while keeping token costs predictable.

FAQs

How do I choose the right context window size for my agent?

Start by measuring your typical conversation length in tokens. Most customer support conversations fit within 8K-16K tokens. If your agent handles complex multi-step tasks with document retrieval, you'll need 32K-128K. Don't default to the maximum—larger contexts cost more and can degrade quality. Use the hybrid memory approach to keep effective context small while maintaining full recall.

Does conversation summarization lose important details?

It can, which is why you should use explicit instructions about what to preserve: user preferences, decisions, action items, and key facts. Pair summarization with RAG-backed memory so specific details can still be retrieved on demand. In practice, a well-tuned summarization prompt preserves 90%+ of actionable information at 10-20% of the original token count.

How do I implement memory across multiple sessions?

Store conversation data in a persistent vector database (Qdrant, Pinecone, Weaviate) indexed by user ID. At the start of each session, retrieve relevant past context using the user's first message as a search query. Maintain a structured user profile with key facts that persists across all sessions. This gives the agent continuity without replaying entire conversation histories.

How much does context window usage affect costs?

Significantly. With GPT-4o at $2.50/1M input tokens, sending 100K tokens per request costs $0.25 per request. At 1,000 requests per day, that's $250/day just for input tokens. By managing context to average 10K tokens per request, you'd pay $25/day—a 10x reduction. Context management is one of the highest-ROI optimizations for AI agent costs.
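The arithmetic in that answer, as a quick sketch (treat the $2.50 per 1M input tokens as an input parameter, since model pricing changes):

```python
def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     price_per_million: float = 2.50) -> float:
    """Input-token cost per day, in dollars."""
    return tokens_per_request * requests_per_day * price_per_million / 1_000_000

print(daily_input_cost(100_000, 1_000))  # 250.0
print(daily_input_cost(10_000, 1_000))   # 25.0
```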

© 2026 Propelius Technologies. All rights reserved.