
LLM costs are the new cloud bill shock. What starts as $200/month in testing balloons to $5,000/month in production, then $20,000/month at scale. Unlike traditional infrastructure that you can optimize with caching and CDNs, AI costs scale directly with usage — every conversation, every document processed, every function call burns tokens.
But there's good news: you can cut costs 60-80% without sacrificing quality. At Propelius Technologies, we've built AI agents and automation systems for clients across industries. This guide shows you where the money goes and how to optimize it.
LLMs charge per token — roughly 0.75 words per token. Costs vary wildly by model:
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Use Case |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, high quality |
| GPT-4o-mini | $0.15 | $0.60 | Fast tasks, high volume |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long context, analysis |
| Claude 3 Haiku | $0.25 | $1.25 | Simple tasks, speed priority |
| Gemini 1.5 Flash | $0.075 | $0.30 | Budget-conscious, simple tasks |
| Llama 3.1 (self-host) | ~$0.01-0.05 | ~$0.01-0.05 | Private data, high volume |
Key insight: GPT-4o is roughly 17x more expensive than GPT-4o-mini ($2.50 vs $0.15 per million input tokens). If you can route 50% of requests to the cheaper model, you cut costs nearly in half.
Every message in your conversation history counts toward input tokens. A 50-turn conversation with 500 tokens per turn = 25K tokens of context every time the model responds.
Solution: Conversation summarization
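One way to implement this, sketched below with a placeholder summarizer (in practice you'd call a cheap model like GPT-4o-mini to write the summary):

```python
KEEP_RECENT = 4  # number of recent turns kept verbatim

def summarize(turns):
    # Placeholder: in production, call a cheap model to write the summary.
    return "Earlier conversation summary: " + "; ".join(
        t["content"][:40] for t in turns
    )

def compact_history(messages):
    """Replace older turns with one summary message to cap context size."""
    if len(messages) <= KEEP_RECENT:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    return [{"role": "system", "content": summarize(old)}] + recent
```

Instead of re-sending 25K tokens of history, each request now carries one short summary plus the last few turns.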
Verbose system prompts waste tokens. Every request pays the system prompt tax.
Bad prompt (400 tokens):
You are a helpful AI assistant designed to help users with a wide variety of tasks. You should always be polite, professional, and accurate in your responses. When answering questions, please make sure to provide detailed explanations whenever possible...
Good prompt (80 tokens):
You are a support assistant. Be concise, accurate, and helpful. Cite sources when available.
Savings: 320 tokens × 10,000 requests/month = 3.2M tokens saved (~$8-48/month depending on model)
Using GPT-4o for simple tasks is like hiring a brain surgeon to give flu shots.
Task classification:
If 100 users ask "What's your refund policy?" you're paying for 100 identical responses.
Solution: Semantic caching
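A minimal sketch of the idea, using a toy bag-of-words embedding as a stand-in for a real embedding model and vector store (the names and threshold are illustrative):

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.9  # tune per application

def embed(text):
    # Placeholder embedding: bag-of-words. In production, use a real
    # embedding model plus a vector store such as Redis.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cache = []  # list of (embedding, answer) pairs

def cached_answer(query):
    """Return a stored answer if a semantically similar query was seen."""
    q = embed(query)
    for emb, answer in cache:
        if cosine(q, emb) >= SIM_THRESHOLD:
            return answer
    return None

def store_answer(query, answer):
    cache.append((embed(query), answer))
```

Paraphrases of the same question hit the cache instead of the model, so the 100th "What's your refund policy?" costs nothing.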
RAG systems retrieve documents and inject them into context. But retrieving 10 documents × 1,000 tokens = 10K tokens of potentially irrelevant context.
Optimization strategies:
Route requests to the cheapest model that can handle them. Use a classifier or heuristics:
```python
# Simple keyword heuristic standing in for the original
# requires_reasoning / requires_deep_analysis helpers.
REASONING_HINTS = ("why", "explain", "analyze", "compare", "summarize")

def route_model(query: str) -> str:
    """Route each request to the cheapest model that can handle it."""
    needs_reasoning = any(h in query.lower() for h in REASONING_HINTS)
    if len(query) < 50 and not needs_reasoning:
        return "gpt-4o-mini"   # $0.15/M input: short, simple queries
    if needs_reasoning and len(query) > 200:
        return "gpt-4o"        # $2.50/M input: deep analysis
    return "claude-haiku"      # $0.25/M input: mid-tier default
```
Remove unnecessary words, use abbreviations, structure with JSON instead of prose:
Before:
Please analyze the following customer support ticket and determine whether it should be classified as a bug report, a feature request, or a general inquiry.
After:
Classify ticket: bug|feature|inquiry
Output tokens cost 4-5x more than input tokens. Set max_tokens aggressively:
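For example, a classification call only needs a one-word answer, so its max_tokens can be tiny. A sketch of the request parameters, assuming the OpenAI chat-completions API (model choice illustrative):

```python
def classification_params(ticket_text):
    """Build request parameters for a tight classification call."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "Classify ticket: bug|feature|inquiry"},
            {"role": "user", "content": ticket_text},
        ],
        "max_tokens": 5,   # one-word answer: caps worst-case output cost
        "temperature": 0,  # deterministic labels
    }

# Usage (assumes the openai SDK):
# client.chat.completions.create(**classification_params(text))
```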
Anthropic Claude and some other providers support prompt caching: repeated context (like system prompts or document context) is cached server-side, and cache reads are billed at a 90% discount.
Use for:
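A sketch of how the cache marker is attached, assuming the Anthropic Python SDK's messages API (model name and prompt are illustrative):

```python
def cached_system_block(system_prompt):
    """System block marked for Anthropic prompt caching. The cache_control
    marker tells the API to cache the prefix up to this block; later
    requests reusing the same prefix are billed at the cached-read rate."""
    return [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]

# Usage (assumes the anthropic SDK):
# client.messages.create(
#     model="claude-3-5-sonnet-20241022",
#     max_tokens=1024,
#     system=cached_system_block(LONG_SYSTEM_PROMPT),
#     messages=[{"role": "user", "content": "..."}],
# )
```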
OpenAI's Batch API costs 50% less but processes asynchronously (24-hour SLA). Perfect for:
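Each line of the JSONL file uploaded to the Batch API is one self-contained request. A sketch of building such a line (model and token cap illustrative):

```python
import json

def batch_line(custom_id, prompt, model="gpt-4o-mini"):
    """One line of the JSONL file uploaded to OpenAI's Batch API.
    custom_id lets you match responses back to requests."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
        },
    })
```

Write one line per request to a file, upload it, and create a batch job; results arrive within the 24-hour window at half the price.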
Fine-tuned models typically cost somewhat more per token, but they need far shorter prompts and let a cheaper base model handle tasks that previously required a flagship.
Example: Customer support bot
Stream responses and stop generation when you have enough. Useful for classification tasks where answer appears in first 20 tokens.
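A sketch of the idea: consume the stream and bail out as soon as one of the expected labels appears (the label set and chunking here are illustrative; with a real SDK, closing the stream stops further generation and further output billing):

```python
LABELS = ("bug", "feature", "inquiry")

def classify_stream(chunks):
    """Read a token stream and stop as soon as a label shows up.
    `chunks` is any iterable of text pieces, e.g. an SDK stream."""
    seen = ""
    for piece in chunks:
        seen += piece
        for label in LABELS:
            if label in seen.lower():
                return label  # stop consuming: no need for the rest
    return None
```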
Provide concise tool descriptions. Avoid sending large tool outputs back to the model — summarize first.
Bad: Send entire database result (5,000 tokens) back to model
Good: Extract relevant fields (200 tokens) and send those
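A sketch of that trimming step, with hypothetical field names:

```python
def trim_tool_output(rows, fields=("id", "status", "total")):
    """Keep only the fields the model needs from a tool or database result.
    Sending whole rows back burns input tokens on data the model ignores."""
    return [{k: row[k] for k in fields if k in row} for row in rows]
```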
For high-volume, predictable workloads, self-hosting Llama 3.1 or Mixtral can be 10-50x cheaper.
Break-even analysis:
Only makes sense if you can maintain 70%+ GPU utilization.
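A rough sketch of the arithmetic; every number below is an assumption for illustration (API rate, GPU price, and throughput all vary widely), not a quote:

```python
# Illustrative assumptions, not real quotes:
API_COST_PER_M = 0.60            # $/M tokens, GPT-4o-mini-class pricing
GPU_COST_PER_HOUR = 2.00         # assumed hourly rate for one inference GPU
TOKENS_PER_GPU_HOUR = 30_000_000 # assumed batched throughput for a small Llama

def monthly_costs(tokens_per_month, utilization=0.7):
    """Compare API spend vs self-hosted GPU spend for one month."""
    api = tokens_per_month / 1e6 * API_COST_PER_M
    gpu_hours = tokens_per_month / (TOKENS_PER_GPU_HOUR * utilization)
    self_host = gpu_hours * GPU_COST_PER_HOUR
    return api, self_host
```

Under these assumptions, 1B tokens/month costs ~$600 via API but well under $100 self-hosted; at low utilization the GPU bill quickly erases that gap.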
Use keyword search (cheap) to filter candidates, then semantic search (embedding cost) on top 50 results.
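A minimal sketch of the two-stage approach, using keyword overlap for stage one and a toy Jaccard similarity standing in for embedding cosine similarity in stage two:

```python
def _toy_embed(text):
    # Placeholder for a real embedding model call.
    return set(text.lower().split())

def _toy_sim(a, b):
    # Jaccard overlap as a stand-in for cosine similarity on embeddings.
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_search(query, docs, embed=_toy_embed, sim=_toy_sim,
                  top_k=50, final_k=5):
    """Cheap keyword prefilter, then semantic rerank on the shortlist only."""
    terms = set(query.lower().split())
    # Stage 1: keyword overlap is free and prunes most candidates.
    scored = sorted(docs, key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    candidates = [d for d in scored[:top_k] if terms & set(d.lower().split())]
    # Stage 2: pay embedding cost only for the survivors.
    q = embed(query)
    return sorted(candidates, key=lambda d: sim(q, embed(d)),
                  reverse=True)[:final_k]
```

You pay embedding cost for at most top_k documents per query instead of the whole corpus.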
Initial setup:
After optimization:
New cost breakdown:
Track these metrics:
Tools: LangSmith, Helicone, Weights & Biases, custom logging to Datadog/Grafana.
Not if you route intelligently. GPT-4o-mini and Claude Haiku perform nearly as well as flagship models on 60-70% of tasks. Run A/B tests to validate quality before switching traffic. Start by routing 10% to cheaper models and measuring user satisfaction.
Compare engineering time cost vs. monthly savings. If optimization takes 40 hours ($5,000 in eng time) and saves $500/month, break-even is 10 months. For high-volume systems saving $5K+/month, ROI is usually under 3 months.
When your monthly API bill exceeds $2,000/month AND you have consistent utilization (not spiky traffic). Below that, API pricing is hard to beat due to their economies of scale. Also consider self-hosting for data privacy or latency-sensitive applications.
Anthropic Claude has the best explicit support (90% discount on cached reads via the cache_control marker). OpenAI applies automatic prompt caching, a 50% discount on repeated prompt prefixes over 1,024 tokens, with no code changes required. You can also implement semantic caching yourself with Redis + embeddings, and some third-party proxies (Helicone, Portkey) offer caching layers.
Weekly during growth phase, monthly once stable. Set alerts when daily spend exceeds 2x baseline. Review top 10 most expensive requests/conversations monthly to find optimization opportunities.
AI doesn't have to be prohibitively expensive. With smart architecture, model selection, and caching, you can deliver high-quality AI experiences at 20-40% of naive implementation costs.
Start with quick wins: Model routing and prompt compression can save 30-50% with minimal effort.
Measure everything: You can't optimize what you don't measure. Instrument your LLM calls from day one.
Never sacrifice quality for cost: Cheaper is only better if it maintains user satisfaction. A/B test everything.
At Propelius Technologies, we build cost-efficient AI agents and automation systems. Book a consultation to discuss optimizing your AI infrastructure.
© 2026 Propelius Technologies. All rights reserved.