
LLM costs spiral fast. What starts as $500/month in prototyping becomes $10K/month in production. Token prices seem small ($0.01 per 1K tokens), but at scale — millions of requests, long prompts, multiple model calls — the bills add up.
This guide covers proven strategies to cut LLM spend by 40-60% without sacrificing quality.
| Cost Driver | % of Bill | Optimization Potential |
|---|---|---|
| Model choice | 40-50% | High (switch models) |
| Prompt length | 20-30% | High (compression) |
| Output length | 15-25% | Medium (set max tokens) |
| Redundant calls | 10-20% | High (caching) |
| Failed requests | 5-10% | Medium (retries, validation) |
Not every task needs GPT-4. Match model capability to task complexity:
| Model | Cost (per 1M tokens) | Best For |
|---|---|---|
| GPT-4-turbo | $10 in / $30 out | Complex reasoning, code generation |
| GPT-3.5-turbo | $0.50 in / $1.50 out | Simple Q&A, classification, summaries |
| Claude Haiku | $0.25 in / $1.25 out | Fast responses, high-volume tasks |
| Llama 3.1 8B (self-hosted) | $0.10-0.20 total | When data privacy matters |
Task-based routing pattern:
```python
def route_to_model(task_type, complexity):
    if task_type == "code_generation" or complexity == "high":
        return "gpt-4-turbo"
    elif task_type == "classification":
        return "claude-haiku"   # Cheapest, fast
    else:
        return "gpt-3.5-turbo"  # Default workhorse

# Example
model = route_to_model("summarization", "medium")
response = llm.call(model=model, prompt=prompt)
```
Savings: per the table above, GPT-3.5-turbo costs roughly 5% of GPT-4-turbo per token, so routing 70% of traffic away from GPT-4 cuts the overall bill by roughly two-thirds.
Long prompts burn tokens. Every 1,000 characters costs ~250 tokens.
Example optimization:
❌ Before (450 tokens):
"You are a helpful assistant. The user is Alice, who is 32 years old and works as a software engineer at TechCorp. She enjoys hiking and photography. She lives in Seattle and has been with the company for 5 years..."
✅ After (120 tokens):
"User: Alice, 32, SWE @ TechCorp (5y). Seattle. Interests: hiking, photography."
Savings: 73% token reduction on system prompts.
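To verify reductions like this before shipping a compressed prompt, token counts can be measured directly; a small sketch using the tiktoken tokenizer with the example profile strings above:

```python
import tiktoken

# cl100k_base is the tokenizer used by the GPT-3.5/GPT-4 family
enc = tiktoken.get_encoding("cl100k_base")

verbose = ("You are a helpful assistant. The user is Alice, who is 32 years old "
           "and works as a software engineer at TechCorp. She enjoys hiking and "
           "photography. She lives in Seattle and has been with the company for 5 years.")
compact = "User: Alice, 32, SWE @ TechCorp (5y). Seattle. Interests: hiking, photography."

before, after = len(enc.encode(verbose)), len(enc.encode(compact))
print(f"{before} -> {after} tokens ({1 - after / before:.0%} reduction)")
```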
Cache responses at multiple levels: an exact-match cache catches repeated prompts cheaply (see the sketch below), and a semantic cache catches paraphrases.
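A minimal exact-match layer, sketched with Redis and a hash of the prompt as the cache key; `llm.call` and the TTL are placeholders:

```python
import hashlib
import redis

redis_client = redis.Redis()

def cached_completion(prompt, ttl_seconds=3600):
    # Key on a hash of the exact prompt text
    key = "llm:exact:" + hashlib.sha256(prompt.encode()).hexdigest()
    hit = redis_client.get(key)
    if hit is not None:
        return hit.decode()
    response = llm.call(prompt)                      # placeholder LLM client
    redis_client.setex(key, ttl_seconds, response)   # expire stale entries
    return response
```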
Cache by meaning, not exact match:
```python
import numpy as np
import redis
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
redis_client = redis.Redis()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cache_get(query, threshold=0.95):
    query_emb = embedder.encode(query)
    # Search cached queries for semantically similar ones
    for cached_query, cached_response in get_all_cached():   # cache helpers backed by redis_client
        cached_emb = embedder.encode(cached_query)
        if cosine_similarity(query_emb, cached_emb) > threshold:
            return cached_response
    return None

# Use before calling the LLM
def answer(user_query):
    cached = semantic_cache_get(user_query)
    if cached:
        return cached
    response = llm.call(user_query)
    cache_set(user_query, response)
    return response
```
Hit rate: 20-40% for FAQ-style applications.
Claude supports caching long system prompts — 90% cost reduction on cached portions:
```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,   # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[...]
)
```
Savings: cached portions are billed at roughly 10% of the normal input-token rate on cache reads.
Output tokens cost 3-5x as much as input tokens (see the pricing table above). Limit output aggressively:
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,
    max_tokens=150,   # Hard cap on output length
    temperature=0.3   # Lower temperature keeps answers focused
)
```
Guidance in prompt: "Answer in 2-3 sentences max" or "Respond with a JSON object only, no explanation."
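When JSON-only output is the goal, the format can also be enforced at the API level instead of relying on the prompt alone; a sketch assuming OpenAI's JSON mode, reusing the client created above (the prompt must still ask for JSON):

```python
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,                        # prompt should instruct: "Respond with a JSON object only"
    response_format={"type": "json_object"},  # constrains the model to emit valid JSON
    max_tokens=150
)
```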
Savings: Reducing average output from 500 → 200 tokens = 60% savings on output cost.
The OpenAI Batch API gives a 50% discount for non-urgent requests:
```python
# Submit a pre-uploaded JSONL file of requests; results are delivered within 24h at 50% cost
batch = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
```
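The `file_id` above refers to a JSONL file of requests uploaded in advance. A minimal sketch of building and uploading that file, assuming the current openai Python SDK (file names and `custom_id` values are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

# Each line is one chat completion request with a custom_id for matching results later
requests = [
    {"custom_id": f"doc-{i}",
     "method": "POST",
     "url": "/v1/chat/completions",
     "body": {"model": "gpt-3.5-turbo",
              "messages": [{"role": "user", "content": f"Summarize document {i}"}],
              "max_tokens": 150}}
    for i in range(3)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

uploaded = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
file_id = uploaded.id
```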
Use for: daily summaries, bulk classification, report generation.
For high-volume, low-complexity tasks, self-hosting an open-source model (e.g., Llama 3.1 8B from the table above) can undercut per-token API pricing. When to self-host: privacy requirements, >1M requests/month, or a predictable workload.
Track spend in real time:
```python
def track_llm_cost(model, input_tokens, output_tokens, user_id=None):
    cost = calculate_cost(model, input_tokens, output_tokens)

    # Log to metrics system
    metrics.increment("llm.cost", cost, tags={"model": model, "user": user_id})

    # Alert if user exceeds daily budget
    if user_id:
        daily_spend = get_user_daily_spend(user_id)
        if daily_spend > USER_DAILY_LIMIT:
            alert(f"User {user_id} exceeded daily limit: ${daily_spend}")
```
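The `calculate_cost` helper is not shown above; a minimal sketch using the per-1M-token prices from the model table earlier (prices change often, so treat them as illustrative):

```python
# USD per 1M tokens (input, output), taken from the comparison table above
PRICES = {
    "gpt-4-turbo":   (10.00, 30.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "claude-haiku":  (0.25, 1.25),
}

def calculate_cost(model, input_tokens, output_tokens):
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```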
Key metrics to track: cost per request, daily spend per user, spend per model, and cache hit rate.
Does cutting costs hurt quality? Not if done right. Model downgrading (GPT-4 → GPT-3.5) works for 70% of tasks with no quality loss. Prompt compression requires testing — aim for 30-50% reduction without cutting critical context. Cache aggressively for deterministic queries. Only sacrifice quality on low-value, high-volume tasks (e.g., tagging, simple classification).
Model selection (40-60% savings) beats everything else. Route simple tasks to GPT-3.5 or Claude Haiku instead of GPT-4. Second-best: caching (20-40% savings on redundant calls). Third: prompt compression (15-30% savings). Combine all three for 60-70% total reduction.
When does self-hosting pay off? Only at >500K-1M requests/month or when data privacy mandates it. AWS/GCP GPU instances cost $1.50-3/hour; break-even vs the OpenAI API happens around 500K-1M calls. Self-hosting adds operational burden (model updates, scaling, monitoring). Use managed APIs until cost justifies the infrastructure investment.
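As a rough worked example with assumed figures inside the ranges above (a $2/hour GPU and a blended API cost of $0.0025 per call), and ignoring the operational overhead just mentioned:

```python
# Rough break-even: managed API vs. one always-on self-hosted GPU instance (assumed figures)
gpu_cost_per_month = 2.00 * 24 * 30      # $2/hour instance, always on -> $1,440/month
api_cost_per_call = 0.0025               # assumed blended cost per request on a managed API
break_even = gpu_cost_per_month / api_cost_per_call
print(f"Break-even at roughly {break_even:,.0f} calls/month")  # ~576,000
```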
How do you enforce per-user budgets? Track cumulative cost per user per day/month. Set tiered limits: free tier ($0.50/day), paid tier ($5/day), enterprise (unlimited). When the limit is hit, show an upgrade prompt or rate-limit requests. Use Redis counters for real-time tracking. Alert ops when any user exceeds $20/day (potential abuse or bug).
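A minimal sketch of the Redis-counter approach, with key names, tiers, and expiry chosen for illustration:

```python
import redis
from datetime import date

redis_client = redis.Redis()
DAILY_LIMITS = {"free": 0.50, "paid": 5.00}   # USD per day; enterprise tier has no entry (unlimited)

def charge_user(user_id, tier, request_cost):
    key = f"llm:spend:{user_id}:{date.today().isoformat()}"
    spend = redis_client.incrbyfloat(key, request_cost)   # atomic running total for the day
    redis_client.expire(key, 86400 * 2)                   # keep counters a little past the day boundary
    limit = DAILY_LIMITS.get(tier)
    if limit is not None and spend > limit:
        raise RuntimeError(f"Daily budget exceeded for {user_id}: ${spend:.2f}")
    return spend
```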
How do you avoid paying for failed requests? Validate inputs before calling the LLM: check prompt length (< max context), sanitize user input (remove gibberish), and use retries with exponential backoff (not instant retries that burn tokens). For streaming, stop generation early if the output is garbage (use content filters). Set timeouts (30s max) to kill runaway requests. Failed requests still cost tokens — prevention > retry.
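A minimal sketch of the backoff-and-timeout guidance; `llm.call` is the same placeholder client used elsewhere in this guide, and the retry counts are assumptions:

```python
import random
import time

def call_with_backoff(prompt, max_retries=3, timeout_s=30):
    for attempt in range(max_retries):
        try:
            # Hard timeout kills runaway requests instead of paying for them
            return llm.call(prompt, timeout=timeout_s)   # placeholder LLM client
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter, not instant retries that burn tokens
            time.sleep((2 ** attempt) + random.random())
```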