We Open-Sourced the Math Behind LLM Cost Rankings
15 models. 6 providers. Weekly price tracking. Here's the formula.
Every LLM provider shows you $/1M tokens.
That number looks precise. It isn't.
$/1M tokens is not your monthly cost. It's a unit price detached from workload reality.
The difference between those two can be thousands of dollars per year. GPT-5.1 at $1.25/1M input looks cheap until you run 10,000 messages a day with 500-token outputs at $10.00/1M — and suddenly you're at $1,537/month.
Most teams only discover this after the invoice arrives.
We built LLM Cost Engine to calculate real monthly costs, deterministically. No opinions, no hidden weights, no vendor deals.
This is the exact math we use.
LLM Monthly Cost Estimation: The Math Behind Every Number
Four inputs, three price dimensions, one deterministic output.
M = Messages per day
Ti = Input tokens per message
To = Output tokens per message
Cr = Cache hit rate (0.00 - 1.00)
Daily cost:
C_input_fresh = (M × Ti × (1 - Cr)) / 1,000,000 × P_input
C_input_cached = (M × Ti × Cr) / 1,000,000 × P_cached
C_output = (M × To) / 1,000,000 × P_output
Monthly cost = (C_input_fresh + C_input_cached + C_output) × 30

That's it. This is the exact formula in our codebase. Every number in the calculator traces back to this math.
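The formula above translates directly into code. Here's a minimal sketch in TypeScript — the type names and the example prices are illustrative placeholders, not the project's actual dataset:

```typescript
interface Workload {
  messagesPerDay: number; // M
  inputTokens: number;    // Ti, input tokens per message
  outputTokens: number;   // To, output tokens per message
  cacheHitRate: number;   // Cr, 0.00–1.00
}

interface Pricing {
  input: number;  // P_input, $ per 1M tokens
  cached: number; // P_cached, $ per 1M tokens
  output: number; // P_output, $ per 1M tokens
}

const PER_MILLION = 1_000_000;

function monthlyCost(w: Workload, p: Pricing): number {
  // Fresh input: the share of input tokens that miss the cache
  const freshInput =
    ((w.messagesPerDay * w.inputTokens * (1 - w.cacheHitRate)) / PER_MILLION) * p.input;
  // Cached input: the share billed at the discounted cached price
  const cachedInput =
    ((w.messagesPerDay * w.inputTokens * w.cacheHitRate) / PER_MILLION) * p.cached;
  // Output tokens are never cached
  const output = ((w.messagesPerDay * w.outputTokens) / PER_MILLION) * p.output;
  return (freshInput + cachedInput + output) * 30;
}

// Example: 10,000 msgs/day, 200 in / 350 out, 30% cache,
// at $1.25 / $0.125 / $10.00 per 1M tokens → $1,104.75/month
const cost = monthlyCost(
  { messagesPerDay: 10_000, inputTokens: 200, outputTokens: 350, cacheHitRate: 0.3 },
  { input: 1.25, cached: 0.125, output: 10.0 },
);
```

Because the formula is pure arithmetic over four inputs, the output is fully deterministic: same workload, same prices, same number every time.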
LLM Cost Breakdown: 3 Real-World Scenarios
Same formula applied to three common production workloads. Numbers verified with the calculator above.
🎧 Customer Support Chatbot — 10K messages/day
200 input tokens · 350 output tokens · 30% cache hit rate
| Model | Input/mo | Output/mo | Total/mo |
|---|---|---|---|
| DeepSeek V3 | $12.60 | $115.50 | $128 |
| GPT-5.1 | $63.75 | $1,050.00 | $1,114 |
| Claude Sonnet 4.6 | $131.40 | $1,575.00 | $1,706 |
Same workload. Same formula. 13× cost variance between cheapest and most expensive model.
📚 RAG Knowledge Base — 1K messages/day
15,000 input tokens · 500 output tokens · 80% cache hit rate
| Model | Input/mo | Output/mo | Total/mo |
|---|---|---|---|
| DeepSeek V3 | $49.50 | $16.50 | $66 |
| Gemini 3 Flash | $90.00 | $45.00 | $135 |
| Claude Sonnet 4.6 | $378.00 | $225.00 | $603 |
An 80% cache rate keeps input costs low even at 15K tokens/request. The variance is still 9×, and without caching, input costs would be several times higher for every model.
⌨️ Internal Dev Productivity Bot — 500 messages/day
800 input tokens · 1,200 output tokens · 15% cache hit rate
| Model | Input/mo | Output/mo | Total/mo |
|---|---|---|---|
| DeepSeek V3 | $2.88 | $19.80 | $23 |
| GPT-5.1 | $13.88 | $180.00 | $194 |
| Claude Sonnet 4.6 | $31.14 | $270.00 | $301 |
High output ratio (1,200 tokens out) makes output cost dominate. 13× variance driven almost entirely by output price differences.
Run these scenarios with your own numbers — or adjust parameters to match your exact workload.
Try in the simulator →
Why $/Token Doesn't Reflect Real Monthly Usage
Most calculators treat all tokens equally. They shouldn't.
| Model | Input $/1M | Output $/1M | Ratio |
|---|---|---|---|
| GPT-5.1 | $1.25 | $10.00 | 8x |
| Claude Opus 4.6 | $5.00 | $25.00 | 5x |
| Gemini 3 Flash | $0.50 | $3.00 | 6x |
Output tokens cost 5–8× more than input across every major model. A chatbot generating long responses has a fundamentally different cost profile than a pipeline extracting short JSON. We separate them because real workloads aren't symmetric.
→ Try this scenario with your output/input ratio in the simulator
Prompt Caching: The Biggest Hidden Saving in LLM Deployments
Prompt caching discounts are the most overlooked cost lever in LLM pricing:
| Model | Standard Input | Cached Input | Discount |
|---|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $0.30 | 90% |
| Gemini 3 Flash | $0.50 | $0.125 | 75% |
| DeepSeek V3 | $0.28 | $0.028 | 90% |
| GPT-5.1 | $1.25 | $0.125 | 90% |
A support bot with a static system prompt hitting 80% cache rate pays dramatically less than one with dynamic prompts. If a model doesn't publish a cached price, we fall back to the standard input price. No assumptions.
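The blended input price is a simple weighted average. A sketch of the calculation, including the fallback rule described above (names are illustrative, not the project's actual source):

```typescript
// Effective (blended) input price per 1M tokens, given a cache hit rate.
// If a provider publishes no cached price, fall back to the standard price.
function effectiveInputPrice(
  standard: number,      // $ per 1M tokens
  cached: number | null, // null when no cached price is published
  cacheHitRate: number,  // 0.00–1.00
): number {
  const cachedPrice = cached ?? standard; // fallback: no assumed discount
  return standard * (1 - cacheHitRate) + cachedPrice * cacheHitRate;
}

// Claude Sonnet 4.6 at an 80% cache rate:
// $3.00 × 0.2 + $0.30 × 0.8 ≈ $0.84 per 1M input tokens
const blended = effectiveInputPrice(3.0, 0.3, 0.8);
```

Note how strongly the hit rate compounds with the discount: at a 90% discount, an 80% hit rate cuts effective input cost to roughly 28% of the list price.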
→ Check how cache hit rate changes your monthly cost estimate
LLM Cost Ranking: A Deterministic, Auditable Method
Cost alone doesn't determine the best model. A model at $0.01/month with an 8K context window and 3-second latency isn't "best value."
ValueScore = (1 / Cost)^0.65 × log10(Context)^0.35 × LatencyIndex

Three factors. Fixed weights. No manual overrides.
| Factor | Weight | Rationale |
|---|---|---|
| (1 / Cost)^0.65 | 65% | This is a cost calculator. Cost dominates. |
| log10(Context)^0.35 | 35% | Context matters, but 1M vs 2M is marginal. Log scale captures diminishing returns. |
| LatencyIndex | 0–1 (multiplier) | Sourced from benchmarks. Fast models score higher. |
The weights are named constants in a source file you can read: VALUESCORE_ALPHA = 0.65, VALUESCORE_BETA = 0.35. Not buried in logic. Intentionally transparent.
What ValueScore does NOT do:
- Measure output quality, reasoning, or coding ability
- Use subjective assessments or crowdsourced ratings
- Change based on who sponsors us (nobody does)
If two people enter the same inputs, they get the same ranking. Always.
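As a sketch, the formula above looks like this in TypeScript (the project's real source may differ in detail, but the math is as stated):

```typescript
const VALUESCORE_ALPHA = 0.65; // weight on inverse cost
const VALUESCORE_BETA = 0.35;  // weight on log-scale context size

// cost: $/month for the workload; context: window size in tokens;
// latencyIndex: 0–1, higher = faster.
function valueScore(cost: number, context: number, latencyIndex: number): number {
  return (
    Math.pow(1 / cost, VALUESCORE_ALPHA) *
    Math.pow(Math.log10(context), VALUESCORE_BETA) *
    latencyIndex
  );
}
```

The log10 term is what makes a jump from 1M to 2M context worth far less than a jump from 8K to 128K: each factor-of-ten adds the same amount, so doubling at the top barely moves the score.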
Transparent Methodology: Why We Show the Math
This is our design principle.
Benchmarking tools that hide their methodology ask you to trust them. We'd rather show you the math.
- The pricing dataset is public JSON: llm-pricing.json
- The ValueScore formula is four lines of TypeScript
- The weights are named constants, not magic numbers
- If you disagree with ALPHA = 0.65, fork the logic and set your own
A tool that produces different rankings depending on who is paying it isn't a tool — it's an ad.
LLM Cost Reduction via Smart Routing: The 80/20 Strategy
Not every query needs your most expensive model. Route 80% to GPT-5 Mini, 20% to Claude Sonnet 4.6:
($6.75 × 0.80) + ($45.00 × 0.20) = $14.40/mo vs $45.00/mo = 68% savings

Our simulator calculates this for any pair of models in the registry, in real time.
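The routing math is a two-term weighted sum over per-model monthly costs. A minimal sketch, with hypothetical function names:

```typescript
// Blend the monthly cost of a cheap model and a premium model,
// given the share of traffic routed to the cheap one.
function routedCost(cheapCost: number, premiumCost: number, cheapShare: number): number {
  return cheapCost * cheapShare + premiumCost * (1 - cheapShare);
}

// 80% to the cheap model, 20% to the premium model:
const blendedMonthly = routedCost(6.75, 45.0, 0.8); // ≈ $14.40/mo
const savings = 1 - blendedMonthly / 45.0;          // ≈ 0.68 vs. premium-only
```

The savings figure is always measured against running 100% of traffic on the premium model, which is the baseline most teams start from.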
Price Tracking: Weekly, Automated, Public
We snapshot all 15 models' pricing every Sunday via automated cron. Every change is recorded — even increases.
When a price drops ≥ 5%, subscribers get an automatic digest email. No account needed, just an email address with double opt-in.
The pricing dataset, detection logic, and alert threshold are public. Nothing is hidden behind an API.
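Drop detection reduces to comparing two weekly snapshots against the threshold. A sketch under the assumption that snapshots are keyed by model id (the shapes and names here are illustrative):

```typescript
const DROP_ALERT_THRESHOLD = 0.05; // alert on drops of 5% or more

interface Snapshot {
  [modelId: string]: { input: number; output: number }; // $ per 1M tokens
}

// Returns the ids of models whose input or output price dropped >= 5%
// between two snapshots. New models have no baseline and are skipped.
function detectDrops(prev: Snapshot, curr: Snapshot): string[] {
  const dropped: string[] = [];
  for (const id of Object.keys(curr)) {
    const before = prev[id];
    if (!before) continue; // newly added model: nothing to compare
    const inputDrop = (before.input - curr[id].input) / before.input;
    const outputDrop = (before.output - curr[id].output) / before.output;
    if (inputDrop >= DROP_ALERT_THRESHOLD || outputDrop >= DROP_ALERT_THRESHOLD) {
      dropped.push(id);
    }
  }
  return dropped;
}
```

Price increases fall through the threshold check and are simply recorded, not alerted — which matches the policy above of logging every change either way.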
Get notified when prices drop ≥5%
Free. Double opt-in. Unsubscribe anytime.
What We Don't Do
- No live latency benchmarking. Our latency index comes from provider docs and third-party data, not our own measurements. Artificial Analysis does this better.
- No output quality evaluation. ValueScore is a cost metric, not a capability benchmark. Always test with your actual prompts.
- No vendor partnerships. We don't receive compensation from any LLM provider. All pricing is sourced from official pages and verified weekly.
- No rate limit modeling. The cheapest model might throttle at 1,000 RPM. We don't capture this.
Transparency requires admitting what you don't measure.
Try It
If you're making model decisions based on pricing tables alone, you're likely underestimating real deployment cost.
Enter your workload. Compare up to 5 models. Enable routing. Run sensitivity at 2x and 3x.
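The 2x/3x sensitivity check works because the cost formula is linear in messages per day: scaling traffic scales the monthly total by the same factor. A one-function sketch:

```typescript
// Scale a base monthly cost across traffic-growth scenarios.
// Valid because monthly cost is linear in messages/day.
function sensitivity(baseMonthlyCost: number, factors: number[] = [1, 2, 3]): number[] {
  return factors.map((f) => baseMonthlyCost * f);
}

// A $100/mo workload at 1x, 2x, and 3x traffic → [100, 200, 300]
const scenarios = sensitivity(100);
```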
Open the Calculator
If you disagree with the weights, fork the math.