We Open-Sourced the Math Behind LLM Cost Rankings

15 models. 6 providers. Weekly price tracking. Here's the formula.


Every LLM provider shows you $/1M tokens.

That number looks precise. It isn't.

$/1M tokens is not your monthly cost. It's a unit price detached from workload reality.

The difference between those two can be thousands of dollars per year. GPT-5.1 at $1.25/1M input looks cheap until you run 10,000 messages a day with 500-token outputs at $10.00/1M — and suddenly you're at $1,537/month.

Most teams only discover this after the invoice arrives.

We built LLM Cost Engine to calculate real monthly costs, deterministically. No opinions, no hidden weights, no vendor deals.

This is the exact math we use.


LLM Monthly Cost Estimation: The Math Behind Every Number

Four inputs, three price dimensions, one deterministic output.

M  = Messages per day
Ti = Input tokens per message
To = Output tokens per message
Cr = Cache hit rate (0.00 - 1.00)

Daily cost:
  C_input_fresh  = (M × Ti × (1 - Cr)) / 1,000,000 × P_input
  C_input_cached = (M × Ti × Cr)       / 1,000,000 × P_cached
  C_output       = (M × To)            / 1,000,000 × P_output

Monthly cost = (C_input_fresh + C_input_cached + C_output) × 30

That's it. This is the exact formula in our codebase. Every number in the calculator traces back to this math.
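For reference, the same math as a small TypeScript sketch. Field and function names here are illustrative, not lifted from the codebase; only the formula itself comes from the definitions above.

```typescript
// Minimal sketch of the monthly cost formula above.
// Workload: M messages/day, Ti input and To output tokens per message,
// Cr cache hit rate. Pricing: $/1M tokens for fresh, cached, and output.
interface Workload {
  messagesPerDay: number; // M
  inputTokens: number;    // Ti
  outputTokens: number;   // To
  cacheHitRate: number;   // Cr, 0.00-1.00
}

interface Pricing {
  input: number;  // P_input,  $/1M fresh input tokens
  cached: number; // P_cached, $/1M cached input tokens
  output: number; // P_output, $/1M output tokens
}

function monthlyCost(w: Workload, p: Pricing) {
  const PER = 1_000_000;
  const inputFresh  = (w.messagesPerDay * w.inputTokens * (1 - w.cacheHitRate)) / PER * p.input;
  const inputCached = (w.messagesPerDay * w.inputTokens * w.cacheHitRate) / PER * p.cached;
  const output      = (w.messagesPerDay * w.outputTokens) / PER * p.output;
  return {
    input:  (inputFresh + inputCached) * 30, // monthly input cost
    output: output * 30,                     // monthly output cost
    total:  (inputFresh + inputCached + output) * 30,
  };
}
```

Plugging in the support-chatbot scenario below for Claude Sonnet 4.6 ($3.00 fresh, $0.30 cached, and the $15.00/1M output price those figures imply) reproduces $131.40 input and $1,575.00 output per month.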

LLM Cost Breakdown: 3 Real-World Scenarios

Same formula applied to three common production workloads. Numbers verified with the calculator above.

🎧 Customer Support Chatbot — 10K messages/day

200 input tokens · 350 output tokens · 30% cache hit rate

Model               Input/mo    Output/mo    Total/mo
DeepSeek V3         $12.60      $115.50      $128
GPT-5.1             $63.75      $1,050.00    $1,114
Claude Sonnet 4.6   $131.40     $1,575.00    $1,706

Same workload. Same formula. 13× cost variance between cheapest and most expensive model.

📚 RAG Knowledge Base — 1K messages/day

15,000 input tokens · 500 output tokens · 80% cache hit rate

Model               Input/mo    Output/mo    Total/mo
DeepSeek V3         $49.50      $16.50       $66
Gemini 3 Flash      $90.00      $45.00       $135
Claude Sonnet 4.6   $378.00     $225.00      $603

80% cache rate keeps input costs low even at 15K tokens/request. 9× variance — but without cache, it would be 5× worse.

⌨️ Internal Dev Productivity Bot — 500 messages/day

800 input tokens · 1,200 output tokens · 15% cache hit rate

Model               Input/mo    Output/mo    Total/mo
DeepSeek V3         $2.88       $19.80       $23
GPT-5.1             $13.88      $180.00      $194
Claude Sonnet 4.6   $31.14      $270.00      $301

High output ratio (1,200 tokens out) makes output cost dominate. 13× variance driven almost entirely by output price differences.

Run these scenarios with your own numbers — or adjust parameters to match your exact workload.

Try in the simulator →

Why $/Token Doesn't Reflect Real Monthly Usage

Most calculators treat all tokens equally. They shouldn't.

Model             Input $/1M   Output $/1M   Ratio
GPT-5.1           $1.25        $10.00        8x
Claude Opus 4.6   $5.00        $25.00        5x
Gemini 3 Flash    $0.50        $3.00         6x

Output tokens cost 5–8x more than input across every major model. A chatbot generating long responses has a fundamentally different cost profile than a pipeline extracting short JSON. We separate input and output pricing because real workloads aren't symmetric.
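To see why the split matters, here is a hypothetical comparison at GPT-5.1's listed prices: two workloads with the same 1M total tokens per day, shaped differently.

```typescript
// Daily cost at GPT-5.1 list prices ($1.25/1M input, $10.00/1M output).
function dailyCostUSD(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * 1.25 + (outputTokens / 1e6) * 10.0;
}

// Same 1M tokens/day, opposite shapes:
const chatbot    = dailyCostUSD(300_000, 700_000); // long generated responses
const extraction = dailyCostUSD(900_000, 100_000); // short JSON outputs
// chatbot ≈ $7.38/day vs extraction ≈ $2.13/day: ~3.5x apart on identical volume.
```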

Try this scenario with your output/input ratio in the simulator

Prompt Caching: The Biggest Hidden Saving in LLM Deployments

Prompt caching discounts are the most overlooked cost lever in LLM pricing:

Model               Standard Input   Cached Input   Discount
Claude Sonnet 4.6   $3.00            $0.30          90%
Gemini 3 Flash      $0.50            $0.125         75%
DeepSeek V3         $0.28            $0.028         90%
GPT-5.1             $1.25            $0.125         90%

A support bot with a static system prompt hitting 80% cache rate pays dramatically less than one with dynamic prompts. If a model doesn't publish a cached price, we fall back to the standard input price. No assumptions.
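The effective input price under caching is just a weighted average, with a fallback when no cached price is published. A sketch (the function name is ours, not from the codebase):

```typescript
// Effective $/1M input tokens at a given cache hit rate.
// If a model publishes no cached price, fall back to the standard price,
// i.e. caching is assumed to save nothing rather than something invented.
function effectiveInputPrice(
  standard: number,
  cached: number | undefined,
  hitRate: number
): number {
  const cachedPrice = cached ?? standard; // fallback: no published cached price
  return standard * (1 - hitRate) + cachedPrice * hitRate;
}
```

Claude Sonnet 4.6 at an 80% hit rate: 0.2 × $3.00 + 0.8 × $0.30 = $0.84/1M, a 72% effective discount on input.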

Check how cache hit rate changes your monthly cost estimate


LLM Cost Ranking: A Deterministic, Auditable Method

Cost alone doesn't determine the best model. A model at $0.01/month with an 8K context window and 3-second latency isn't "best value."

ValueScore = (1 / Cost)^0.65 × log10(Context)^0.35 × LatencyIndex

Three factors. Fixed weights. No manual overrides.

Factor                Weight   Rationale
(1 / Cost)^0.65       65%      This is a cost calculator. Cost dominates.
log10(Context)^0.35   35%      Context matters, but 1M vs 2M is marginal. Log scale captures diminishing returns.
LatencyIndex          0–1      Sourced from benchmarks. Fast models score higher.

The weights are named constants in a source file you can read: VALUESCORE_ALPHA = 0.65, VALUESCORE_BETA = 0.35. Not buried in logic. Intentionally transparent.

What ValueScore does NOT do:

  • Measure output quality, reasoning, or coding ability
  • Use subjective assessments or crowdsourced ratings
  • Change based on who sponsors us (nobody does)

If two people enter the same inputs, they get the same ranking. Always.
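A sketch consistent with the definition above. The constant names come from the article; everything else is illustrative.

```typescript
const VALUESCORE_ALPHA = 0.65; // cost exponent
const VALUESCORE_BETA = 0.35;  // context exponent

// monthlyCost in $, contextTokens as a raw token count, latencyIndex in [0, 1].
function valueScore(
  monthlyCost: number,
  contextTokens: number,
  latencyIndex: number
): number {
  return (
    Math.pow(1 / monthlyCost, VALUESCORE_ALPHA) *
    Math.pow(Math.log10(contextTokens), VALUESCORE_BETA) *
    latencyIndex
  );
}
```

Determinism falls out for free: the function is pure, so identical inputs always yield identical rankings.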


Transparent Methodology: Why We Show the Math

This is our design principle.

Benchmarking tools that hide their methodology ask you to trust them. We'd rather show you the math.

  • The pricing dataset is public JSON: llm-pricing.json
  • The ValueScore formula is four lines of TypeScript
  • The weights are named constants, not magic numbers
  • If you disagree with ALPHA = 0.65, fork the logic and set your own

A tool that produces different rankings depending on who is paying it isn't a tool — it's an ad.


LLM Cost Reduction via Smart Routing: The 80/20 Strategy

Not every query needs your most expensive model. Suppose a workload costs $6.75/mo on GPT-5 Mini and $45.00/mo on Claude Sonnet 4.6. Route 80% of queries to the cheap model and 20% to the expensive one:

($6.75 × 0.80) + ($45.00 × 0.20) = $14.40/mo  vs  $45.00/mo = 68% savings

Our simulator calculates this for any pair of models in the registry, in real-time.
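The blend above is a weighted average of per-model monthly totals for the same workload. A sketch (function names are ours):

```typescript
// Blend two per-model monthly costs by routing share.
function blendedMonthlyCost(
  cheapMonthly: number,   // e.g. GPT-5 Mini total for the workload
  premiumMonthly: number, // e.g. Claude Sonnet 4.6 total for the same workload
  cheapShare: number      // fraction of queries routed to the cheap model
): number {
  return cheapMonthly * cheapShare + premiumMonthly * (1 - cheapShare);
}

// Fraction saved versus sending everything to the premium model.
function routingSavings(premiumMonthly: number, blended: number): number {
  return 1 - blended / premiumMonthly;
}
```

blendedMonthlyCost(6.75, 45.0, 0.8) gives $14.40/mo, and routingSavings(45.0, 14.4) gives 0.68, matching the 68% above.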


Price Tracking: Weekly, Automated, Public

We snapshot all 15 models' pricing every Sunday via automated cron. Every change is recorded — even increases.

When a price drops ≥ 5%, subscribers get an automatic digest email. No account needed, just an email address with double opt-in.

The pricing dataset, detection logic, and alert threshold are public. Nothing is hidden behind an API.
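The drop rule itself is a one-liner; a sketch with the 5% threshold as a named constant (names here are hypothetical, not from the codebase):

```typescript
const ALERT_THRESHOLD = 0.05; // alert when a price drops by 5% or more

// Fractional change between weekly snapshots; positive = price drop.
function dropFraction(previous: number, current: number): number {
  return (previous - current) / previous;
}

function shouldAlert(previous: number, current: number): boolean {
  return dropFraction(previous, current) >= ALERT_THRESHOLD;
}
```

Increases produce a negative dropFraction, so they are recorded in the dataset but never trigger an email.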

Get notified when prices drop ≥5%

Free. Double opt-in. Unsubscribe anytime.

Subscribe to Price Alerts

What We Don't Do

  1. No live latency benchmarking. Our latency index comes from provider docs and third-party data, not our own measurements. Artificial Analysis does this better.
  2. No output quality evaluation. ValueScore is a cost metric, not a capability benchmark. Always test with your actual prompts.
  3. No vendor partnerships. We don't receive compensation from any LLM provider. All pricing is sourced from official pages and verified weekly.
  4. No rate limit modeling. The cheapest model might throttle at 1,000 RPM. We don't capture this.

Transparency requires admitting what you don't measure.


Try It

If you're making model decisions based on pricing tables alone, you're likely underestimating real deployment cost.

Enter your workload. Compare up to 5 models. Enable routing. Run sensitivity at 2x and 3x.

Open the Calculator

If you disagree with the weights, fork the math.

Pricing data: v2.1.0, 15 models, 6 providers. Sourced from official pricing pages, verified 2026-02-23. Updated weekly.