AI FinOps · Precision Economics · 2026 Data

The New Currency
of Enterprise AI

Every AI interaction costs tokens. Every token carries a price. This guide breaks down how token economics work — and gives you the exact levers to cut spend by 40–80% without sacrificing quality.

70–80% Realistic cost savings with combined optimisation levers
98% of enterprises now using FinOps for AI spend (FinOps Foundation 2026)
$0.04→$0.01 Projected inference cost per MTok drop by 2030 (Deloitte)
80% of companies miss AI infrastructure cost forecasts by >25% (Finout 2026)
130× Google token volume growth in one year
90% Input cost reduction via prompt caching
84B Annual tokens where AI factory beats API (Deloitte TCO)
56% of CEOs report no AI revenue or cost benefit yet (PwC 2026)

What is a Token?

Tokens are the atomic unit of AI computation — not characters, not words, but sub-word fragments that every LLM uses to read input and write output. Understanding them is understanding the bill.

1 token ≈ ¾ of an English word

Common words are a single token. Complex or rare words split into several. Code is the most token-dense text: heavily punctuated code can approach 1 token per character.

Output tokens cost 4–10× more

GPT-5: $1.25 input vs $10.00 output per MTok. A model that writes a lot costs far more than one that reads a lot.

Reasoning models have hidden token cost

Models like o1 or Claude Extended Thinking generate internal chain-of-thought tokens that are billed but never shown to you.

The full conversation is resent every call

All prior messages + system prompt + new message are sent on every API call. A 40-turn chat can carry 25,000+ input tokens silently.
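To see how this compounds, here is a minimal sketch with hypothetical message sizes (a 500-token system prompt, 80-token user turns, 120-token replies):

```python
# Sketch: input tokens billed across a chat when the full history is
# resent on every call (all per-message sizes are hypothetical).
def cumulative_input_tokens(turns, system=500, user=80, assistant=120):
    """Total input tokens billed over `turns` calls."""
    total = 0
    context = system                 # system prompt rides along on every call
    for _ in range(turns):
        context += user              # new user message joins the context
        total += context             # the whole context is billed as input
        context += assistant         # the reply joins the history for next turn
    return total

print(cumulative_input_tokens(5))    # a short chat stays cheap
print(cumulative_input_tokens(40))   # a 40-turn chat bills ~37x more input
```

The growth is quadratic in the number of turns, which is why history management (lever 03 below) matters so much.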

How One API Call Consumes Tokens

1
INPUT
System Prompt

Your instructions to the model. 200–2,000 tokens. Sent on every single call — prime candidate for prompt caching.

2
INPUT
Chat History

All prior messages in the conversation. Grows every turn. The silent multiplier — often the largest input cost driver.

3
INPUT
User Message + Context

The actual query plus any retrieved documents (RAG). Usually the smallest component but grows with retrieval.

4
OUTPUT 4–10×
Model Response

What the model writes back. Billed at 4–10× the input rate, so response length is the biggest per-call cost driver. The most expensive component per token.
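Putting the four components together, a call's cost is a weighted sum. A minimal sketch, using GPT-5's listed rates from the pricing table below; all token counts are hypothetical:

```python
# Sketch: cost of one API call from its four components, using GPT-5's
# listed rates ($1.25 in / $10.00 out per MTok); token counts are made up.
def call_cost(system_t, history_t, user_t, output_t,
              in_rate=1.25, out_rate=10.00):
    input_tokens = system_t + history_t + user_t   # steps 1-3 are all input
    return (input_tokens * in_rate + output_t * out_rate) / 1_000_000

# 1,000-token system prompt, 20,000-token history, 500-token query, 400-token reply
cost = call_cost(1_000, 20_000, 500, 400)
print(f"${cost:.4f} per call")
```

Note how history dominates the input volume while output dominates the rate: two different levers for two different cost drivers.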

Three Ways to Buy AI Tokens

📦

Packaged SaaS

Per-seat subscription (e.g. Microsoft Copilot ~$30/user/mo). Tokens are invisible — bundled into the vendor's price. Low control, high simplicity. Risk: you cannot optimise what you cannot see.

Low visibility · Predictable cost
🔌

API Access

Pay per token, metered in real time. Full visibility, full volatility. Costs scale with every prompt length decision and model choice you make. Best for builders who want control.

Full transparency · Variable cost
🏭

AI Factory (Self-hosted)

Own or co-locate the GPUs. Tokens emerge from capex decisions. Maximum sovereignty, lowest per-token cost at scale (84B+ tokens/year). Requires MLOps capability and significant upfront investment.

Maximum control · High capex

Token Prices Across Providers

USD per million tokens (MTok) — April 2026. Output tokens cost significantly more than input. Always model the input:output ratio for your specific workload before comparing.

Model Provider Tier Input $/MTok Output $/MTok Cached Input Context Best For
GPT-5.2 Pro OpenAI Flagship $21.00 $168.00 $2.10 200K Hardest reasoning, executive tasks
Claude Opus 4.6 Anthropic Flagship $5.00 $25.00 $0.30 200K Risk reviews, high-stakes analysis
o1 OpenAI Reasoning $15.00 $60.00 $7.50 200K Complex multi-step logic, planning
GPT-5.2 OpenAI Mid-Tier $1.75 $14.00 $0.175 200K Coding, agentic workflows
GPT-5 OpenAI Mid-Tier $1.25 $10.00 $0.125 128K General flagship, copilots
Claude Sonnet 4.6 Anthropic Mid-Tier $3.00 $15.00 $0.30 1M Enterprise copilots, knowledge workflows
Gemini 2.5 Pro Google Mid-Tier $1.25 $10.00 $0.31 1M Multimodal, long-context analysis
GPT-4.1 OpenAI Mid-Tier $2.00 $8.00 $0.20 1M Product UIs, multi-turn workflows
GPT-5 Mini OpenAI Fast & Cheap $0.25 $2.00 $0.025 200K Automation, high-volume batch
Claude Haiku 4.5 Anthropic Fast & Cheap $0.80 $4.00 $0.08 200K Realtime copilots, support bots
Gemini 2.5 Flash Google Fast & Cheap $0.30 $2.50 $0.03 1M High-volume summarisation, triggered automations
DeepSeek V3.2 DeepSeek Fast & Cheap $0.28 $0.42 $0.028 128K Best value per token, 90% cache discounts
Gemini 2.0 Flash-Lite Google Fast & Cheap $0.075 $0.30 — 1M Cheapest mainstream option, simple tasks
Llama 4 Maverick Meta Open Source $0.15 $0.60 — 1M Open weights, fine-tuning, sovereignty
Llama 3.3 70B Meta Open Source $0.10 $0.32 — 131K 5–14× cheaper than GPT-4o at comparable quality
Mistral Small 3.2 Mistral Open Source $0.07 $0.20 — 128K Ultra-low cost European open model
Mistral Nemo Mistral Open Source $0.02 $0.04 — 131K Absolute lowest API cost, simple extraction tasks
Prices from TLDL LLM Pricing 2026, IntuitionLabs Comparison, PricePerToken.com. Prices change frequently — verify directly with each provider before budgeting.

Real-World Cost Scenarios

Chatbot — 800 in / 400 out tokens/turn · 10K users · 20 turns/day
Gemini 2.0 Flash · $14/mo
DeepSeek V3.2 · $23/mo
GPT-5 Mini · $60/mo
Claude Haiku 4.5 · $168/mo
Claude Sonnet 4.6 · $504/mo
RAG Pipeline — 8,000 in / 800 out per query · 50K queries/month
DeepSeek V3.2 · $128/mo
Gemini 2.5 Flash · $220/mo
GPT-5 Mini · $180/mo
Gemini 2.5 Pro · $900/mo
Claude Sonnet 4.6 · $1,800/mo
Code Generation — 2,000 in / 1,500 out per request · 500 req/day
Llama 3.3 70B · ~$20/mo
DeepSeek V3.2 · $18/mo
GPT-5.2 · $367/mo
Claude Sonnet 4.6 · $427/mo
Claude Opus 4.6 · $712/mo

10 Levers to Cut Token Spend

These are proven, production-tested techniques. Combined, organisations routinely achieve 40–80% cost reduction. Start with the high-impact ones — they require minimal engineering effort.

01
Biggest Lever

Prompt Caching (KV Cache)

Providers cache the key-value matrices of repeated prompt prefixes. This is the single highest-impact optimisation — up to 90% cheaper on cached input tokens with minimal code changes.

2026 Provider Cache Pricing
  • Anthropic (Claude): cached reads $0.30/MTok vs $3.00 base input (90% off); cache writes carry a ~25% premium over base input
  • OpenAI (auto-caching): 50% off input tokens, no code change needed
  • AWS Bedrock: Up to 90% cost reduction, 85% latency reduction
  • Minimum 1,024 tokens needed; TTL ~5 minutes standard
  • Break-even at just 1.4 cache reads per cached prefix
Implementation: Place stable content first (system prompt, docs, tool definitions) and dynamic content last (user queries, session data). On Anthropic, use the cache_control parameter for explicit breakpoints. Target a 70%+ cache hit rate. Cached reads don't count against rate limits (Claude 3.7+).
Up to 90% off input tokens
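A minimal request sketch showing the cache_control breakpoint pattern. The block structure follows Anthropic's documented mechanism; the model id and system text are placeholders:

```python
# Sketch of an Anthropic Messages payload with an explicit cache breakpoint.
# Stable content goes first and is marked cacheable; dynamic content goes last.
def build_cached_request(stable_system_text, user_query):
    return {
        "model": "claude-sonnet-example",            # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": stable_system_text,              # stable prefix first
                "cache_control": {"type": "ephemeral"},  # cache up to here
            }
        ],
        "messages": [
            {"role": "user", "content": user_query}      # dynamic content last
        ],
    }

req = build_cached_request("You are a support agent. <policy docs>", "Where is my order?")
print(req["system"][0]["cache_control"])
```

Every call that repeats the same stable prefix then bills those tokens at the cached-read rate instead of the full input rate.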
02
High Impact

Right-Size Model Selection (Routing)

Route tasks to the cheapest model that meets the quality bar. A cascade strategy sends simple tasks to Flash/Haiku/Mini and escalates to Sonnet/GPT-5 only when required. FAQ bots do not need frontier models.

Routing framework
  • Classifier layer (cheap model) assesses query complexity
  • Simple (FAQ, classification, extraction) → Haiku / Flash-Lite
  • Standard (summarisation, drafting) → Sonnet / GPT-5 Mini
  • Complex (analysis, reasoning, code) → GPT-5 / Claude Opus
Implementation: Tools like Portkey and LiteLLM support rule-based and ML-based routing with millisecond overhead. Llama 3.3 70B is 5–14× cheaper than GPT-4o at comparable quality for most tasks. Open-source models via Groq or Together AI at $0.10/MTok deliver near-zero marginal cost at scale.
Up to 60–80% reduction with smart routing
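The cascade above can be sketched as a simple rule-based router. The keyword heuristics and model names here are illustrative stand-ins for a real classifier layer:

```python
# Minimal rule-based router sketch: cheapest capable model per complexity tier.
ROUTES = {
    "simple":   "claude-haiku",   # FAQ, classification, extraction
    "standard": "gpt-5-mini",     # summarisation, drafting
    "complex":  "gpt-5",          # analysis, reasoning, code
}

def classify(query: str) -> str:
    q = query.lower()
    if any(k in q for k in ("why", "analyse", "design", "debug", "prove")):
        return "complex"
    if any(k in q for k in ("summarise", "draft", "rewrite")):
        return "standard"
    return "simple"

def route(query: str) -> str:
    return ROUTES[classify(query)]

print(route("What are your opening hours?"))       # cheap tier
print(route("Summarise this meeting transcript"))  # mid tier
print(route("Debug this stack trace"))             # frontier tier
```

Production gateways replace classify() with a trained classifier or a cheap LLM call, but the routing table itself stays this simple.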
03
High Impact

Truncate & Summarise Conversation History

Every message in a multi-turn chat is resent on every call. A 40-turn conversation carries 25,000+ input tokens — the largest invisible cost driver. Sliding window truncation + summarisation eliminates this.

Implementation patterns
  • Sliding window: Keep only last N turns
  • Summarisation: After 10 turns, compress older history to 1 paragraph
  • Structured state: Extract facts into a key-value store, not raw chat
  • Topic clear: Reset context on new user intent/topic
  • Max messages limit: Hard-cap turns per session in agent platforms
Implementation: Use a cheap model (Haiku, Flash-Lite) to summarise older context before the main call. The summarisation cost is negligible vs the savings. Context engines with this pattern achieve 40–60% input reduction.
40–70% context token reduction
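A minimal sketch of the sliding-window plus summarisation pattern, with summarise() standing in for a call to a cheap model:

```python
# Sketch: keep the last N turns verbatim, compress everything older.
# summarise() is a placeholder for a call to a cheap model (Haiku/Flash-Lite).
def summarise(messages):
    return {"role": "system",
            "content": f"[summary of {len(messages)} earlier messages]"}

def compact_history(messages, keep_last=6):
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    return [summarise(older)] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
compacted = compact_history(history)
print(len(history), "->", len(compacted))   # 20 messages become 7
```

Run compact_history before every call: the model still sees recent turns verbatim plus a one-paragraph digest of the rest.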
04
High Impact

Semantic Caching

Instead of caching exact strings, semantic caching uses vector embeddings to serve cached responses when queries are semantically similar (≥0.85–0.95 cosine similarity). Avoids the API call entirely for repetitive queries.

2026 production benchmarks
  • Cache hit rates: 60–85% in support/FAQ workloads
  • API call reduction: up to 68.8% fewer calls
  • Latency: 1.67s → 0.052s per cache hit (96.9% faster)
  • Cost reduction: up to 73% on conversational workloads
  • 31% of LLM queries show semantic similarity — often untapped
Implementation: Redis with LangCache, Portkey, Helicone, or Bifrost all support semantic caching in 2026 via one-line gateway integration. Namespace by model + provider to avoid cross-contamination. Skip caching for conversations exceeding ~10 turns to reduce false positives.
50–73% on semantically repetitive workloads
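A toy sketch of the flow, using bag-of-words cosine similarity in place of the neural embeddings a real system (Redis LangCache and peers) would use:

```python
# Toy semantic cache sketch: serve a stored response when a new query is
# similar enough to a cached one, skipping the API call entirely.
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())   # stand-in for a neural embedding

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []   # (embedding, response) pairs

    def get(self, query):
        qe = embed(query)
        for emb, resp in self.entries:
            if cosine(qe, emb) >= self.threshold:
                return resp          # hit: no API call needed
        return None                  # miss: call the model, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password please"))  # near-duplicate: hit
print(cache.get("what is your refund policy"))         # unrelated: miss
```

Real deployments swap in vector indexes for the linear scan and namespace entries by model and provider, as noted above.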
05
Medium Impact

Concise Prompt Engineering

Verbose prompts don't produce better outputs — they just cost more. Filler phrases like "You are a helpful, professional, knowledgeable assistant" add tokens with zero information gain. Telling a model to "be concise" reduces output tokens by 57–59%.

Before → After
You are a helpful, knowledgeable, friendly, professional customer support agent who always responds in a polite and courteous manner and ensures customer satisfaction...
↓ 55% fewer tokens, same output quality
You are a customer support agent. Be concise and accurate.
Implementation: Audit every system prompt. Remove filler, pleasantries, and redundant instructions. LLMLingua prompt compression achieves 20× token reduction with only 1.5% quality loss — available as a plug-and-play LangChain/LlamaIndex integration.
30–55% prompt size reduction
06
Medium Impact

Constrain Output Length & Format

Output tokens cost 4–10× more than input. Setting max_tokens limits and requesting structured JSON instead of prose cuts the most expensive part of every call.

Control techniques
  • Set explicit max_tokens parameter per call type
  • Request JSON/structured output schemas: ~15% token reduction vs prose
  • Use stop sequences to prevent unnecessary continuation
  • Request bullet responses: "Answer in 3 bullet points"
  • Use "be concise" instruction: 57–59% output reduction (OPSDC research)
Implementation: Map output length requirements to use-case type. FAQ = 1–2 sentences. Summary = 3–5 bullets. Analysis = structured JSON. Set max_tokens as a hard upper bound for each use-case tier.
15–60% output token reduction
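A minimal sketch of per-use-case output caps applied as request parameters; the tier limits and model name are illustrative:

```python
# Sketch: map use-case tiers to hard output caps and apply them per request.
OUTPUT_CAPS = {"faq": 100, "summary": 250, "analysis": 800}   # illustrative

def build_request(use_case, prompt, model="gpt-5-mini"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": OUTPUT_CAPS[use_case],   # hard upper bound on output
        "stop": ["\n\n\n"],                    # stop sequence vs rambling
    }

req = build_request("faq", "What are your opening hours?")
print(req["max_tokens"])
```

The cap is a ceiling, not a target; pair it with a "be concise" instruction so the model does not simply hit the limit mid-sentence.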
07
Medium Impact

RAG Context Compression

RAG pipelines often dump entire document chunks into context — including sentences irrelevant to the query. Pre-filtering retrieved chunks to only query-relevant sentences cuts RAG input tokens by 50–80% with maintained accuracy.

Optimised RAG pipeline
  • Retrieve top-K chunks (e.g. 10 chunks)
  • Score each sentence for query relevance (cheap model)
  • Pass only high-relevance sentences (~20% of retrieved text)
  • Reduce Top-K from 10 to 3–5 via hybrid search
  • Result: same answer quality, 50–80% fewer tokens
Implementation: Use a small model (Flash-Lite, Haiku) or a BM25/reranker to score and filter context before passing to the main LLM. One production case study reported cost per contract falling to $0.91, alongside a 40% latency reduction.
25–80% RAG context reduction
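A minimal sketch of sentence-level filtering, with word overlap standing in for a cheap-model or reranker relevance score:

```python
# Sketch: score each retrieved sentence against the query and keep only
# relevant ones. Word overlap stands in for a real reranker score.
def relevance(sentence, query):
    s, q = set(sentence.lower().split()), set(query.lower().split())
    return len(s & q) / len(q) if q else 0.0

def compress_context(chunks, query, min_score=0.2):
    kept = []
    for chunk in chunks:
        for sent in chunk.split(". "):
            if relevance(sent, query) >= min_score:
                kept.append(sent)
    return ". ".join(kept)

chunks = [
    "The refund window is 30 days. Our office is in Berlin.",
    "Refunds are issued to the original payment method. The CEO joined in 2019.",
]
print(compress_context(chunks, "refund window days"))
```

Only the query-relevant sentence survives; the office location and CEO trivia never reach the main model's context.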
08
Medium Impact

Batch Processing

Most providers offer 50% discounts for asynchronous batch API calls. Non-real-time workloads — nightly reports, document analysis, bulk classification — are ideal candidates with zero quality trade-off.

Batch API discount (2026)
  • OpenAI Batch API: 50% off all token costs
  • Anthropic Message Batches: 50% off standard pricing
  • Use cases: doc processing, bulk tagging, analytics, embeddings
  • Processing time: minutes to 24 hours (vs milliseconds sync)
Implementation: Audit all AI calls for real-time necessity. Most analytics, classification, and reporting workflows can be shifted to batch. No engineering complexity — just use the batch endpoint.
50% off all eligible non-real-time workloads
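A sketch of preparing the JSONL input file. The per-line request shape follows the OpenAI Batch API; file upload and batch submission happen afterwards and are omitted here:

```python
# Sketch: build a JSONL file for a 50%-discounted batch job. Each line is
# one request with a custom_id for matching results back to inputs.
import json

def batch_line(custom_id, prompt, model="gpt-5-mini"):
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 200,
        },
    })

docs = ["Classify ticket: refund request", "Classify ticket: login failure"]
jsonl = "\n".join(batch_line(f"ticket-{i}", d) for i, d in enumerate(docs))
print(jsonl.count("\n") + 1, "requests in the batch file")
```

Write the string to a file, upload it, and create the batch with a 24h completion window; every token in it is billed at half price.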
09
Targeted Impact

Avoid Reasoning Models for Simple Tasks

Reasoning models (o1, Claude Extended Thinking) generate internal chain-of-thought tokens before responding — billed but invisible. For a FAQ bot or classifier, you may be paying 5–20× more than necessary.

When to use reasoning models
  • ✓ Multi-step mathematical/logical deduction
  • ✓ Code debugging with complex dependency chains
  • ✓ Long-horizon agentic planning
  • ✓ Legal/financial analysis requiring step-by-step reasoning
  • ✗ Summarisation, classification, translation, FAQ
Implementation: Add a reasoning-model gate: route to o1/extended thinking only when a classifier scores the task as requiring multi-step inference. Default to standard models.
5–20× cost reduction vs always-on reasoning
10
Enabler

Fine-Tuning for Repetitive High-Volume Tasks

Fine-tuning embeds task knowledge directly into the model, eliminating the need for lengthy few-shot examples in every prompt. For stable, high-volume tasks, fine-tuned smaller models can outperform generic larger ones at a fraction of the cost.

Economics (2026)
  • GPT-4.1 fine-tuning: $25/MTok training tokens
  • Inference savings: 40–60% fewer tokens per call
  • Self-hosted Llama fine-tuned: savings reach 20–25× vs proprietary API
  • Break-even: typically ~1M calls on the specific task
  • Only viable for stable tasks with 10K+ calls/month
Implementation: Only invest in fine-tuning for tasks where: (1) instructions are stable, (2) volume is high, (3) few-shot examples are large and repetitive. Avoid for evolving tasks or low-volume use cases.
40–75% long-term inference reduction

Combined Optimisation Potential

Prompt caching (KV cache)
90%
Semantic caching
73%
Model routing / right-sizing
60–80%
History truncation & summarisation
40–70%
Concise prompt engineering
30–55%
RAG context compression
25–80%
Batch processing
50%
Output length constraints
15–60%

* Savings are on applicable token spend per technique — not additive across all calls. Most production systems achieve 70–80% total spend reduction by combining prompt caching + model routing + context management. Source: Obvious Works 2026, Redis LLMOps Guide.

Infrastructure-Level Optimisation

Beyond prompt-level changes, these infrastructure techniques address token costs at the compute layer — particularly relevant for self-hosted deployments and high-scale agentic systems.

A
Self-Hosted

KV Cache Compression

For self-hosted deployments, the KV (key-value) cache is the dominant memory bottleneck. Recent research shows 70–90% memory reduction is achievable with minimal accuracy loss — enabling longer contexts on the same hardware.

2026 state-of-the-art (ICLR 2026)
  • Google TurboQuant: 6× memory reduction, zero accuracy loss, no calibration
  • NVIDIA KVTC: Up to 20× compression via PCA + entropy coding
  • FastKV: 1.82× faster prefill + 2.87× faster decoding vs baseline
  • FP8/INT4 quantisation: 2–4× memory reduction, supported in vLLM natively
Implementation: Use vLLM with PagedAttention for 14–24× higher throughput vs naive implementations. Apply FP8 KV quantisation on NVIDIA Hopper/Blackwell GPUs. For cutting-edge: TurboQuant/KVTC land at ICLR April 2026.
6–20× memory reduction → lower hardware cost
B
Self-Hosted

Speculative Decoding

Pair the target model with a lightweight draft model that proposes multiple tokens simultaneously. The target model verifies the batch in a single forward pass — achieving 2–5× faster generation with identical output quality.

2026 benchmarks
  • Standard speculative decoding: 2–4× faster inference
  • Speculative Speculative Decoding (Saguaro): 5× vs autoregressive
  • 14–17% throughput gain on Oracle OCI with A100 GPUs
  • Zero accuracy degradation — output is mathematically equivalent
Implementation: Supported natively in vLLM, SGLang, and TRT-LLM. DeepSeek uses Multi-Token Prediction (MTP) heads as a draft mechanism. EAGLE/EAGLE-2 are the most widely deployed speculative decoding variants as of 2026.
2–5× throughput increase → lower GPU hours per token
C
Agentic Systems

Agentic Workflow Optimisation

Multi-agent systems can consume 4–15× more tokens than single-agent calls if not carefully orchestrated. Parallel execution, tool fusion, and model tiering within agent graphs dramatically reduce token overhead.

Key agentic cost patterns
  • DAG-based topologies: Parallel instead of sequential tool calls
  • Tool fusion: Combine related tool calls → 12–40% token reduction
  • Model tiering: Haiku for sub-tasks, Sonnet for orchestration, Opus for core reasoning
  • Agent cost pre-estimation: Use LLM to evaluate plan cost before execution
  • Token quotas: Hard-cap tokens per agent per session to prevent runaway costs
Implementation: Set monthly quota limits ($) in agent platforms. Set execution limits per minute to prevent runaway loops. Set time limits per conversation. Track cost per agent per task. Source: Tonic3 Agentic Budget Framework 2025.
12–40% token reduction in multi-agent systems
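The hard-cap idea above can be sketched as a per-agent budget guard; the quota numbers are illustrative:

```python
# Sketch: hard per-agent token quota to stop runaway loops (numbers illustrative).
class TokenBudget:
    def __init__(self, quota):
        self.quota = quota
        self.used = 0

    def charge(self, tokens):
        """Record usage; raise before the quota would be breached."""
        if self.used + tokens > self.quota:
            raise RuntimeError("token quota exceeded: halting agent")
        self.used += tokens
        return self.quota - self.used    # remaining budget

budget = TokenBudget(quota=50_000)
print(budget.charge(20_000))   # 30000 left
print(budget.charge(25_000))   # 5000 left
# budget.charge(10_000) would now raise RuntimeError
```

Charging the budget inside the agent's tool-call loop turns a silent cost overrun into a loud, catchable failure.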
D
Infrastructure

Model Quantisation (Self-hosted)

Quantisation reduces model weight precision from FP16 to INT8 or INT4, cutting memory requirements by 2–4× with minimal quality loss. Enables running larger models on cheaper hardware.

Quantisation options
  • FP8 (8-bit): 2× memory savings, near-zero quality loss
  • INT4 (4-bit): 4× memory savings, <5% accuracy delta on most tasks
  • Llama 70B at INT4: runs on 2× A100 80GB vs 4× in FP16
  • Libraries: bitsandbytes, AutoGPTQ, GGUF (llama.cpp)
  • H100 GPUs: 80% more cost-efficient per token vs older hardware
Implementation: Use INT8 for production serving as the conservative default. Use INT4 for less critical inference workloads where you've tested quality. Always benchmark quality on your specific task before deploying quantised models.
2–4× memory reduction → lower infrastructure cost

Build vs Buy: The Hosting Decision

Where your model runs shapes per-token economics as much as which model you choose. At low volumes, API access wins. At scale, the economics flip decisively. Two-thirds of enterprises are now repatriating AI workloads on-premise (Finout 2026).

☁️

API Access

Pure Opex · Instant setup
  • ✓ No upfront investment
  • ✓ Instant start, infinite scale
  • ✓ Latest models available immediately
  • ✓ Best for spiky / exploratory workloads
  • ✗ Highest per-token cost
  • ✗ Costs scale linearly, unpredictably
  • ✗ No data sovereignty control
  • ✗ Vendor lock-in risk
$0.04–$168 per MTok output
Best below ~7B tokens/month

Neocloud (NCP)

Pure Opex · Instant setup
  • ✓ Purpose-built for AI workloads
  • ✓ Lower latency than hyperscalers
  • ✓ Dynamic GPU provisioning
  • ✓ Good mid-point before full ownership
  • ✗ No control over physical layer
  • ✗ High on-demand price variability
  • ✗ External data residency risk
~$1–$4/GPU hour
Cheaper than API at 49B+ tokens/year
🏭

AI Factory (On-Prem)

Capex Model · High control
  • ✓ Lowest per-token cost at scale
  • ✓ Full data sovereignty
  • ✓ Open-source models (free inference)
  • ✓ Custom fine-tuning, no vendor lock-in
  • ✗ Large upfront capex
  • ✗ Multi-month procurement
  • ✗ MLOps expertise required
  • ✗ GPU obsolescence risk (annual release cycles)
~$1–$2/GPU hour amortised
Wins decisively at 84B+ tokens/year

TCO Inflection Points (Deloitte Simulation)

49B tokens/yr

NCP becomes cheaper per-token than API access

67B tokens/yr

AI factory per-token cost drops below API access

84B tokens/yr

AI factory beats both API and NCP on per-token TCO

3-year horizon

AI factory delivers 50%+ savings vs API at equivalent token volumes

Year 1 — 10B tokens: API $1.06M · NCP $0.97M · Factory $0.49M
Year 2 — 300B tokens: API $3.50M · NCP $2.72M · Factory $1.45M

Source: Deloitte "The Pivot to Tokenomics" — AI Economics Report 2025
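The inflection points above can be approximated with a simple break-even formula. Every number below is illustrative (an assumed amortised fixed cost, marginal self-hosted rate, and blended API rate); it is not Deloitte's model:

```python
# Break-even sketch for the build-vs-buy decision, with illustrative inputs.
def breakeven_tokens(annual_fixed_cost, self_hosted_per_mtok, api_per_mtok):
    """Tokens/year above which self-hosting beats the API on cost."""
    saving_per_mtok = api_per_mtok - self_hosted_per_mtok
    return annual_fixed_cost / saving_per_mtok * 1_000_000

# $1.2M/yr amortised factory, $0.05/MTok marginal, $15/MTok blended API rate
tokens = breakeven_tokens(1_200_000, 0.05, 15.00)
print(f"~{tokens / 1e9:.0f}B tokens/yr to break even")
```

With these assumed inputs the crossover lands near 80B tokens/year, in the same neighbourhood as the simulation's 67–84B range; your own fixed costs and blended rates will move it.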

AI Factory TCO Breakdown (share of annual spend)

Compute (GPUs) 53% · $125,080

Largest direct cost. NVIDIA HGX B200 GPUs, high-bandwidth memory, accelerators. Dominant cost at 10B token scale.

Facilities, Power & Cooling 17% · $40,670

AI GPU racks draw 250–300kW vs 10–15kW for standard servers. Liquid cooling required at scale.

Software & Licensing 17% · $39,646

AI frameworks, orchestration tools, MLOps platforms, compliance tooling, enterprise support.

Networking 13% · $31,302

InfiniBand/NVLink GPU interconnects. High-bandwidth switches. Contributes 10–20% of TCO typically.

Token Governance & FinOps

You cannot optimise what you cannot see. AI FinOps is the emerging practice of applying cloud financial governance discipline to token-based spending — and it's now the #1 priority for FinOps teams in 2026.

State of FinOps 2026 (FinOps Foundation): 98% of respondents now use FinOps to manage AI spend (up from 31% in 2024). 58% cite AI cost management as their most desired skill addition. 33% named FinOps for AI as their top current or future priority — ahead of all others. Yet 80% of companies still miss AI infrastructure cost forecasts by more than 25%.

FinOps Maturity Model for Token Spend

Level 1 — Inform

Observability

  • Log input + output tokens per API call
  • Tag every request with model name, team, use-case
  • Build dashboards: tokens/user, cost/app, cost/department
  • Set monthly budget alerts by business unit
  • Surface prompt efficiency metrics
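The Level 1 practices above amount to a tagged spend ledger. A minimal in-memory sketch follows; field names are illustrative, and real deployments emit these events to Langfuse, Helicone, or similar:

```python
# Sketch: tag every call with model, team, and use-case, then roll up cost
# by any dimension for dashboards and budget alerts.
from collections import defaultdict

ledger = []

def log_call(model, team, use_case, input_tokens, output_tokens, cost_usd):
    ledger.append({"model": model, "team": team, "use_case": use_case,
                   "in": input_tokens, "out": output_tokens, "cost": cost_usd})

def cost_by(dimension):
    totals = defaultdict(float)
    for row in ledger:
        totals[row[dimension]] += row["cost"]
    return dict(totals)

log_call("gpt-5-mini", "support", "faq", 800, 150, 0.0005)
log_call("claude-sonnet", "legal", "review", 12_000, 2_000, 0.066)
log_call("gpt-5-mini", "support", "faq", 900, 180, 0.0006)
print(cost_by("team"))
```

Once every call is tagged, cost/app, cost/department, and budget alerts are just different roll-ups of the same ledger.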
Level 2 — Optimise

Cost Reduction

  • Activate prompt caching on all eligible system prompts
  • Implement model routing rules per use-case type
  • Apply concise prompt engineering across all templates
  • Shift non-real-time workloads to batch endpoints
  • Compress RAG context before sending to LLM
Level 3 — Operate

Governance

  • Track cost-per-unit-of-value (cost per resolved ticket, etc.)
  • Monthly review: top 10 token-consuming call types
  • Set hard quotas: $ per agent/month, calls/minute throttling
  • Establish model tier policies per use-case type
  • Auto-routing layer for least-cost capable model selection

Monitoring & Observability Tools (2026)

Langfuse
Open-source tracing · Self-hostable
Token cost dashboards, prompt versioning, multi-framework support. Best general-purpose LLM observability.
Helicone
AI Gateway · Open-source (Apache 2.0)
1-line integration (swap base URL). 300+ models. Intelligent caching + auto failover. 1T+ tokens/day processed across 24K+ orgs.
Portkey
AI Gateway · Routing + Fallback
Intelligent routing, load balancing, fallbacks. MCP Gateway for enterprise agent governance. Best for multi-provider reliability.
LangSmith
LangChain-native observability
Deep LangChain integration, trace capture, cost attribution. Best for LangChain-based pipelines.
AgentOps
AI Agent monitoring · Open-source
Session replay, agent-level tracing, cost per agent per task. Best for agentic systems.
Surveil / Finout
Enterprise AI FinOps
Token-level attribution, virtual tagging, anomaly detection, chargeback/showback by BU. Enterprise-grade cost governance.

Token Cost Calculator

Estimate your monthly API spend and see how the optimisation levers reduce it in real time. Toggle each lever to build your optimisation roadmap.

Estimates are illustrative. Real costs depend on actual token counts, cache hit rates, and provider-specific pricing rules. Always prototype and measure before budgeting.
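The calculator's arithmetic reduces to a few lines. In this sketch the rates are GPT-5's from the pricing table, and the 70% combined-savings factor is an illustrative stand-in for whichever levers you toggle:

```python
# Sketch of the calculator's arithmetic: monthly spend from daily token
# volumes and per-MTok rates, with a combined optimisation factor.
def monthly_cost(daily_in, daily_out, in_rate, out_rate, savings=0.0, days=30):
    baseline = days * (daily_in * in_rate + daily_out * out_rate) / 1_000_000
    return baseline, baseline * (1 - savings)

# 5M input / 1M output tokens per day at GPT-5 rates, 70% combined savings
base, optimised = monthly_cost(5_000_000, 1_000_000, 1.25, 10.00, savings=0.70)
print(f"${base:,.2f} -> ${optimised:,.2f} per month")
```

Swap in your own volumes and rates from the pricing table to reproduce any scenario in this guide.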