Tokens & Costs

Language models do not process text character by character. They break text into tokens — chunks that average roughly 4 characters in English. Understanding tokens helps you predict costs, optimize context usage, and choose the right model for each task.

What Tokens Are

A token is the smallest unit a language model reads and generates. Some examples:

"hello" → 1 token
"Hello, world!" → 4 tokens
A 1,000-word English document → approximately 1,300 tokens
A block of Python code → typically more tokens than equivalent English prose (punctuation and indentation each consume tokens)

The exact tokenization varies by model family. GPT-4o, Claude, and Llama each use different tokenizers. QARK handles the differences — you see consistent token counts regardless of provider.

Three Types of Tokens

Every API call involves up to three token categories, each priced independently:

Input Tokens

Everything you send to the model: the system prompt, conversation history, tool definitions, retrieved RAG context, and your latest message. Input tokens are the largest controllable cost factor.

Output Tokens

Everything the model generates in response: the assistant message, tool call arguments, and structured output. Output tokens are typically priced 2–5x higher than input tokens per unit.

Thinking Tokens

When chain-of-thought thinking is enabled, some models (notably Anthropic’s Claude with extended thinking) produce internal reasoning tokens. These appear in collapsible thinking blocks in the conversation and are billed separately from regular output tokens. Not all models charge for thinking tokens; check the model registry for pricing details.

The Cost Formula

Every message cost follows this calculation:

cost = (input_tokens × input_price_per_token)
     + (output_tokens × output_price_per_token)
     + (thinking_tokens × thinking_price_per_token)

Prices come from the model registry, which QARK keeps current. For example, with a model priced at $3.00 per million input tokens and $15.00 per million output tokens:

Sending 2,000 input tokens and receiving 500 output tokens costs: (2000 × $0.000003) + (500 × $0.000015) = $0.006 + $0.0075 = $0.0135
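
The calculation above can be sketched in a few lines of Python. This is an illustration only, not QARK's implementation: the function name message_cost and its parameters are hypothetical, and prices are passed per million tokens as strings so that Decimal keeps the arithmetic exact.

```python
from decimal import Decimal

def message_cost(input_tokens, output_tokens, thinking_tokens=0, *,
                 input_price_per_m, output_price_per_m,
                 thinking_price_per_m="0"):
    """Return the USD cost of one message (hypothetical helper).

    Prices are quoted per million tokens, matching the example above.
    """
    per_million = Decimal(1_000_000)
    total = (
        Decimal(input_tokens) * Decimal(input_price_per_m)
        + Decimal(output_tokens) * Decimal(output_price_per_m)
        + Decimal(thinking_tokens) * Decimal(thinking_price_per_m)
    )
    return total / per_million

# The worked example from the text: $3.00/M input, $15.00/M output.
cost = message_cost(2000, 500,
                    input_price_per_m="3.00",
                    output_price_per_m="15.00")
assert cost == Decimal("0.0135")
```

Decimal is used here instead of float because per-token prices are very small and binary floating point can introduce rounding noise when many messages are summed.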

Context Window vs Max Output Tokens

Two limits define a model’s capacity — the model registry displays both:

Context window — The total number of tokens the model can process in a single request (input + output combined). Ranges from 4,096 tokens on smaller models to 200,000+ on frontier models.
Max output tokens — The maximum number of tokens the model can generate in a single response. Always smaller than the context window. Typically 4,096–16,384 tokens, though some models support up to 64,000+.

If your input consumes most of the context window, the model has fewer tokens available for its response. QARK’s context strategies help you manage this tradeoff.
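
As a rough sketch of that tradeoff, the room left for a response is bounded by both limits. The function name and the model figures below are assumptions chosen for illustration, not values from the model registry.

```python
def available_output_tokens(context_window, max_output_tokens, input_tokens):
    """Tokens the model can still generate for a given input size.

    The response is limited by whichever is smaller: the space left in
    the context window or the model's max output tokens.
    """
    remaining_in_window = max(context_window - input_tokens, 0)
    return min(remaining_in_window, max_output_tokens)

# Hypothetical model: 200,000-token window, 16,384-token output cap.
small_input = available_output_tokens(200_000, 16_384, 50_000)
large_input = available_output_tokens(200_000, 16_384, 190_000)

assert small_input == 16_384   # the output cap is the binding limit
assert large_input == 10_000   # the context window is the binding limit
```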

Where to Monitor Costs

QARK surfaces cost data at three levels of granularity:

Per-Message Badge

Figure: An assistant message with its token/cost breakdown badge, showing input tokens, output tokens, and calculated cost.

Every assistant message displays a badge showing:

Input token count
Output token count
Thinking token count (when applicable)
Calculated cost in USD

Click the badge to see the full breakdown including the model used and per-token pricing.

Conversation Info Tab

Open the Info Panel for any conversation. The overview shows:

Total tokens consumed (input + output + thinking, summed across all messages)
Total cost for the entire conversation
Number of API calls made

Budget Dashboard

The budget dashboard aggregates costs across all conversations, grouped by:

Time period (daily, weekly, monthly)
Provider
Model

Set monthly budget limits to receive warnings when spending approaches your threshold.
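
To make the grouping and the budget warning concrete, here is a minimal sketch. The record layout, the provider and model labels, the sample figures, and the 80% warning threshold are all assumptions for illustration; the source does not state how QARK stores usage or when exactly it warns.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical usage records: (ISO date, provider, model, cost in USD).
records = [
    ("2025-01-03", "anthropic", "model-a", Decimal("1.20")),
    ("2025-01-17", "openai", "model-b", Decimal("0.45")),
    ("2025-01-21", "anthropic", "model-a", Decimal("2.60")),
    ("2025-02-02", "ollama", "model-c", Decimal("0.00")),
]

def totals_by(records, key):
    """Sum costs per group, where key(record) names the group."""
    totals = defaultdict(Decimal)  # Decimal() starts each group at 0
    for record in records:
        totals[key(record)] += record[3]
    return dict(totals)

def budget_status(spent, limit, warn_at=Decimal("0.8")):
    """Classify spending against a monthly limit (threshold is assumed)."""
    if spent >= limit:
        return "over"
    if spent >= limit * warn_at:
        return "warning"
    return "ok"

by_month = totals_by(records, lambda r: r[0][:7])  # "YYYY-MM"
by_provider = totals_by(records, lambda r: r[1])

assert by_month["2025-01"] == Decimal("4.25")
assert by_provider["anthropic"] == Decimal("3.80")
assert budget_status(by_month["2025-01"], Decimal("5.00")) == "warning"
```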

Estimated Tokens During Streaming

While a response streams in, QARK cannot know the final token count until the API returns usage data. During streaming, QARK estimates output tokens using:

estimated_tokens = content.length / 4

This approximation (based on the ~4 characters per token average) updates in real time as content arrives. The badge switches to exact counts once the response completes and the API reports actual usage.
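
The running estimate can be sketched as follows. The function names are hypothetical, and rounding down to a whole number is an assumption; the formula above does not say how fractional results are handled.

```python
def estimate_tokens(content: str) -> int:
    """Rough token estimate from the ~4 characters per token average."""
    return len(content) // 4  # floor division: an assumed rounding choice

def stream_with_estimates(chunks):
    """Yield an updated estimate each time a streamed chunk arrives."""
    content = ""
    for chunk in chunks:
        content += chunk
        yield estimate_tokens(content)

# Simulated stream: cumulative lengths are 11, 21, and 26 characters.
chunks = ["Tokens are ", "chunks of ", "text."]
estimates = list(stream_with_estimates(chunks))

assert estimates == [2, 5, 6]
```

The estimate only ever grows as content arrives, which is why a badge driven by it can update smoothly before the exact usage figures replace it.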

Cost Optimization Strategies

Choose a Smaller Context Strategy

Context strategies that send fewer historical messages reduce input tokens. A sliding window of the last 10 messages costs significantly less than sending the full conversation history. The token_budget strategy lets you set an explicit cap on input tokens per request.
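
The two ideas in this paragraph can be sketched as below. This is not QARK's implementation of its strategies: the function names, the newest-first trimming rule, and the character-based token counter are assumptions made for illustration.

```python
def sliding_window(messages, last_n=10):
    """Keep only the most recent last_n messages."""
    return messages[-last_n:] if last_n > 0 else []

def fit_token_budget(messages, max_input_tokens, count_tokens):
    """Keep the newest messages whose combined tokens fit the cap.

    Walks backward from the latest message and stops at the first
    message that would exceed the budget, so the kept history stays
    contiguous.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        tokens = count_tokens(message)
        if used + tokens > max_input_tokens:
            break
        kept.append(message)
        used += tokens
    kept.reverse()  # restore chronological order
    return kept

def rough_count(message):
    """Stand-in counter using the ~4 characters per token average."""
    return len(message) // 4

history = [f"message {i}: " + "x" * 40 for i in range(20)]

windowed = sliding_window(history)
budgeted = fit_token_budget(history, 50, rough_count)

assert len(windowed) == 10
assert len(budgeted) == 3  # each recent message is ~13 tokens
assert budgeted[0].startswith("message 17")
```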

Use Cheaper Models for Iteration

Reserve frontier models for final outputs. During iteration (debugging, brainstorming, drafting), a model priced at $0.25 per million input tokens performs adequately for many tasks and costs one-twelfth as much per input token as a $3.00/M model.

Run Local Models at Zero Cost

Models served through Ollama or other local inference engines incur no per-token charges. QARK tracks token counts for local models (useful for context management) but reports $0.00 cost.

Leverage Model Speed and Intelligence Ratings

The model registry includes speed and intelligence ratings for each model. Sort by these ratings to find the best cost-performance tradeoff. For your specific use case, a model rated 8/10 for intelligence at $1.00/M input tokens may be better value than one rated 9/10 at $15.00/M.
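
One way to reason about this tradeoff is to set a minimum intelligence rating and then rank the qualifying models by price. The sketch below assumes a hypothetical ModelInfo record and made-up model names; only the ratings and prices echo the figures used in this section.

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    """Hypothetical registry entry; field names are assumptions."""
    name: str
    intelligence: int        # rating out of 10
    speed: int               # rating out of 10
    input_price_per_m: float # USD per million input tokens

def cheapest_meeting(models, min_intelligence):
    """Models at or above the rating floor, cheapest first."""
    candidates = [m for m in models if m.intelligence >= min_intelligence]
    return sorted(candidates, key=lambda m: m.input_price_per_m)

models = [
    ModelInfo("frontier", intelligence=9, speed=5, input_price_per_m=15.00),
    ModelInfo("mid", intelligence=8, speed=7, input_price_per_m=1.00),
    ModelInfo("small", intelligence=6, speed=9, input_price_per_m=0.25),
]

ranked = cheapest_meeting(models, min_intelligence=8)

assert [m.name for m in ranked] == ["mid", "frontier"]
```

Filtering first and sorting second keeps the quality requirement explicit, rather than folding both into a single score whose weighting would be arbitrary.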

Monitor Compaction Costs

When QARK compacts a conversation (summarizing older messages to free context space), the compaction itself consumes tokens. These costs are tracked separately in the conversation’s cost breakdown so you can distinguish between productive work and housekeeping overhead.