Tokens & Costs

Language models do not process text character by character. They break text into tokens — chunks that average roughly 4 characters in English. Understanding tokens helps you predict costs, optimize context usage, and choose the right model for each task.

A token is the smallest unit a language model reads and generates. Some examples:

  • "hello" → 1 token
  • "Hello, world!" → 4 tokens
  • A 1,000-word English document → approximately 1,300 tokens
  • A block of Python code → typically more tokens than equivalent English prose (punctuation and indentation each consume tokens)

The exact tokenization varies by model family. GPT-4o, Claude, and Llama each use different tokenizers. QARK handles the differences — you see consistent token counts regardless of provider.
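The word-count rule of thumb above can be turned into a rough estimator. This is a minimal sketch, not QARK's tokenizer: the function name is hypothetical, and the 1.3 tokens-per-word ratio is simply the "1,000 words ≈ 1,300 tokens" figure quoted above, which holds only approximately for English prose.

```python
# Rough ratio taken from the "1,000 words ≈ 1,300 tokens" example above.
# Real tokenizers differ per model family, so treat the result as a ballpark.
TOKENS_PER_WORD = 1.3


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate the token count of English prose from its word count."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    return round(word_count * TOKENS_PER_WORD)


print(estimate_tokens_from_words(1000))  # 1300
```

Code and non-English text usually tokenize less efficiently, so an estimate like this will tend to undercount for those inputs.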

Every API call involves up to three token categories, each priced independently:

Input tokens are everything you send to the model: the system prompt, conversation history, tool definitions, retrieved RAG context, and your latest message. They are the largest controllable cost factor.

Output tokens are everything the model generates in response: the assistant message, tool call arguments, and structured output. They are typically priced 2–5x higher per token than input tokens.

Thinking tokens are internal reasoning tokens that some models (notably Anthropic’s Claude with extended thinking) produce when chain-of-thought thinking is enabled. They appear in collapsible thinking blocks in the conversation and are billed as their own category. Not all models charge for thinking tokens, so check the model registry for pricing details.

Every message cost follows this calculation:

cost = (input_tokens × input_price_per_token)
+ (output_tokens × output_price_per_token)
+ (thinking_tokens × thinking_price_per_token)

Prices come from the model registry, which QARK keeps current. For example, with a model priced at $3.00 per million input tokens and $15.00 per million output tokens:

  • Sending 2,000 input tokens and receiving 500 output tokens costs: (2000 × $0.000003) + (500 × $0.000015) = $0.006 + $0.0075 = $0.0135
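The calculation above can be sketched in a few lines. The `Pricing` class and `message_cost` function are hypothetical names used for illustration, not QARK's API; prices are expressed per million tokens, as in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per million tokens (hypothetical structure)."""
    input_per_million: float
    output_per_million: float
    thinking_per_million: float = 0.0  # assumed 0 when a model does not bill thinking


def message_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 thinking_tokens: int = 0) -> float:
    """Apply the three-term cost formula and return the cost in USD."""
    total_per_million = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
        + thinking_tokens * pricing.thinking_per_million
    )
    return total_per_million / 1_000_000


pricing = Pricing(input_per_million=3.00, output_per_million=15.00)
cost = message_cost(pricing, input_tokens=2000, output_tokens=500)
print(f"${cost:.4f}")  # $0.0135
```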

Two limits define a model’s capacity — the model registry displays both:

  • Context window — The total number of tokens the model can process in a single request (input + output combined). Ranges from 4,096 tokens on smaller models to 200,000+ on frontier models.
  • Max output tokens — The maximum number of tokens the model can generate in a single response. Always smaller than the context window. Typically 4,096–16,384 tokens, though some models support up to 64,000+.

If your input consumes most of the context window, the model has fewer tokens available for its response. QARK’s context strategies help you manage this tradeoff.
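The interaction between the two limits can be sketched as follows. This is an illustrative helper with a hypothetical name, assuming the context window covers input and output combined, as described above.

```python
def available_output_tokens(context_window: int, max_output_tokens: int,
                            input_tokens: int) -> int:
    """Return how many tokens the model can still generate for this request.

    The response is capped by whichever is smaller: the model's max output
    limit, or the room left in the context window after the input.
    """
    remaining = context_window - input_tokens
    return max(0, min(max_output_tokens, remaining))


# Plenty of room: the max output limit is the binding constraint.
print(available_output_tokens(200_000, 16_384, 50_000))   # 16384
# Input nearly fills the window: only the leftover space is available.
print(available_output_tokens(200_000, 16_384, 190_000))  # 10000
```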

QARK surfaces cost data at three levels of granularity:

Figure: assistant message with a token/cost breakdown badge showing input tokens, output tokens, and calculated cost.

Every assistant message displays a badge showing:

  • Input token count
  • Output token count
  • Thinking token count (when applicable)
  • Calculated cost in USD

Click the badge to see the full breakdown including the model used and per-token pricing.

Open the Info Panel for any conversation. The overview shows:

  • Total tokens consumed (input + output + thinking, summed across all messages)
  • Total cost for the entire conversation
  • Number of API calls made
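A conversation overview like the one above amounts to summing per-message usage. The sketch below is an assumption about how such totals could be derived, not QARK's implementation: `MessageUsage` and `conversation_summary` are hypothetical names, it assumes one usage record per API call, and the sample figures are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MessageUsage:
    """Usage reported for one API call (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int
    thinking_tokens: int
    cost_usd: float


def conversation_summary(messages: Iterable[MessageUsage]) -> dict:
    """Aggregate token counts, cost, and call count across a conversation."""
    records = list(messages)
    return {
        "total_tokens": sum(
            m.input_tokens + m.output_tokens + m.thinking_tokens for m in records
        ),
        "total_cost_usd": sum(m.cost_usd for m in records),
        "api_calls": len(records),
    }


history = [
    MessageUsage(2000, 500, 0, 0.0135),
    MessageUsage(2600, 400, 100, 0.0153),
]
summary = conversation_summary(history)
print(summary["total_tokens"], summary["api_calls"])  # 5600 2
```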

The budget dashboard aggregates costs across all conversations, grouped by:

  • Time period (daily, weekly, monthly)
  • Provider
  • Model

Set monthly budget limits to receive warnings when spending approaches your threshold.
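Grouping and threshold warnings of this kind can be sketched as below. The record fields, the function names, and the 80% warning ratio are all assumptions made for illustration; the source does not say at what point QARK issues a warning.

```python
from collections import defaultdict
from typing import Iterable


def group_costs(records: Iterable[dict], key: str) -> dict:
    """Sum cost_usd per distinct value of `key` (e.g. "provider", "model", "month")."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)


def budget_status(spent_usd: float, limit_usd: float, warn_ratio: float = 0.8) -> str:
    """Classify spending against a monthly limit. warn_ratio is an assumed default."""
    if spent_usd >= limit_usd:
        return "exceeded"
    if spent_usd >= warn_ratio * limit_usd:
        return "warning"
    return "ok"


# Illustrative records with hypothetical provider and model names.
records = [
    {"provider": "provider-a", "model": "model-x", "month": "2025-01", "cost_usd": 4.0},
    {"provider": "provider-a", "model": "model-y", "month": "2025-01", "cost_usd": 2.5},
    {"provider": "provider-b", "model": "model-z", "month": "2025-01", "cost_usd": 2.0},
]
by_provider = group_costs(records, "provider")
monthly_total = sum(group_costs(records, "month").values())
print(by_provider)
print(budget_status(monthly_total, limit_usd=10.0))
```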

While a response streams in, QARK cannot know the final token count until the API returns usage data. During streaming, QARK estimates output tokens using:

estimated_tokens = content.length / 4

This approximation (based on the ~4 characters per token average) updates in real time as content arrives. The badge switches to exact counts once the response completes and the API reports actual usage.
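The estimate-then-replace behavior described above might look like the following. This is a sketch under assumptions: the class and method names are hypothetical, and the estimate is floored to a whole number, which the source formula does not specify.

```python
from typing import Optional


class StreamingTokenEstimate:
    """Track an output-token estimate while a response streams in."""

    CHARS_PER_TOKEN = 4  # the ~4 characters per token average

    def __init__(self) -> None:
        self._chars = 0
        self._exact: Optional[int] = None

    def on_chunk(self, text: str) -> None:
        """Called for each streamed content chunk."""
        self._chars += len(text)

    def on_usage(self, output_tokens: int) -> None:
        """Called once the API reports actual usage at the end of the response."""
        self._exact = output_tokens

    @property
    def is_exact(self) -> bool:
        return self._exact is not None

    @property
    def tokens(self) -> int:
        """Exact count if available, otherwise the character-based estimate."""
        if self._exact is not None:
            return self._exact
        return self._chars // self.CHARS_PER_TOKEN


estimate = StreamingTokenEstimate()
for chunk in ["Hello, ", "world!"]:
    estimate.on_chunk(chunk)
print(estimate.tokens, estimate.is_exact)  # 3 False
estimate.on_usage(4)
print(estimate.tokens, estimate.is_exact)  # 4 True
```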

Context strategies that send fewer historical messages reduce input tokens. A sliding window of the last 10 messages costs significantly less than sending the full conversation history. The token_budget strategy lets you set an explicit cap on input tokens per request.
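Both ideas can be sketched generically. The functions below are illustrative and are not QARK's implementation of its sliding window or token_budget strategies; the names, the message shape, and the choice to keep the most recent messages first are assumptions.

```python
from typing import Callable, List, TypeVar

M = TypeVar("M")


def sliding_window(messages: List[M], last_n: int = 10) -> List[M]:
    """Keep only the most recent `last_n` messages."""
    if last_n <= 0:
        return []
    return messages[-last_n:]


def fit_token_budget(messages: List[M], max_input_tokens: int,
                     count_tokens: Callable[[M], int]) -> List[M]:
    """Keep the newest messages whose combined token count fits the cap.

    Walks backwards from the latest message and stops at the first one
    that would push the total over `max_input_tokens`.
    """
    kept: List[M] = []
    used = 0
    for message in reversed(messages):
        tokens = count_tokens(message)
        if used + tokens > max_input_tokens:
            break
        kept.append(message)
        used += tokens
    kept.reverse()  # restore chronological order
    return kept


# 25 messages of 100 tokens each (illustrative data).
history = [{"id": i, "tokens": 100} for i in range(25)]
print(len(sliding_window(history)))  # 10
trimmed = fit_token_budget(history, 350, lambda m: m["tokens"])
print([m["id"] for m in trimmed])  # [22, 23, 24]
```

A real strategy would typically also reserve room for the system prompt and tool definitions, which this sketch leaves out.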

Reserve frontier models for final outputs. During iteration — debugging, brainstorming, drafting — a model priced at $0.25 per million input tokens performs adequately for many tasks and costs 12x less than a $3.00/M model.

Models served through Ollama or other local inference engines incur no per-token charges. QARK tracks token counts for local models (useful for context management) but reports $0.00 cost.

Leverage Model Speed and Intelligence Ratings

The model registry includes speed and intelligence ratings for each model. Sort by these ratings to find the best cost-performance tradeoff. For your specific use case, a model rated 8/10 intelligence at $1.00/M input tokens may be a better choice than one rated 9/10 at $15.00/M.
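One way to act on these ratings is to set a minimum intelligence rating for the task and then pick the cheapest model that meets it. The sketch below assumes a simple list of registry entries; the field names, the model names, and the selection rule are hypothetical, and the numbers are the ones from the example above.

```python
from typing import List


def cheapest_meeting(models: List[dict], min_intelligence: int) -> List[dict]:
    """Return models at or above the rating threshold, cheapest input price first."""
    candidates = [m for m in models if m["intelligence"] >= min_intelligence]
    return sorted(candidates, key=lambda m: m["input_price_per_million"])


# Hypothetical registry entries using the figures from the example above.
registry = [
    {"name": "model-b", "intelligence": 9, "input_price_per_million": 15.00},
    {"name": "model-a", "intelligence": 8, "input_price_per_million": 1.00},
]

ranked = cheapest_meeting(registry, min_intelligence=8)
print([m["name"] for m in ranked])  # ['model-a', 'model-b']
```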

When QARK compacts a conversation (summarizing older messages to free context space), the compaction itself consumes tokens. These costs are tracked separately in the conversation’s cost breakdown so you can distinguish between productive work and housekeeping overhead.