Model Management

QARK fetches the full model list from each connected provider’s API and caches it for 5 minutes. Connect a provider, and every model it offers appears automatically — no manual entry.

Browse models

Open Settings → Providers → [Provider] to see all available models. The list refreshes every 5 minutes, so newly released models appear shortly after their API availability.

Filter by name, capability, or category to narrow results when a provider offers dozens of models.

Model details

Select any model to see its full spec sheet:

Field	What it tells you
Context window	Maximum input tokens per request — ranges from 8K (Gemma 2 9B) to 2M (Grok 4.1 Fast)
Max output tokens	Upper bound on generated tokens per response — up to 128K on flagship models
Thinking mode	Whether the model supports chain-of-thought reasoning: adaptive (model decides), manual (you toggle), always (cannot disable), or none
Vision	Whether the model accepts image inputs — screenshots, diagrams, photos, documents
Tool use	Whether the model can call tools via the @mention system
Pricing	Input and output cost per million tokens, pulled from the provider. Thinking tokens may have separate pricing
Speed rating	Relative latency ranking — first-token time and tokens-per-second
Intelligence rating	Relative reasoning and output quality ranking

These details come from QARK’s model registry, which tracks specs for every model across all providers. The registry updates with each app release.

Control visibility

Hide models you don’t use to keep the model picker clean:

Hide individual models — remove specific models from the picker without affecting others from the same provider.
Hide model groups — suppress an entire family in one action (e.g., all legacy GPT-3.5 variants, all Gemma 2B models).

Hidden models won’t appear in the model picker or as category default options. Unhide them anytime from the provider’s settings panel.

Default models across 9 categories

QARK uses models for more than conversation. Each category can have a different default model assigned:

Category	What it powers
Chat	Primary conversational responses
Sparks	Quick completions from the global overlay
Compaction	Summarizing older messages when context fills up
Embedding	Converting text to vectors for document search
RAG Generation	Generating answers grounded in retrieved document chunks
Reranking	Cross-encoder re-scoring of search results
Image Extraction	Extracting text and structure from images in documents
Image Generation	Creating images from text prompts
Video Generation	Producing video from text or image prompts

Set defaults in Settings → Providers → Model Defaults. QARK uses the assigned model for that category everywhere unless you override it per conversation.

Strategy tip: Assign a fast, cheap model for compaction (it runs frequently in long conversations) and a high-quality model for chat. Use a local Ollama model for embedding to avoid per-token costs entirely.

Switch models mid-conversation

You’re not locked into a model for the duration of a conversation. Open the model picker from any active conversation to switch. Your full conversation history carries over — the new model picks up with the entire context intact.

Practical uses:

Draft with a fast model (GPT-4.1 Nano, Llama 3.1 8B), then switch to a frontier model (Claude Opus 4.6, GPT-5.4) for the final pass.
Start with a thinking model for complex reasoning, switch to a non-thinking model for straightforward follow-ups.
Compare outputs by switching models and regenerating the same message.

Each model’s provider accent color appears on the tab border, so you can see at a glance which model is active across split panes.

Budget management

QARK tracks every token and every dollar across all providers. You can set per-provider monthly spending limits and monitor usage in real time.

Set budgets

Open Settings → Budget. Each connected provider has:

Enable/disable toggle — turn budget enforcement on or off per provider.
Monthly limit — set a dollar amount (e.g., $20/mo for OpenAI, $50/mo for Anthropic). The limit resets on the first of each calendar month.

Screenshot: Budget settings showing per-provider cards with progress bars, toggle switches, and monthly limit inputs

Cost tracking

Every LLM call records a cost entry to an append-only ledger:

Tracked	Detail
Input tokens	Tokens sent to the model (your messages + context)
Output tokens	Tokens generated by the model
Thinking tokens	Chain-of-thought tokens (tracked separately — some models price these differently)
Cost (USD)	Calculated from the model’s per-million-token pricing
Purpose	What the call was for: chat, embedding, compaction, etc.
Model	Exact model used

The cost ledger is append-only — entries survive even if you delete the conversation or messages they belong to. Historical spending data is never lost.

Per-message cost appears as a badge on each response (tokens in/out + USD). Per-conversation totals are visible in the Info panel.

Budget enforcement

QARK checks your budget before every LLM call:

At 80% of limit — a warning toast appears: “Approaching budget limit for [provider]. $X of $Y (Z%).” The message still sends.
At 100% of limit — the message is blocked. QARK returns an error: “Monthly budget exceeded for [provider].” Switch to a different provider or increase the limit to continue.

Usage dashboard

The budget dashboard in Settings → Budget shows:

Current month summary — total spending across all providers with a stacked bar chart. The top 3 spenders are color-coded; remaining providers are grouped as “Others.”

Spending timeline — the last 3 months (expandable), with month-over-month trend indicators (percentage up or down). Click any month to expand per-provider breakdown.

Provider cards — sorted by spending, each showing:

Amount spent this month
Progress bar: green (under 80%), yellow (80–99%), red (at or over 100%)
Remaining budget
Link to detailed usage history

Usage history — per-provider modal with monthly breakdown table: month, message count, input tokens, output tokens, cost. Expand any row to see model-level breakdown sorted by cost. All-time totals at the bottom.

Cost optimization strategies

Strategy	How
Cheap models for drafts	Use GPT-4.1 Nano or Llama 3.1 8B for iteration, frontier models for final output
Local models for zero cost	Ollama or LM Studio for embedding, compaction, or low-stakes chat
Token budget context strategy	Cap how many tokens of history are sent per request
Auto-compact threshold	Lower the compaction threshold to summarize context sooner, reducing input tokens
Skip reranking	Reranking adds a second LLM call per search — disable it if vector search alone is accurate enough
Monitor thinking tokens	Models with always-on thinking (Grok 4, DeepSeek R1) generate thinking tokens on every message — these add up