
Performance Tuning

QARK gives you direct control over the tradeoffs between speed, cost, and output quality. This guide covers the tuning levers that matter most.

The context strategy determines how QARK manages conversation history as it grows. Choose per conversation in the Config tab or set a default per agent.

| Strategy | Speed | Cost | Best For |
| --- | --- | --- | --- |
| none | Fastest | Zero overhead (but no history) | Sparks, Flows, stateless tool agents |
| token_budget | Fast | Predictable ceiling | Cost-sensitive workflows with a hard token cap |
| last_n | Fast | Predictable | Simple conversations where only recent messages matter |
| first_n | Fast | Predictable | Preserving initial system context |
| all | Variable | Grows linearly | Short conversations under 10 turns |
| auto_compact | Moderate | Moderate | Long-running sessions that need history access |
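
The selection logic for the simpler strategies can be sketched as follows. This is a minimal illustration; the function and parameter names are assumptions, not QARK's actual API:

```python
# Illustrative history selection per context strategy (not QARK's code).

def select_history(messages, strategy, n=10):
    """Return the subset of messages sent to the model each turn."""
    if strategy == "none":
        return []                 # stateless: no history at all
    if strategy == "last_n":
        return messages[-n:]      # keep only the most recent n messages
    if strategy == "first_n":
        return messages[:n]       # preserve the initial system context
    if strategy == "all":
        return messages           # cost grows linearly with turns
    # token_budget and auto_compact need token counting / summarization,
    # omitted from this sketch
    raise ValueError(f"unhandled strategy: {strategy}")
```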

auto_compact triggers a compaction pass when context usage crosses a threshold (default: 70% of the model’s context window). The compaction model summarizes older messages, freeing space for new content.

  • Threshold — Raise to 85% for fewer compaction passes. Lower to 50% for tighter context windows and faster turns.
  • Content overflow strategy — Choose between truncate (fast, drops oldest content) and summarize (preserves meaning, costs tokens).
  • Compaction model — Set in Settings → Sparks & Flows → Default compaction model. A fast, inexpensive model works here — compaction quality matters less than speed. A local model (Ollama) makes compaction essentially free.
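
The trigger condition amounts to a simple ratio check. A minimal sketch, assuming a should_compact helper that QARK's implementation may not expose under this name:

```python
# Illustrative check for when auto_compact fires (the threshold knob is
# real; the function itself is a sketch).

def should_compact(used_tokens, context_window, threshold=0.70):
    """Trigger a compaction pass once usage crosses the threshold."""
    return used_tokens / context_window >= threshold
```

With the default 70% threshold, a 128k-context model would compact at roughly 89,600 tokens of usage; lowering the threshold to 50% moves that point to 64,000 tokens.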

Every model in QARK’s registry carries speed and intelligence ratings. When latency matters more than peak reasoning:

  • Models rated 4–5 stars for speed typically respond in under 2 seconds for short prompts.
  • Models rated 1–2 stars for speed may take 5–15 seconds but produce stronger reasoning.
  • For multi-step Flows or agent chains, a speed-rated model at each step compounds time savings.

Check model ratings in the model picker — hover over any model to see speed, intelligence, context window, and pricing.

Local models via Ollama or LM Studio eliminate network round-trips. They outperform cloud on latency when:

  • The model is small (7B–14B parameters). Larger local models lose the latency advantage unless you have high-end hardware.
  • Network conditions are poor. VPN, satellite, or mobile hotspot connections add 200–500ms per request.
  • Privacy requirements prohibit cloud. Local inference keeps all data on-device with zero cost.

What to expect from local hardware:

  • With a GPU: A 7B model on a modern discrete GPU generates 40–80 tokens/second — comparable to cloud.
  • CPU-only: 5–15 tokens/second. Acceptable for compaction and auto-naming, not for primary use.
  • VRAM limits: If the model does not fit in VRAM, it spills to system RAM and performance degrades sharply.
| Task | Why Local Works |
| --- | --- |
| Conversation auto-naming | Fast, cheap, quality doesn’t matter much |
| Context compaction | Summarization at zero cost |
| Simple transforms (grammar, formatting) | Low latency, no API cost |
| Privacy-sensitive content | Never leaves your machine |

The RAG threshold (default: 30%) determines when documents are injected directly vs. processed through the full vector pipeline:

  • Documents smaller than the threshold % of context window → direct injection (fast, no embedding needed)
  • Documents larger → full RAG pipeline (chunking, embedding, semantic search)

Raise the threshold to inject more documents directly (faster, but uses more context). Lower it to route more through RAG (preserves context space, better for large document sets).
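
The routing decision reduces to comparing document size against a fraction of the context window. A sketch, with the function name assumed:

```python
# Illustrative direct-injection vs. RAG routing decision (not QARK's code).

def route_document(doc_tokens, context_window, rag_threshold=0.30):
    """Small docs go straight into context; large ones go through RAG."""
    if doc_tokens <= context_window * rag_threshold:
        return "direct_injection"   # fast, no embedding needed
    return "rag_pipeline"           # chunk, embed, semantic search
```

For a 128k-context model at the default 30% threshold, anything up to about 38,400 tokens is injected directly.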

Embedding dimensionality trades retrieval speed and storage against fidelity:

| Dimensions | Speed | Accuracy | Trade-off |
| --- | --- | --- | --- |
| 256–512 | Fastest | Good for narrow domains | Low storage, fast retrieval |
| 768–1024 | Moderate | Strong general-purpose | Best balance for most workloads |
| 1536–3072 | Slower | Highest fidelity | Use for precision-critical retrieval |
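
To size an index, a back-of-envelope calculation assuming float32 vectors (4 bytes per dimension; actual storage depends on the vector store):

```python
# Rough index size estimate for a given dimension count.

def index_size_mb(num_chunks, dimensions, bytes_per_dim=4):
    """Raw vector storage in megabytes, ignoring metadata overhead."""
    return num_chunks * dimensions * bytes_per_dim / 1e6

# 100k chunks at 512 dims is about 205 MB; at 3072 dims, about 1.2 GB.
```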

A reranker re-scores retrieved chunks using a cross-encoder model. QARK over-fetches 3x candidates, then the reranker picks the best.

  • No reranker — Fastest. Acceptable when the document set is small and embedding quality is high.
  • With reranker — Adds latency but significantly improves precision. Worth it for large document sets or when retrieval accuracy is critical.

Available rerankers: Cohere, Jina AI, Voyage AI.
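
The over-fetch-then-rerank flow can be sketched as below, with the vector search and cross-encoder scoring backends stubbed out as plain callables (these are illustrative stand-ins, not QARK's interfaces):

```python
# Sketch of over-fetch + rerank: retrieve 3x the candidates by vector
# similarity, then let a cross-encoder pick the top k.

def retrieve(query, k, vector_search, cross_encoder_score=None):
    if cross_encoder_score is None:
        return vector_search(query, k)            # no reranker: fastest path
    candidates = vector_search(query, 3 * k)      # over-fetch 3x candidates
    ranked = sorted(candidates,
                    key=lambda chunk: cross_encoder_score(query, chunk),
                    reverse=True)
    return ranked[:k]                             # keep only the best k
```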

Query strategies control how the search query itself is formed before retrieval:

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Semantic | Standard vector similarity search | Default. Works for most queries. |
| HyDE | Generates a hypothetical answer first, embeds that, searches | Better for factual questions where the query phrasing differs from the answer phrasing |
| Step-back | Generates a broader abstract query, searches both | Better for abstract or conceptual questions |
| Auto | QARK picks the best strategy | Let the system decide |
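
How these strategies transform the query before embedding can be sketched with the model call stubbed out (`generate` is a placeholder for an LLM call, not a QARK API; the prompts are illustrative):

```python
# Illustrative query transformation per strategy (not QARK's code).

def build_search_queries(query, strategy, generate):
    """Return the list of texts to embed and search with."""
    if strategy == "semantic":
        return [query]                              # embed the query as-is
    if strategy == "hyde":
        # Embed a hypothetical answer, which often lands closer to real
        # answers in embedding space than the question itself does.
        return [generate(f"Write a short answer to: {query}")]
    if strategy == "step_back":
        broader = generate(f"Restate this as a broader question: {query}")
        return [query, broader]                     # search both queries
    raise ValueError(f"unhandled strategy: {strategy}")
```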

Set token_budget as the context strategy with a fixed token limit:

  • 4,000–8,000 tokens — Routine tasks, low cost per turn
  • 16,000–32,000 tokens — Research and analysis with moderate history
  • 64,000+ — Use auto_compact instead for better history management

The budget caps the context sent per turn, not total conversation cost.
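
A cap of this kind amounts to dropping the oldest messages until the history fits. A sketch, with an assumed count_tokens helper standing in for real tokenization:

```python
# Illustrative token_budget trimming (not QARK's implementation).

def apply_token_budget(messages, budget, count_tokens):
    """Drop oldest messages until the per-turn context fits the budget."""
    trimmed = list(messages)
    while trimmed and sum(count_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)            # drop the oldest message first
    return trimmed
```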

Each tool invocation adds a round-trip. QARK’s tool turn limits:

| Scenario | Tool Turns |
| --- | --- |
| Default (no tools or standard tools) | 10 |
| With unix commands | 20 |
| With agent-tools or MCP tools | 50 |

Lower the tool turn limit on leaf agents with bounded work: an orchestrator dispatching 8 sub-agents may need 50 turns, while a formatting agent needs 2.
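
The limit behaves like a loop bound on tool dispatch. An illustrative sketch, not QARK's agent loop:

```python
# Sketch of a tool-turn cap: the agent loop stops dispatching tools
# once the per-scenario limit is hit.

def run_agent(step, max_tool_turns=10):
    """`step` returns a tool call to execute, or None when done."""
    turns = 0
    while turns < max_tool_turns:
        tool_call = step()
        if tool_call is None:
            return turns          # agent finished within the limit
        tool_call()               # execute the tool round-trip
        turns += 1
    return turns                  # limit reached; the loop is cut off
```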

To keep costs down:

  • Cheap models for iteration, frontier for final. Draft with a fast model, then switch to a reasoning model for the final pass.
  • Local models for zero-cost tasks. Compaction, auto-naming, and simple transforms cost nothing with Ollama.
  • none strategy for Sparks and Flows. Stateless execution avoids history accumulation.
  • Monitor the budget dashboard. Per-provider spending breakdown with model-level detail shows where your budget goes.

Two factors affect perceived speed:

  1. Time to first token (TTFT) — Determined by the provider and model. Smaller models and edge-deployed providers have lower TTFT.
  2. Tool execution gaps — When the model invokes a tool, streaming pauses until the tool returns. Faster tool backends (local file access vs. remote API) minimize gaps.

QARK renders streamed content with 80ms throttled rendering and deferred syntax highlighting via requestAnimationFrame — the UI stays responsive even during high-throughput streams.
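
The throttling idea, sketched outside the browser (a Python stand-in for illustration with timestamps injected explicitly; QARK's actual renderer uses requestAnimationFrame):

```python
# Sketch of throttled rendering: buffer streamed tokens and flush to
# the UI at most once per interval.

class ThrottledRenderer:
    def __init__(self, render, interval_ms=80):
        self.render = render
        self.interval = interval_ms
        self.buffer = []
        self.last_flush = float("-inf")

    def feed(self, token, now_ms):
        self.buffer.append(token)
        if now_ms - self.last_flush >= self.interval:
            self.render("".join(self.buffer))   # one paint per interval
            self.buffer.clear()
            self.last_flush = now_ms
```

Batching paints this way keeps per-token UI work constant no matter how fast the model streams.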