
Performance Tuning

QARK gives you direct control over the tradeoffs between speed, cost, and output quality. This guide covers the tuning levers that matter most.

The context strategy determines how QARK manages conversation history as it grows. Choose per conversation in the Config tab or set a default per agent.

| Strategy | Speed | Cost | Best For |
| --- | --- | --- | --- |
| none | Fastest | Zero overhead (but no history) | Sparks, Flows, stateless tool agents |
| token_budget | Fast | Predictable ceiling | Cost-sensitive workflows with a hard token cap |
| last_n | Fast | Predictable | Simple conversations where only recent messages matter |
| first_n | Fast | Predictable | Preserving initial system context |
| all | Variable | Grows linearly | Short conversations under 10 turns |
| auto_compact | Moderate | Moderate | Long-running sessions that need history access |
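
The selection logic for the simpler strategies can be sketched as follows. This is a minimal illustration; the function and parameter names are assumptions, not QARK's actual API:

```python
# Illustrative history selection per context strategy (not QARK's code).

def select_history(messages, strategy, n=10):
    """Return the subset of messages sent to the model each turn."""
    if strategy == "none":
        return []                 # stateless: no history at all
    if strategy == "last_n":
        return messages[-n:]      # keep only the most recent n messages
    if strategy == "first_n":
        return messages[:n]       # preserve the initial system context
    if strategy == "all":
        return messages           # cost grows linearly with turns
    # token_budget and auto_compact need token counting / summarization,
    # omitted from this sketch
    raise ValueError(f"unhandled strategy: {strategy}")
```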

auto_compact triggers a compaction pass when context usage crosses a threshold (default: 70% of the model’s context window). The compaction model summarizes older messages, freeing space for new content.

  • Threshold — Raise to 85% for fewer compaction passes. Lower to 50% for tighter context windows and faster turns.
  • Content overflow strategy — Choose between truncate (fast, drops oldest content) and summarize (preserves meaning, costs tokens).
  • Compaction model — Set in Settings → Sparks & Flows → Default compaction model. A fast, inexpensive model works here — compaction quality matters less than speed. A local model (Ollama) makes compaction essentially free.
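
The trigger condition amounts to a simple ratio check. A minimal sketch, assuming a should_compact helper that QARK's implementation may not expose under this name:

```python
# Illustrative check for when auto_compact fires (the threshold knob is
# real; the function itself is a sketch).

def should_compact(used_tokens, context_window, threshold=0.70):
    """Trigger a compaction pass once usage crosses the threshold."""
    return used_tokens / context_window >= threshold
```

With the default 70% threshold, a 128k-context model would compact at roughly 89,600 tokens of usage; lowering the threshold to 50% moves that point to 64,000 tokens.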

Every model in QARK’s registry carries speed and intelligence ratings. When latency matters more than peak reasoning:

  • Models rated 4–5 stars for speed typically respond in under 2 seconds for short prompts.
  • Models rated 1–2 stars for speed may take 5–15 seconds but produce stronger reasoning.
  • For multi-step Flows or agent chains, a speed-rated model at each step compounds time savings.

Check model ratings in the model picker — hover over any model to see speed, intelligence, context window, and pricing.

Local models via Ollama or LM Studio eliminate network round-trips. They outperform cloud on latency when:

  • The model is small (7B–14B parameters). Larger local models lose the latency advantage unless you have high-end hardware.
  • Network conditions are poor. VPN, satellite, or mobile hotspot connections add 200–500ms per request.
  • Privacy requirements prohibit cloud. Local inference keeps all data on-device with zero cost.

What to expect from local hardware:

  • With a GPU: A 7B model on a modern discrete GPU generates 40–80 tokens/second — comparable to cloud.
  • CPU-only: 5–15 tokens/second. Acceptable for compaction and auto-naming, not for primary use.
  • VRAM limits: If the model does not fit in VRAM, it spills to system RAM and performance degrades sharply.
| Task | Why Local Works |
| --- | --- |
| Conversation auto-naming | Fast, cheap, quality doesn’t matter much |
| Context compaction | Summarization at zero cost |
| Simple transforms (grammar, formatting) | Low latency, no API cost |
| Privacy-sensitive content | Never leaves your machine |

The RAG threshold (default: 30%) determines when documents are injected directly vs. processed through the full vector pipeline:

  • Documents smaller than the threshold % of context window → direct injection (fast, no embedding needed)
  • Documents larger → full RAG pipeline (chunking, embedding, semantic search)

Raise the threshold to inject more documents directly (faster, but uses more context). Lower it to route more through RAG (preserves context space, better for large document sets).
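
The routing decision reduces to comparing document size against a fraction of the context window. A sketch, with the function name assumed:

```python
# Illustrative direct-injection vs. RAG routing decision (not QARK's code).

def route_document(doc_tokens, context_window, rag_threshold=0.30):
    """Small docs go straight into context; large ones go through RAG."""
    if doc_tokens <= context_window * rag_threshold:
        return "direct_injection"   # fast, no embedding needed
    return "rag_pipeline"           # chunk, embed, semantic search
```

For a 128k-context model at the default 30% threshold, anything up to about 38,400 tokens is injected directly.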

Embedding dimensionality trades retrieval speed and storage against fidelity:

| Dimensions | Speed | Accuracy | Trade-off |
| --- | --- | --- | --- |
| 256–512 | Fastest | Good for narrow domains | Low storage, fast retrieval |
| 768–1024 | Moderate | Strong general-purpose | Best balance for most workloads |
| 1536–3072 | Slower | Highest fidelity | Use for precision-critical retrieval |
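
To size an index, a back-of-envelope calculation assuming float32 vectors (4 bytes per dimension; actual storage depends on the vector store):

```python
# Rough index size estimate for a given dimension count.

def index_size_mb(num_chunks, dimensions, bytes_per_dim=4):
    """Raw vector storage in megabytes, ignoring metadata overhead."""
    return num_chunks * dimensions * bytes_per_dim / 1e6

# 100k chunks at 512 dims is about 205 MB; at 3072 dims, about 1.2 GB.
```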

A reranker re-scores retrieved chunks using a cross-encoder model. QARK over-fetches 3x candidates, then the reranker picks the best.

  • No reranker — Fastest. Acceptable when the document set is small and embedding quality is high.
  • With reranker — Adds latency but significantly improves precision. Worth it for large document sets or when retrieval accuracy is critical.

Available rerankers: Cohere, Jina AI, Voyage AI.
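
The over-fetch-then-rerank flow can be sketched as below, with the vector search and cross-encoder scoring backends stubbed out as plain callables (these are illustrative stand-ins, not QARK's interfaces):

```python
# Sketch of over-fetch + rerank: retrieve 3x the candidates by vector
# similarity, then let a cross-encoder pick the top k.

def retrieve(query, k, vector_search, cross_encoder_score=None):
    if cross_encoder_score is None:
        return vector_search(query, k)            # no reranker: fastest path
    candidates = vector_search(query, 3 * k)      # over-fetch 3x candidates
    ranked = sorted(candidates,
                    key=lambda chunk: cross_encoder_score(query, chunk),
                    reverse=True)
    return ranked[:k]                             # keep only the best k
```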

Query strategies control how the search query itself is formed before retrieval:

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Semantic | Standard vector similarity search | Default. Works for most queries. |
| HyDE | Generates a hypothetical answer first, embeds that, searches | Better for factual questions where the query phrasing differs from the answer phrasing |
| Step-back | Generates a broader abstract query, searches both | Better for abstract or conceptual questions |
| Auto | QARK picks the best strategy | Let the system decide |
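
How these strategies transform the query before embedding can be sketched with the model call stubbed out (`generate` is a placeholder for an LLM call, not a QARK API; the prompts are illustrative):

```python
# Illustrative query transformation per strategy (not QARK's code).

def build_search_queries(query, strategy, generate):
    """Return the list of texts to embed and search with."""
    if strategy == "semantic":
        return [query]                              # embed the query as-is
    if strategy == "hyde":
        # Embed a hypothetical answer, which often lands closer to real
        # answers in embedding space than the question itself does.
        return [generate(f"Write a short answer to: {query}")]
    if strategy == "step_back":
        broader = generate(f"Restate this as a broader question: {query}")
        return [query, broader]                     # search both queries
    raise ValueError(f"unhandled strategy: {strategy}")
```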

Set token_budget as the context strategy with a fixed token limit:

  • 4,000–8,000 tokens — Routine tasks, low cost per turn
  • 16,000–32,000 tokens — Research and analysis with moderate history
  • 64,000+ — Use auto_compact instead for better history management

The budget caps the context sent per turn, not total conversation cost.
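
A cap of this kind amounts to dropping the oldest messages until the history fits. A sketch, with an assumed count_tokens helper standing in for real tokenization:

```python
# Illustrative token_budget trimming (not QARK's implementation).

def apply_token_budget(messages, budget, count_tokens):
    """Drop oldest messages until the per-turn context fits the budget."""
    trimmed = list(messages)
    while trimmed and sum(count_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)            # drop the oldest message first
    return trimmed
```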

Each tool invocation adds a round-trip. QARK’s tool turn limits:

| Scenario | Tool Turns |
| --- | --- |
| Default (no tools or standard tools) | 10 |
| With unix commands | 20 |
| With agent-tools or MCP tools | 50 |

Lower the tool turn limit on leaf agents with bounded work: an orchestrator dispatching 8 sub-agents may need 50 turns, while a formatting agent needs 2.
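
The limit behaves like a loop bound on tool dispatch. An illustrative sketch, not QARK's agent loop:

```python
# Sketch of a tool-turn cap: the agent loop stops dispatching tools
# once the per-scenario limit is hit.

def run_agent(step, max_tool_turns=10):
    """`step` returns a tool call to execute, or None when done."""
    turns = 0
    while turns < max_tool_turns:
        tool_call = step()
        if tool_call is None:
            return turns          # agent finished within the limit
        tool_call()               # execute the tool round-trip
        turns += 1
    return turns                  # limit reached; the loop is cut off
```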

To keep costs down:

  • Cheap models for iteration, frontier for final. Draft with a fast model, then switch to a reasoning model for the final pass.
  • Local models for zero-cost tasks. Compaction, auto-naming, and simple transforms cost nothing with Ollama.
  • none strategy for Sparks and Flows. Stateless execution avoids history accumulation.
  • Monitor the budget dashboard. Per-provider spending breakdown with model-level detail shows where your budget goes.

Two factors affect perceived speed:

  1. Time to first token (TTFT) — Determined by the provider and model. Smaller models and edge-deployed providers have lower TTFT.
  2. Tool execution gaps — When the model invokes a tool, streaming pauses until the tool returns. Faster tool backends (local file access vs. remote API) minimize gaps.

QARK renders streamed content with 80ms throttled rendering and deferred syntax highlighting via requestAnimationFrame — the UI stays responsive even during high-throughput streams.
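
The throttling idea, sketched outside the browser (a Python stand-in for illustration with timestamps injected explicitly; QARK's actual renderer uses requestAnimationFrame):

```python
# Sketch of throttled rendering: buffer streamed tokens and flush to
# the UI at most once per interval.

class ThrottledRenderer:
    def __init__(self, render, interval_ms=80):
        self.render = render
        self.interval = interval_ms
        self.buffer = []
        self.last_flush = float("-inf")

    def feed(self, token, now_ms):
        self.buffer.append(token)
        if now_ms - self.last_flush >= self.interval:
            self.render("".join(self.buffer))   # one paint per interval
            self.buffer.clear()
            self.last_flush = now_ms
```

Batching paints this way keeps per-token UI work constant no matter how fast the model streams.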