Performance Tuning
QARK gives you direct control over the tradeoffs between speed, cost, and output quality. This guide covers the tuning levers that matter most.
Context Strategy
The context strategy determines how QARK manages conversation history as it grows. Choose a strategy per conversation in the Config tab, or set a default per agent.
| Strategy | Speed | Cost | Best For |
|---|---|---|---|
| `none` | Fastest | Zero overhead (no history) | Sparks, Flows, stateless tool agents |
| `token_budget` | Fast | Predictable ceiling | Cost-sensitive workflows with a hard token cap |
| `last_n` | Fast | Predictable | Simple conversations where only recent messages matter |
| `first_n` | Fast | Predictable | Preserving initial system context |
| `all` | Variable | Grows linearly | Short conversations under 10 turns |
| `auto_compact` | Moderate | Moderate | Long-running sessions that need history access |
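The fixed-window strategies reduce to simple slices over the message list. A minimal sketch of the selection logic, with illustrative names (not QARK's internals):

```typescript
type Message = { role: string; content: string };

// Hypothetical sketch of how the fixed-window strategies could select
// history to send each turn.
function selectHistory(
  messages: Message[],
  strategy: "none" | "last_n" | "first_n" | "all",
  n = 10,
): Message[] {
  switch (strategy) {
    case "none":
      return []; // stateless: no history sent
    case "last_n":
      return messages.slice(-n); // only the most recent n messages
    case "first_n":
      return messages.slice(0, n); // preserve the initial context
    case "all":
      return messages; // grows linearly with conversation length
  }
}
```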
Tune Auto-Compact
`auto_compact` triggers a compaction pass when context usage crosses a threshold (default: 70% of the model’s context window). The compaction model summarizes older messages, freeing space for new content.
- Threshold — Raise to 85% for fewer compaction passes. Lower to 50% for tighter context windows and faster turns.
- Content overflow strategy — Choose between `truncate` (fast, drops oldest content) and `summarize` (preserves meaning, costs tokens).
- Compaction model — Set in Settings → Sparks & Flows → Default compaction model. A fast, inexpensive model works here — compaction quality matters less than speed. A local model (Ollama) makes compaction essentially free.
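The trigger condition itself is simple. A sketch of the documented behavior (the function name is hypothetical):

```typescript
// Compaction fires once context usage crosses the configured fraction of
// the model's context window. Default threshold is 70%.
function shouldCompact(
  usedTokens: number,
  contextWindow: number,
  threshold = 0.70, // raise to 0.85 for fewer passes; lower to 0.50 for faster turns
): boolean {
  return usedTokens >= contextWindow * threshold;
}
```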
Pick Models by Speed
Every model in QARK’s registry carries speed and intelligence ratings. When latency matters more than peak reasoning:
- Models rated 4–5 stars for speed typically respond in under 2 seconds for short prompts.
- Models rated 1–2 stars for speed may take 5–15 seconds but produce stronger reasoning.
- For multi-step Flows or agent chains, a speed-rated model at each step compounds time savings.
Check model ratings in the model picker — hover over any model to see speed, intelligence, context window, and pricing.
Local Models
Local models via Ollama or LM Studio eliminate network round-trips. They outperform cloud models on latency when:
- The model is small (7B–14B parameters). Larger local models lose the latency advantage unless you have high-end hardware.
- Network conditions are poor. VPN, satellite, or mobile hotspot connections add 200–500ms per request.
- Privacy requirements prohibit cloud. Local inference keeps all data on-device with zero cost.
GPU Acceleration
- With GPU: A 7B model on a modern discrete GPU generates 40–80 tokens/second — comparable to cloud.
- CPU-only: 5–15 tokens/second. Acceptable for compaction and auto-naming, not for primary use.
- VRAM limits: If the model does not fit in VRAM, it spills to system RAM and performance degrades sharply.
Best Local Model Uses
| Task | Why Local Works |
|---|---|
| Conversation auto-naming | Fast, cheap, quality doesn’t matter much |
| Context compaction | Summarization at zero cost |
| Simple transforms (grammar, formatting) | Low latency, no API cost |
| Privacy-sensitive content | Never leaves your machine |
RAG Parameter Tuning
Threshold %
The RAG threshold (default: 30%) determines when documents are injected directly vs. processed through the full vector pipeline:
- Documents smaller than the threshold % of context window → direct injection (fast, no embedding needed)
- Documents larger → full RAG pipeline (chunking, embedding, semantic search)
Raise the threshold to inject more documents directly (faster, but uses more context). Lower it to route more through RAG (preserves context space, better for large document sets).
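The routing rule can be sketched as follows (function name and types are illustrative, not a QARK API):

```typescript
type Route = "direct_injection" | "rag_pipeline";

// Documents at or under threshold% of the context window are injected
// directly; larger documents go through chunking, embedding, and search.
function routeDocument(
  docTokens: number,
  contextWindow: number,
  thresholdPct = 30, // QARK default
): Route {
  const limit = contextWindow * (thresholdPct / 100);
  return docTokens <= limit ? "direct_injection" : "rag_pipeline";
}
```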
Embedding Model
| Dimensions | Speed | Accuracy | Trade-off |
|---|---|---|---|
| 256–512 | Fastest | Good for narrow domains | Low storage, fast retrieval |
| 768–1024 | Moderate | Strong general-purpose | Best balance for most workloads |
| 1536–3072 | Slower | Highest fidelity | Use for precision-critical retrieval |
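The storage side of the dimension trade-off is easy to estimate: float32 embeddings cost 4 bytes per dimension per chunk. A back-of-envelope helper (generic arithmetic, not a QARK API):

```typescript
// Raw vector storage for an index of float32 embeddings:
// 4 bytes per dimension, per chunk.
function indexSizeBytes(numChunks: number, dimensions: number): number {
  return numChunks * dimensions * 4;
}
```

For 100,000 chunks, 768 dimensions costs roughly 307 MB of raw vector storage, while 3072 dimensions costs roughly 1.23 GB — a 4x increase for the same corpus, before any index overhead.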
Reranker
A reranker re-scores retrieved chunks using a cross-encoder model. QARK over-fetches 3x candidates, then the reranker picks the best.
- No reranker — Fastest. Acceptable when the document set is small and embedding quality is high.
- With reranker — Adds latency but significantly improves precision. Worth it for large document sets or when retrieval accuracy is critical.
Available rerankers: Cohere, Jina AI, Voyage AI.
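The over-fetch-then-rerank flow can be sketched as follows; `vectorSearch` and `rerankScore` are injected stand-ins for the real search index and the reranker API (Cohere, Jina AI, or Voyage AI), not actual QARK functions:

```typescript
type Chunk = { id: string; text: string };

// Fetch 3x the requested chunks by vector similarity, re-score each with
// a cross-encoder, and keep the best k.
function retrieveWithRerank(
  query: string,
  k: number,
  vectorSearch: (q: string, limit: number) => Chunk[],
  rerankScore: (q: string, c: Chunk) => number,
): Chunk[] {
  const candidates = vectorSearch(query, k * 3); // over-fetch 3x
  return candidates
    .map((c) => ({ c, score: rerankScore(query, c) }))
    .sort((a, b) => b.score - a.score) // best cross-encoder score first
    .slice(0, k)
    .map((s) => s.c);
}
```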
Search Strategies
| Strategy | How It Works | When to Use |
|---|---|---|
| Semantic | Standard vector similarity search | Default. Works for most queries. |
| HyDE | Generates a hypothetical answer first, embeds that, searches | Better for factual questions where the query phrasing differs from the answer phrasing |
| Step-back | Generates a broader abstract query, searches both | Better for abstract or conceptual questions |
| Auto | QARK picks the best strategy | Let the system decide |
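Of these, HyDE is the least obvious: an answer-shaped text embeds closer to the documents containing the real answer than the raw question does. A minimal sketch, where every parameter is an injected stand-in rather than a real QARK function:

```typescript
// HyDE: embed a hypothetical answer instead of the raw query, then search
// with that embedding.
function hydeSearch(
  query: string,
  generateAnswer: (q: string) => string, // small LLM drafts a guess
  embed: (text: string) => number[],
  searchByVector: (v: number[]) => string[],
): string[] {
  const hypothetical = generateAnswer(query); // may be wrong; only its phrasing matters
  return searchByVector(embed(hypothetical));
}
```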
Token Cost Control
Token Budget Strategy
Set `token_budget` as the context strategy with a fixed token limit:
- 4,000–8,000 tokens — Routine tasks, low cost per turn
- 16,000–32,000 tokens — Research and analysis with moderate history
- 64,000+ tokens — Use `auto_compact` instead for better history management
The budget caps the context sent per turn, not total conversation cost.
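As a sketch, a token-budget trim can walk backward from the newest message, keeping turns until the cap is reached (illustrative names, not QARK's implementation):

```typescript
type Msg = { content: string; tokens: number };

// Keep the newest messages that fit inside the budget; drop the rest.
function fitToBudget(messages: Msg[], budget: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (used + messages[i].tokens > budget) break;
    kept.unshift(messages[i]); // preserve chronological order
    used += messages[i].tokens;
  }
  return kept;
}
```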
Tool Turn Limits
Each tool invocation adds a round-trip. QARK’s tool turn limits:
| Scenario | Tool Turns |
|---|---|
| Default (no tools or standard tools) | 10 |
| With unix commands | 20 |
| With agent-tools or MCP tools | 50 |
Lower tool turn limits on leaf agents that have bounded work. An orchestrator needs 50 turns to dispatch 8 sub-agents; a formatting agent needs 2.
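A sketch of how such a cap might bound the agent loop, using the limits from the table (names are illustrative):

```typescript
// Per-scenario tool turn limits from the table above.
const TOOL_TURN_LIMITS = {
  default: 10,
  unix: 20,
  agent_or_mcp: 50,
} as const;

// Run tool steps until the task finishes or the turn budget is exhausted.
function runToolLoop(
  limit: number,
  step: () => { done: boolean },
): { turns: number; hitLimit: boolean } {
  let turns = 0;
  while (turns < limit) {
    turns++;
    if (step().done) return { turns, hitLimit: false };
  }
  return { turns, hitLimit: true }; // budget exhausted before completion
}
```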
Cost-Saving Strategies
- Cheap models for iteration, frontier for final. Draft with a fast model, then switch to a reasoning model for the final pass.
- Local models for zero-cost tasks. Compaction, auto-naming, and simple transforms cost nothing with Ollama.
- `none` strategy for Sparks and Flows. Stateless execution avoids history accumulation.
- Monitor the budget dashboard. Per-provider spending breakdown with model-level detail shows where your budget goes.
Streaming Optimization
Two factors affect perceived speed:
- Time to first token (TTFT) — Determined by the provider and model. Smaller models and edge-deployed providers have lower TTFT.
- Tool execution gaps — When the model invokes a tool, streaming pauses until the tool returns. Faster tool backends (local file access vs. remote API) minimize gaps.
QARK renders streamed content with 80ms throttled rendering and deferred syntax highlighting via requestAnimationFrame — the UI stays responsive even during high-throughput streams.
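The batching half of that pipeline can be sketched as a time-based throttle. In this sketch `now` is injected for testability, and the requestAnimationFrame highlighting pass is omitted; names are illustrative, not QARK's actual renderer:

```typescript
// Buffer incoming stream tokens and flush to the UI at most once per
// interval (80ms here), so rendering cost stays bounded during
// high-throughput streams.
class ThrottledRenderer {
  private buffer = "";
  private lastFlush = 0;

  constructor(
    private render: (text: string) => void,
    private intervalMs = 80,
  ) {}

  push(token: string, now: number): void {
    this.buffer += token;
    if (now - this.lastFlush >= this.intervalMs) this.flush(now);
  }

  flush(now: number): void {
    if (this.buffer) this.render(this.buffer);
    this.buffer = "";
    this.lastFlush = now;
  }
}
```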