Skip to content
Download for Mac

Context Strategies

Every LLM has a finite context window — 4K, 8K, 128K, 200K, or 1M tokens depending on the model. How you fill that window determines the quality of every response. Send too much irrelevant history and the model loses focus. Send too little and it loses continuity.

QARK lets you choose a context strategy that controls exactly which messages reach the model on every turn.

The default strategy for all conversations. It monitors token usage and compacts history when it approaches the window limit.

How it works:

  • A threshold is set at 70% of the model’s context window.
  • On every turn, the kernel counts the total tokens in the conversation history.
  • When the count exceeds the threshold, compaction triggers automatically.
  • A compaction model (configurable — can differ from the conversation model) summarizes the oldest messages into a compressed representation.
  • The compaction is recursive: if the first pass still exceeds the threshold, it runs again on the next oldest batch.
  • The kernel emits a context_compacting event, and a visual divider appears in the conversation UI marking where compaction occurred.

Content overflow sub-options:

OptionBehaviorTradeoff
truncateDrops oldest messages beyond the thresholdFast, no extra API call, loses detail
summarizeCompaction model summarizes oldest messagesPreserves meaning, costs an additional inference call

The compaction model is configurable at two levels: per-conversation (in conversation settings) or globally (in Settings → Agent Defaults). A smaller, cheaper model often works well for compaction — it does not need to be the same model generating responses.

When to use it: Long-running research conversations, multi-session projects, any conversation that might exceed the context window over time.

Retains only the last N message pairs (user + assistant). Everything older is discarded before the request reaches the model.

Example: With last_n set to 5, a conversation with 20 exchanges sends only the most recent 5 user-assistant pairs (10 messages total) plus the system prompt.

When to use it: Conversations where only recent context matters — iterating on a code snippet, refining a paragraph, troubleshooting a specific error.

Retains only the first N message pairs. Everything after is discarded.

When to use it: Workflows where the initial instructions or context are critical and later messages are disposable. Useful when the first few messages establish a detailed specification and subsequent messages are incremental requests against that spec.

Sends the entire conversation history on every turn. No filtering, no compaction.

When to use it: Short conversations where you are confident the total token count will stay within the model’s context window. Useful for conversations with models that have large context windows (128K+) when the conversation will remain brief.

Risk: If the conversation exceeds the model’s context window, the request will fail with a token limit error. There is no automatic fallback.

Sends zero history. Every turn is stateless — only the system prompt and the current user message reach the model.

This is the default strategy for Sparks and Flows, where each execution is intentionally independent.

When to use it: One-off transformations, stateless utilities, Sparks, Flows, and any workflow where prior conversation context would cause interference rather than help.

Sets a hard token limit on the total history sent. The kernel fills the budget starting from the most recent messages and working backward. When the budget is exhausted, older messages are excluded.

Example: With a token budget of 4000, the kernel includes messages from newest to oldest until the running token count hits 4000. The remaining older messages are excluded entirely.

When to use it: Cost control on expensive models, strict latency requirements (fewer input tokens = faster time to first token), or when you want predictable per-turn costs.

All context algorithms

Every conversation displays a context bar — a visual progress indicator showing how much of the model’s context window is currently consumed.

Fill levelColorMeaning
Below 50%GreenPlenty of headroom
50% – 80%YellowApproaching the threshold
Above 80%RedNear or past the compaction threshold

Additional context indicators:

  • Context warning badges appear on the conversation tab when usage exceeds 80%.
  • A Compact button in the conversation toolbar lets you trigger manual compaction at any time, regardless of the automatic threshold.
  • Token counts update in real time as the model streams its response.
ScenarioRecommended strategyWhy
Long research sessionauto_compactPreserves continuity across dozens of turns without exceeding the window
Quick code iterationlast_n (3–5 pairs)Only recent context matters; keeps requests fast and cheap
Specification-first workflowfirst_nPreserves the detailed spec; later back-and-forth is disposable
Short Q&A (under 20 turns)allNo overhead, full fidelity
Spark or Flow executionnoneStateless by design; history would cause interference
Strict cost budgettoken_budgetPredictable input costs on every turn

You can change the context strategy at any point during a conversation. The new strategy applies starting from the next message — it does not retroactively alter what was already sent.