Context Strategies

Why context strategy matters

Every LLM has a finite context window — 4K, 8K, 128K, 200K, or 1M tokens depending on the model. How you fill that window determines the quality of every response. Send too much irrelevant history and the model loses focus. Send too little and it loses continuity.

QARK lets you choose a context strategy that controls exactly which messages reach the model on every turn.

The six strategies

1. auto_compact (default)

The default strategy for all conversations. It monitors token usage and compacts history when it approaches the window limit.

How it works:

A threshold is set at 70% of the model’s context window.
On every turn, the kernel counts the total tokens in the conversation history.
When the count exceeds the threshold, compaction triggers automatically.
A compaction model (configurable — can differ from the conversation model) summarizes the oldest messages into a compressed representation.
The compaction is recursive: if the first pass still exceeds the threshold, it runs again on the next oldest batch.
The kernel emits a context_compacting event, and a visual divider appears in the conversation UI marking where compaction occurred.

Content overflow sub-options:

Option	Behavior	Tradeoff
`truncate`	Drops oldest messages beyond the threshold	Fast, no extra API call, loses detail
`summarize`	Compaction model summarizes oldest messages	Preserves meaning, costs an additional inference call

The compaction model is configurable at two levels: per-conversation (in conversation settings) or globally (in Settings → Agent Defaults). A smaller, cheaper model often works well for compaction — it does not need to be the same model generating responses.

When to use it: Long-running research conversations, multi-session projects, any conversation that might exceed the context window over time.

2. last_n

Retains only the last N message pairs (user + assistant). Everything older is discarded before the request reaches the model.

Example: With last_n set to 5, a conversation with 20 exchanges sends only the most recent 5 user-assistant pairs (10 messages total) plus the system prompt.

When to use it: Conversations where only recent context matters — iterating on a code snippet, refining a paragraph, troubleshooting a specific error.

3. first_n

Retains only the first N message pairs. Everything after is discarded.

When to use it: Workflows where the initial instructions or context are critical and later messages are disposable. Useful when the first few messages establish a detailed specification and subsequent messages are incremental requests against that spec.

4. all

Sends the entire conversation history on every turn. No filtering, no compaction.

When to use it: Short conversations where you are confident the total token count will stay within the model’s context window. Useful for conversations with models that have large context windows (128K+) when the conversation will remain brief.

Risk: If the conversation exceeds the model’s context window, the request will fail with a token limit error. There is no automatic fallback.

5. none

Sends zero history. Every turn is stateless — only the system prompt and the current user message reach the model.

This is the default strategy for Sparks and Flows, where each execution is intentionally independent.

When to use it: One-off transformations, stateless utilities, Sparks, Flows, and any workflow where prior conversation context would cause interference rather than help.

6. token_budget

Sets a hard token limit on the total history sent. The kernel fills the budget starting from the most recent messages and working backward. When the budget is exhausted, older messages are excluded.

Example: With a token budget of 4000, the kernel includes messages from newest to oldest until the running token count hits 4000. The remaining older messages are excluded entirely.

When to use it: Cost control on expensive models, strict latency requirements (fewer input tokens = faster time to first token), or when you want predictable per-turn costs.

The context bar

Every conversation displays a context bar — a visual progress indicator showing how much of the model’s context window is currently consumed.

Fill level	Color	Meaning
Below 50%	Green	Plenty of headroom
50% – 80%	Yellow	Approaching the threshold
Above 80%	Red	Near or past the compaction threshold

Additional context indicators:

Context warning badges appear on the conversation tab when usage exceeds 80%.
A Compact button in the conversation toolbar lets you trigger manual compaction at any time, regardless of the automatic threshold.
Token counts update in real time as the model streams its response.

Choose the right strategy

Scenario	Recommended strategy	Why
Long research session	`auto_compact`	Preserves continuity across dozens of turns without exceeding the window
Quick code iteration	`last_n` (3–5 pairs)	Only recent context matters; keeps requests fast and cheap
Specification-first workflow	`first_n`	Preserves the detailed spec; later back-and-forth is disposable
Short Q&A (under 20 turns)	`all`	No overhead, full fidelity
Spark or Flow execution	`none`	Stateless by design; history would cause interference
Strict cost budget	`token_budget`	Predictable input costs on every turn

You can change the context strategy at any point during a conversation. The new strategy applies starting from the next message — it does not retroactively alter what was already sent.