# Local Providers
QARK supports two local providers: Ollama and LM Studio. Both run models on your machine — no API key, no cloud dependency, no per-token cost. Data never leaves your hardware.
## When to run locally vs cloud

| Factor | Local | Cloud |
|---|---|---|
| Privacy | Complete — data stays on your machine | Governed by provider’s data policy |
| Cost | Zero per-token (electricity + hardware only) | Pay-per-token |
| Latency | No network round-trip, but speed depends on your GPU | Network latency, but fast generation on datacenter GPUs |
| Model size | Limited by VRAM/RAM — practical ceiling ~70B params | Frontier models (200B+) |
| Setup | Install runtime + download models | Paste an API key |
Use local when: you need air-gapped privacy, want zero cost for high-volume tasks, or are working with open-source models.
Use cloud when: you need frontier capability, faster generation, or models larger than your hardware supports.
## Hardware requirements

Model size determines VRAM (GPU) or RAM (CPU fallback) requirements:
| Model size | VRAM needed (GPU) | RAM needed (CPU) | Example models |
|---|---|---|---|
| 1B–3B | ~2 GB | ~4 GB | Llama 3.2 1B, Phi-3 Mini |
| 7B–8B | ~4 GB | ~8 GB | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 13B–14B | ~8 GB | ~16 GB | Llama 2 13B, Qwen 2.5 14B |
| 30B–34B | ~20 GB | ~36 GB | DeepSeek Coder 33B, Command R 35B |
| 70B | ~40 GB | ~64 GB | Llama 3.3 70B, Qwen 2.5 72B |
GPU acceleration significantly improves generation speed:
- NVIDIA CUDA — Best ecosystem support on Linux and Windows. Most Ollama and LM Studio models auto-detect CUDA.
- Apple Metal — Native acceleration on Apple Silicon Macs (M1/M2/M3/M4). Both Ollama and LM Studio leverage Metal automatically.
- AMD ROCm — Supported on Linux with compatible AMD GPUs.
CPU-only inference works but generates tokens 5–20x slower than GPU-accelerated inference for the same model.
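The VRAM figures in the table follow a simple rule of thumb: parameter count times bytes per parameter at a given quantization, plus overhead for the KV cache and activations. A rough sketch (the 1.2 overhead factor is an assumption; actual usage varies with context length and runtime):

```sh
# Rough VRAM estimate: parameters (billions) x bytes per parameter,
# plus ~20% overhead for KV cache and activations (assumed factor).
# Bytes per parameter: ~0.5 for 4-bit, ~1 for 8-bit, ~2 for fp16.
estimate_vram_gb() {
  awk -v params_b="$1" -v bpp="$2" 'BEGIN { printf "%.1f\n", params_b * bpp * 1.2 }'
}

estimate_vram_gb 8 0.5    # 8B model, 4-bit quantization -> ~4.8 GB
estimate_vram_gb 70 0.5   # 70B model, 4-bit quantization -> ~42.0 GB
```

These estimates line up with the table above for 4-bit quantization; higher-precision quantizations need proportionally more memory.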
## Ollama

Download: ollama.com
Default endpoint: http://localhost:11434
Categories: Chat, Embedding
Open-source local model runtime for macOS, Linux, and Windows. Manages model downloads, quantization, and serving through a single CLI.
### Install Ollama

macOS / Linux:

```sh
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

Download the installer from ollama.com/download.

After installation, verify Ollama is running:

```sh
ollama --version
```

### Pull models
Download models using the ollama pull command:

```sh
# Pull a chat model
ollama pull llama3.2

# Pull a larger model
ollama pull llama3.3:70b

# Pull an embedding model
ollama pull nomic-embed-text
```

Each model downloads once and is stored locally. Ollama handles quantization variants automatically; specify a tag like :70b or :q4_0 to select a specific size or quantization level.
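You can confirm which models are installed with `ollama list`, or by querying the REST API's list-models route directly. A minimal check against the default endpoint:

```sh
# List locally installed models via Ollama's /api/tags route.
# Prints a fallback message if the server is not running.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

if tags=$(curl -sf "$OLLAMA_URL/api/tags" 2>/dev/null); then
  # Each installed model appears as a "name" field in the JSON response.
  echo "$tags" | grep -o '"name":"[^"]*"'
  ollama_status="up"
else
  echo "Ollama is not reachable at $OLLAMA_URL"
  ollama_status="down"
fi
```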
### Connect QARK to Ollama

1. Confirm Ollama is running (`ollama serve` or the system tray icon on macOS/Windows).
2. In QARK, open Settings > Providers > Ollama.
3. The default endpoint `http://localhost:11434` is pre-filled. Adjust it if you run Ollama on a different port or a remote machine.
4. QARK auto-detects the connection and loads your installed model list.
No API key is needed. If Ollama is running, QARK connects.
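To sanity-check the endpoint outside QARK, you can send a one-shot, non-streaming generation request. A sketch against the default endpoint (the model name is an example; substitute one you have pulled):

```sh
# Smoke-test Ollama's /api/generate route with a non-streaming request.
# "llama3.2" is an example model name -- use one you have pulled.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

payload='{"model": "llama3.2", "prompt": "Say hello in one word.", "stream": false}'

if reply=$(curl -sf "$OLLAMA_URL/api/generate" -d "$payload" 2>/dev/null); then
  echo "$reply"
else
  echo "Request failed: is Ollama running and the model pulled?"
fi
```

If this returns a JSON response, QARK will be able to reach the same endpoint.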
## LM Studio

Download: lmstudio.ai
Default endpoint: http://localhost:1234/v1
Categories: Chat
Desktop application for discovering, downloading, and running local models with a graphical interface. Provides an OpenAI-compatible local server that QARK connects to.
### Install LM Studio

1. Download LM Studio from lmstudio.ai (available for macOS, Linux, and Windows).
2. Launch the application.
3. Browse the model catalog and download a model (LM Studio handles quantization selection through its UI).
### Start the local server

1. In LM Studio, navigate to the Local Server tab.
2. Load a downloaded model into the server.
3. Click Start Server. The default endpoint is `http://localhost:1234/v1`.
The server exposes an OpenAI-compatible API — QARK communicates with it using the same protocol as cloud providers.
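Because the protocol is OpenAI-compatible, any OpenAI-style client can talk to the server. A minimal curl sketch against the default endpoint (the `model` value is a placeholder; LM Studio generally serves whichever model is loaded):

```sh
# Send an OpenAI-style chat completion to the LM Studio local server.
# The "model" field is a placeholder; LM Studio serves the loaded model.
LMSTUDIO_URL="${LMSTUDIO_URL:-http://localhost:1234/v1}"

payload='{
  "model": "local-model",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7
}'

if reply=$(curl -sf "$LMSTUDIO_URL/chat/completions" \
             -H "Content-Type: application/json" \
             -d "$payload" 2>/dev/null); then
  echo "$reply"
else
  echo "Request failed: is the LM Studio server running with a model loaded?"
fi
```

The same request shape works against any OpenAI-compatible provider, which is why QARK can treat the local server like a cloud one.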
### Connect QARK to LM Studio

1. Confirm the LM Studio server is running with a model loaded.
2. In QARK, open Settings > Providers > LM Studio.
3. The default endpoint `http://localhost:1234/v1` is pre-filled. Adjust it if needed.
4. QARK detects the connection and shows the loaded model.
No API key needed. Switch models by loading a different one in LM Studio’s server — QARK picks up the change on the next refresh.
## Mixing local and cloud

QARK treats local and cloud providers identically in the model picker. You can:
- Use a local model for drafting and a cloud model for final review in the same conversation.
- Route embedding to a local Ollama model and chat to a cloud provider to minimize API costs.
- Run fully offline with Ollama or LM Studio for both chat and embedding — no internet required.
Switch between local and cloud at any point using the model picker.