# Local Providers
QARK supports two local providers: Ollama and LM Studio. Both run models on your machine — no API key, no cloud dependency, no per-token cost. Data never leaves your hardware.
## When to run locally vs cloud

| Factor | Local | Cloud |
|---|---|---|
| Privacy | Complete — data stays on your machine | Governed by provider’s data policy |
| Cost | Zero per-token (electricity + hardware only) | Pay-per-token |
| Latency | No network round-trip, but speed depends on your GPU | Network latency, but fast generation on datacenter GPUs |
| Model size | Limited by VRAM/RAM — practical ceiling ~70B params | Frontier models (200B+) |
| Setup | Install runtime + download models | Paste an API key |
Use local when: you need air-gapped privacy, want zero cost for high-volume tasks, or are working with open-source models.
Use cloud when: you need frontier capability, faster generation, or models larger than your hardware supports.
## Hardware requirements

Model size determines VRAM (GPU) or RAM (CPU fallback) requirements:
| Model size | VRAM needed (GPU) | RAM needed (CPU) | Example models |
|---|---|---|---|
| 1B–3B | ~2 GB | ~4 GB | Llama 3.2 1B, Phi-3 Mini |
| 7B–8B | ~4 GB | ~8 GB | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 13B–14B | ~8 GB | ~16 GB | Llama 2 13B, Qwen 2.5 14B |
| 30B–34B | ~20 GB | ~36 GB | DeepSeek Coder 33B, Command R 35B |
| 70B | ~40 GB | ~64 GB | Llama 3.3 70B, Qwen 2.5 72B |
GPU acceleration significantly improves generation speed:
- NVIDIA CUDA — Best ecosystem support on Linux and Windows. Most Ollama and LM Studio models auto-detect CUDA.
- Apple Metal — Native acceleration on Apple Silicon Macs (M1/M2/M3/M4). Both Ollama and LM Studio leverage Metal automatically.
- AMD ROCm — Supported on Linux with compatible AMD GPUs.
CPU-only inference works but generates tokens 5–20x slower than GPU-accelerated inference for the same model.
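The VRAM figures in the table follow a simple rule of thumb: parameter count times bytes per parameter at a given quantization, plus overhead for the KV cache and activations. A rough sketch (the 1.2 overhead factor is an assumption; actual usage varies with context length and runtime):

```sh
# Rough VRAM estimate: parameters (billions) x bytes per parameter,
# plus ~20% overhead for KV cache and activations (assumed factor).
# Bytes per parameter: ~0.5 for 4-bit, ~1 for 8-bit, ~2 for fp16.
estimate_vram_gb() {
  awk -v params_b="$1" -v bpp="$2" 'BEGIN { printf "%.1f\n", params_b * bpp * 1.2 }'
}

estimate_vram_gb 8 0.5    # 8B model, 4-bit quantization -> ~4.8 GB
estimate_vram_gb 70 0.5   # 70B model, 4-bit quantization -> ~42.0 GB
```

These estimates line up with the table above for 4-bit quantization; higher-precision quantizations need proportionally more memory.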
## Ollama

Download: ollama.com
Default endpoint: http://localhost:11434
Categories: Chat, Embedding
Open-source local model runtime for macOS, Linux, and Windows. Manages model downloads, quantization, and serving through a single CLI.
### Install Ollama

macOS / Linux:

```sh
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

Download the installer from ollama.com/download.

After installation, verify Ollama is running:

```sh
ollama --version
```

### Pull models
Download models using the ollama pull command:

```sh
# Pull a chat model
ollama pull llama3.2

# Pull a larger model
ollama pull llama3.3:70b

# Pull an embedding model
ollama pull nomic-embed-text
```

Each model downloads once and is stored locally. Ollama handles quantization variants automatically; specify a tag like :70b or :q4_0 to select a specific size or quantization level.
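You can confirm which models are installed with `ollama list`, or by querying the REST API's list-models route directly. A minimal check against the default endpoint:

```sh
# List locally installed models via Ollama's /api/tags route.
# Prints a fallback message if the server is not running.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

if tags=$(curl -sf "$OLLAMA_URL/api/tags" 2>/dev/null); then
  # Each installed model appears as a "name" field in the JSON response.
  echo "$tags" | grep -o '"name":"[^"]*"'
  ollama_status="up"
else
  echo "Ollama is not reachable at $OLLAMA_URL"
  ollama_status="down"
fi
```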
### Connect QARK to Ollama

1. Confirm Ollama is running (`ollama serve` or the system tray icon on macOS/Windows).
2. In QARK, open Settings > Providers > Ollama.
3. The default endpoint `http://localhost:11434` is pre-filled. Adjust it if you run Ollama on a different port or a remote machine.
4. QARK auto-detects the connection and loads your installed model list.
No API key is needed. If Ollama is running, QARK connects.
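To sanity-check the endpoint outside QARK, you can send a one-shot, non-streaming generation request. A sketch against the default endpoint (the model name is an example; substitute one you have pulled):

```sh
# Smoke-test Ollama's /api/generate route with a non-streaming request.
# "llama3.2" is an example model name -- use one you have pulled.
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"

payload='{"model": "llama3.2", "prompt": "Say hello in one word.", "stream": false}'

if reply=$(curl -sf "$OLLAMA_URL/api/generate" -d "$payload" 2>/dev/null); then
  echo "$reply"
else
  echo "Request failed: is Ollama running and the model pulled?"
fi
```

If this returns a JSON response, QARK will be able to reach the same endpoint.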
## LM Studio

Download: lmstudio.ai
Default endpoint: http://localhost:1234/v1
Categories: Chat
Desktop application for discovering, downloading, and running local models with a graphical interface. Provides an OpenAI-compatible local server that QARK connects to.
### Install LM Studio

1. Download LM Studio from lmstudio.ai (available for macOS, Linux, and Windows).
2. Launch the application.
3. Browse the model catalog and download a model (LM Studio handles quantization selection through its UI).
### Start the local server

1. In LM Studio, navigate to the Local Server tab.
2. Load a downloaded model into the server.
3. Click Start Server. The default endpoint is `http://localhost:1234/v1`.
The server exposes an OpenAI-compatible API — QARK communicates with it using the same protocol as cloud providers.
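Because the protocol is OpenAI-compatible, any OpenAI-style client can talk to the server. A minimal curl sketch against the default endpoint (the `model` value is a placeholder; LM Studio generally serves whichever model is loaded):

```sh
# Send an OpenAI-style chat completion to the LM Studio local server.
# The "model" field is a placeholder; LM Studio serves the loaded model.
LMSTUDIO_URL="${LMSTUDIO_URL:-http://localhost:1234/v1}"

payload='{
  "model": "local-model",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7
}'

if reply=$(curl -sf "$LMSTUDIO_URL/chat/completions" \
             -H "Content-Type: application/json" \
             -d "$payload" 2>/dev/null); then
  echo "$reply"
else
  echo "Request failed: is the LM Studio server running with a model loaded?"
fi
```

The same request shape works against any OpenAI-compatible provider, which is why QARK can treat the local server like a cloud one.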
### Connect QARK to LM Studio

1. Confirm the LM Studio server is running with a model loaded.
2. In QARK, open Settings > Providers > LM Studio.
3. The default endpoint `http://localhost:1234/v1` is pre-filled. Adjust it if needed.
4. QARK detects the connection and shows the loaded model.
No API key needed. Switch models by loading a different one in LM Studio’s server — QARK picks up the change on the next refresh.
## Mixing local and cloud

QARK treats local and cloud providers identically in the model picker. You can:
- Use a local model for drafting and a cloud model for final review in the same conversation.
- Route embedding to a local Ollama model and chat to a cloud provider to minimize API costs.
- Run fully offline with Ollama or LM Studio for both chat and embedding — no internet required.
Switch between local and cloud at any point using the model picker.