
Local Providers

QARK supports two local providers: Ollama and LM Studio. Both run models on your machine — no API key, no cloud dependency, no per-token cost. Data never leaves your hardware.

| Factor | Local | Cloud |
| --- | --- | --- |
| Privacy | Complete (data stays on your machine) | Governed by provider's data policy |
| Cost | Zero per-token (electricity + hardware only) | Pay-per-token |
| Latency | No network round-trip, but speed depends on your GPU | Network latency, but fast generation on datacenter GPUs |
| Model size | Limited by VRAM/RAM (practical ceiling ~70B params) | Frontier models (200B+) |
| Setup | Install runtime + download models | Paste an API key |

Use local when: you need air-gapped privacy, want zero cost for high-volume tasks, or are working with open-source models.

Use cloud when: you need frontier capability, faster generation, or models larger than your hardware supports.

Model size determines VRAM (GPU) or RAM (CPU fallback) requirements:

| Model size | VRAM needed (GPU) | RAM needed (CPU) | Example models |
| --- | --- | --- | --- |
| 1B–3B | ~2 GB | ~4 GB | Llama 3.2 1B, Phi-3 Mini |
| 7B–8B | ~4 GB | ~8 GB | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 13B–14B | ~8 GB | ~16 GB | Llama 2 13B, Qwen 2.5 14B |
| 30B–34B | ~20 GB | ~36 GB | DeepSeek Coder 33B, Command R 35B |
| 70B | ~40 GB | ~64 GB | Llama 3.3 70B, Qwen 2.5 72B |
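The table's figures can be approximated with a rule of thumb: parameters times bytes per weight (about 0.5 bytes per weight at 4-bit quantization), plus roughly 20% overhead for the KV cache and runtime. A sketch, using illustrative constants rather than measured values:

```shell
# Rough VRAM estimate in GB: params (billions) x bytes per weight x 1.2 overhead.
# These constants are approximations, not vendor-published requirements.
estimate_gb() {  # $1 = parameters in billions, $2 = bytes per weight (0.5 for 4-bit)
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f", p * b * 1.2 }'
}

estimate_gb 8 0.5    # 8B model at 4-bit -> about 4.8 GB
estimate_gb 70 0.5   # 70B model at 4-bit -> about 42 GB, close to the table's ~40 GB
```

Running a model at 8-bit quantization roughly doubles the estimate (pass 1.0 as the second argument).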

GPU acceleration significantly improves generation speed:

  • NVIDIA CUDA — Best ecosystem support on Linux and Windows. Most Ollama and LM Studio models auto-detect CUDA.
  • Apple Metal — Native acceleration on Apple Silicon Macs (M1/M2/M3/M4). Both Ollama and LM Studio leverage Metal automatically.
  • AMD ROCm — Supported on Linux with compatible AMD GPUs.

CPU-only inference works but generates tokens 5–20x slower than GPU-accelerated inference for the same model.


Ollama

Download: ollama.com
Default endpoint: http://localhost:11434
Categories: Chat, Embedding

Open-source local model runtime for macOS, Linux, and Windows. Manages model downloads, quantization, and serving through a single CLI.

macOS / Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download the installer from ollama.com/download.

After installation, verify Ollama is running:

ollama --version
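You can also confirm the server itself is reachable by querying the default endpoint directly; Ollama's /api/tags route returns the list of installed models as JSON:

```shell
# Probe the default Ollama endpoint. Prints a JSON model list when the
# server is up, and a short notice otherwise.
curl -s --max-time 2 http://localhost:11434/api/tags \
  || echo "Ollama server not reachable"
```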

Download models using the ollama pull command:

# Pull a chat model
ollama pull llama3.2
# Pull a larger model
ollama pull llama3.3:70b
# Pull an embedding model
ollama pull nomic-embed-text

Each model downloads once and is stored locally. Ollama handles quantization variants automatically — specify a tag like :70b or :q4_0 to select a specific size or quantization level.

To connect QARK to Ollama:

  1. Confirm Ollama is running (ollama serve or the system tray icon on macOS/Windows).
  2. In QARK, open Settings > Providers > Ollama.
  3. The default endpoint http://localhost:11434 is pre-filled. Adjust if you run Ollama on a different port or remote machine.
  4. QARK auto-detects the connection and loads your installed model list.

No API key is needed. If Ollama is running, QARK connects.
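The same endpoint QARK talks to can be exercised by hand. A minimal non-streaming request to Ollama's /api/generate route, assuming llama3.2 (or any other installed model) has been pulled:

```shell
# Minimal generation request against Ollama's native API.
# The model name is an example; substitute any model you have pulled.
BODY='{"model":"llama3.2","prompt":"Say hello in one word.","stream":false}'
curl -s http://localhost:11434/api/generate -d "$BODY" \
  || echo "start Ollama first (ollama serve)"
```

With "stream": false the response arrives as a single JSON object instead of a token-by-token stream.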


LM Studio

Download: lmstudio.ai
Default endpoint: http://localhost:1234/v1
Categories: Chat

Desktop application for discovering, downloading, and running local models with a graphical interface. Provides an OpenAI-compatible local server that QARK connects to.

To install LM Studio:

  1. Download LM Studio from lmstudio.ai (available for macOS, Linux, and Windows).
  2. Launch the application.
  3. Browse the model catalog and download a model (LM Studio handles quantization selection through its UI).

To start the local server:

  1. In LM Studio, navigate to the Local Server tab.
  2. Load a downloaded model into the server.
  3. Click Start Server. The default endpoint is http://localhost:1234/v1.

The server exposes an OpenAI-compatible API — QARK communicates with it using the same protocol as cloud providers.
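That compatibility can be checked with curl. A minimal chat request in the OpenAI schema (the model name here is a placeholder, since LM Studio serves whichever model is currently loaded):

```shell
# OpenAI-style chat completion against the local LM Studio server.
# "local-model" is a placeholder; LM Studio answers with the loaded model.
BODY='{"model":"local-model","messages":[{"role":"user","content":"Hello"}]}'
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" -d "$BODY" \
  || echo "start the LM Studio server first"
```

Any client library that speaks the OpenAI chat-completions protocol can point its base URL at http://localhost:1234/v1 in the same way.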

To connect QARK to LM Studio:

  1. Confirm the LM Studio server is running with a model loaded.
  2. In QARK, open Settings > Providers > LM Studio.
  3. The default endpoint http://localhost:1234/v1 is pre-filled. Adjust if needed.
  4. QARK detects the connection and shows the loaded model.

No API key needed. Switch models by loading a different one in LM Studio’s server — QARK picks up the change on the next refresh.
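To see which model the server currently exposes (for example, after swapping models), query the OpenAI-compatible /v1/models route:

```shell
# List the model(s) the LM Studio server is currently serving.
curl -s --max-time 2 http://localhost:1234/v1/models \
  || echo "LM Studio server not reachable"
```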


QARK treats local and cloud providers identically in the model picker. You can:

  • Use a local model for drafting and a cloud model for final review in the same conversation.
  • Route embeddings to a local Ollama model and chat requests to a cloud provider to minimize API costs.
  • Run fully offline with Ollama or LM Studio for both chat and embedding — no internet required.
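The local-embedding setup in the second bullet can be tried directly. A sketch against Ollama's /api/embeddings route, assuming nomic-embed-text has been pulled:

```shell
# Request an embedding vector from a locally served model.
# nomic-embed-text is the embedding model pulled earlier in this guide.
BODY='{"model":"nomic-embed-text","prompt":"local-first embeddings"}'
curl -s http://localhost:11434/api/embeddings -d "$BODY" \
  || echo "start Ollama first (ollama serve)"
```

The response contains an "embedding" array of floats that can be stored in any vector index, so high-volume indexing never touches a paid API.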

Switch between local and cloud at any point using the model picker.