Workshop Desktop can use open-weight AI models running on your local machine instead of cloud-hosted models from Anthropic, OpenAI, or Google. Your prompts, code, and context never leave your hardware. Workshop connects to any server that implements the Anthropic Messages API — the same protocol used by Claude. The most common option is llama-server from the llama.cpp project, which added native Anthropic Messages API support in early 2026. Other compatible servers include LM Studio and Ollama (with an Anthropic adapter).

Setting Up a Local Model

1. Start a local model server

Run a model server that exposes an Anthropic Messages API-compatible endpoint. The default endpoint is http://127.0.0.1:8080. See Starting llama-server below for a step-by-step guide.
2. Open Agent Settings

In Workshop Desktop, click the settings icon and navigate to Agent Settings. Scroll down to the Local Model Setup section.
3. Enter the base URL

Enter the base URL of your local server (e.g., http://127.0.0.1:8080). Optionally, enter a display name for the model (e.g., “Qwen3-Coder-Next Q4”).
4. Test the connection

Click Test Connection. Workshop sends a minimal request to the server’s /v1/messages endpoint to verify it’s running and responding. A green “Connected” indicator confirms success.
5. Select the Local model

Open the model picker in the chat toolbar. A new Local option appears alongside the cloud models (Fast, Balanced, Genius). Select it to route all requests to your local server.
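The Test Connection check in step 4 can also be reproduced by hand. The sketch below (standard library only; the "local" model name is a placeholder, since llama-server serves whatever single model it loaded) builds the kind of minimal request Workshop sends to /v1/messages. Some servers may additionally expect x-api-key or anthropic-version headers; llama-server typically does not.

```python
import json
import urllib.request

def build_probe(base_url: str) -> tuple[str, bytes]:
    """Build a minimal Anthropic Messages API request for a connectivity check."""
    payload = {
        "model": "local",   # placeholder; the server answers with its loaded model
        "max_tokens": 16,   # keep the probe cheap
        "messages": [{"role": "user", "content": "ping"}],
    }
    return base_url.rstrip("/") + "/v1/messages", json.dumps(payload).encode()

def probe(base_url: str) -> int:
    """POST the probe and return the HTTP status; 200 means the endpoint is up."""
    url, body = build_probe(base_url)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Usage (with a server running):
#   probe("http://127.0.0.1:8080")
```

A connection-refused error here means the server isn't running on that port; an HTTP 404 usually means the server speaks a different protocol (see Compatible Servers below).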

Starting llama-server

The easiest way to serve a local model is with llama-server from llama.cpp. You can also ask Workshop itself to set it up for you. Try these prompts in a Workshop Desktop conversation:
  • “Help me install llama-server and download the Qwen3-Coder-Next Q4_K_M GGUF model”
  • “Set up a local model server on port 8080 for coding”
  • “Download and run the Step 3.5 Flash Q4_K_M model with llama-server”
Workshop can install llama.cpp, download model files from HuggingFace, and start the server — all within a single conversation.
If you prefer to set it up manually, here’s the typical workflow:
# Build llama.cpp (macOS/Linux with Metal/CUDA support)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build   # Metal is enabled by default on macOS; on Linux with an NVIDIA GPU, add -DGGML_CUDA=ON
cmake --build build --config Release

# Download a GGUF model (example: Qwen3-Coder-Next Q4_K_M)
# Get the URL from https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

# Start llama-server
./build/bin/llama-server \
  -m /path/to/Qwen3-Coder-Next-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 32768 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.01
The server is ready when you see output indicating it’s listening on port 8080. Workshop connects to it at http://127.0.0.1:8080.
The following three models are well-suited for agentic coding tasks with Workshop Desktop. All are available in GGUF format for local deployment.
Hardware requirements listed below are estimates based on model size, quantization level, and context window. Actual requirements vary by system configuration, other running processes, and llama.cpp version. Always leave headroom beyond the minimum.

Qwen3-Coder-Next (80B MoE, 3B active)

The top recommendation for local coding. Qwen3-Coder-Next is an 80-billion parameter Mixture-of-Experts model with only 3 billion parameters active per token. This means it delivers large-model quality at small-model speed. Released February 2026 by Alibaba.
  • Native context window: 256K tokens
  • Architecture: MoE with selective activation — only a small fraction of the model processes each token
  • Strengths: Long-horizon reasoning, complex tool use, recovery from execution failures
| Quantization | Model Size | Min. Memory | Best For |
|---|---|---|---|
| Q3_K (3-bit) | ~30 GB | ~34 GB | Machines with 36–48 GB available memory |
| Q4_K_M (4-bit) | ~42 GB | ~46 GB | Recommended default — best quality/speed tradeoff |
| Q8_0 (8-bit) | ~80 GB | ~85 GB | Maximum quality, requires high-end hardware |
Context window impact on memory (Q4_K_M):
| Context Length | Total Memory |
|---|---|
| 4K tokens | ~47 GB |
| 32K tokens | ~48 GB |
| 64K tokens | ~49 GB |
| 256K tokens (full) | ~54 GB |
The MoE architecture keeps KV cache growth minimal — going from 4K to 256K context adds only ~7 GB. Performance: Expect 20+ tokens/second when the model fits entirely on your device. On high-end NVIDIA GPUs (RTX PRO 6000), prompt processing reaches 1,400–2,900 tokens/second with generation at 60–85 tokens/second.
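The memory figures above grow nearly linearly with context length. As a rough rule of thumb (a back-of-the-envelope linear fit to the Q4_K_M table above, not an official formula), total memory is about 46.9 GB of weights and overhead plus roughly 28 KB of KV cache per token of context:

```python
def estimate_memory_gb(ctx_tokens: int,
                       base_gb: float = 46.9,        # weights + runtime overhead (fit to table)
                       kv_gb_per_token: float = 28e-6) -> float:
    """Rough linear memory estimate for Qwen3-Coder-Next Q4_K_M at a given context size."""
    return base_gb + ctx_tokens * kv_gb_per_token

# Reproduces the table within rounding:
for ctx in (4_096, 32_768, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> ~{round(estimate_memory_gb(ctx))} GB")
```

Estimates like this are only for sizing headroom; actual usage depends on the llama.cpp version, KV cache quantization settings, and batch size.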
GGUF files are available from unsloth/Qwen3-Coder-Next-GGUF on HuggingFace. Use the Dynamic GGUF variants from Unsloth for the best quality-per-bit.

Step 3.5 Flash (196B MoE, 11B active)

A larger model from StepFun with frontier-level coding capability and impressive speed. Step 3.5 Flash has 196 billion total parameters with 11 billion active per token, using a sophisticated MoE architecture with 288 experts per layer.
  • Native context window: 256K tokens
  • Architecture: 45 transformer layers (3 dense + 42 MoE), hybrid sliding-window/full attention (3:1 ratio), Multi-Token Prediction for faster generation
  • Strengths: Complex multi-step tasks, high throughput (100–350 tokens/second on server hardware), strong general-purpose coding
| Quantization | Model Size | Min. Memory | Best For |
|---|---|---|---|
| Q4_K_S (4-bit) | ~105 GB | ~112 GB | 128 GB systems with tight memory budget |
| Q4_K_M (4-bit) | ~110 GB | ~118 GB | Recommended default on 128 GB+ systems |
| Q8_0 (8-bit) | ~200 GB | ~210 GB | Multi-GPU setups or very high RAM systems |
Step 3.5 Flash is a significantly larger model than Qwen3-Coder-Next and requires more memory. It’s best suited for machines with 128 GB+ unified memory or multi-GPU configurations. Performance: On consumer hardware with sufficient memory, expect 10–20 tokens/second at Q4 quantization. The model’s Multi-Token Prediction architecture can produce bursts of faster throughput depending on the inference engine.
GGUF files are available from stepfun-ai on HuggingFace. Third-party quantizations (including lower-bit options) are also available from community contributors.

MiniMax M2.5 (230B MoE, ~10B active)

A strong general-purpose model with excellent coding ability and a massive 1M token native context. MiniMax M2.5 has 230 billion total parameters with approximately 10 billion active per token.
  • Native context window: 1M tokens (one of the largest available for local deployment)
  • Architecture: MoE with lightning attention
  • Strengths: Very long context windows, broad capability across coding and reasoning tasks
| Quantization | Model Size | Min. Memory | Best For |
|---|---|---|---|
| UD-Q3_K_XL (3-bit) | ~101 GB | ~110 GB | 128 GB systems; best quality at this memory tier |
| Q3_K_L (3-bit) | ~110 GB | ~128 GB | 128 GB Apple Silicon (runs without swap) |
| Q8_0 (8-bit) | ~243 GB | ~256 GB | Multi-GPU or high-RAM server setups |
Performance on Apple Silicon: On an M3 Max with 128 GB unified memory, the Q3_K_L quantization runs at ~29 tokens/second for generation and ~99 tokens/second for prompt processing, supporting context windows up to 196K tokens without swap.
Unsloth provides Dynamic GGUF quantizations for MiniMax M2.5 with improved quality-per-bit. Available at unsloth on HuggingFace. Also check ox-ox/MiniMax-M2.5-GGUF for additional quantization options.

Hardware Guide

Apple Silicon (Unified Memory)

Apple Silicon Macs share memory between the CPU and GPU, making them efficient for running local models. The entire model must fit in unified memory.
| Unified Memory | What You Can Run |
|---|---|
| 32 GB | Qwen3-Coder-Next Q3_K only (tight fit, short context). Marginal for agentic coding. |
| 64 GB | Qwen3-Coder-Next Q4_K_M comfortably with 32K context. The recommended entry point for local model coding. |
| 96 GB | Qwen3-Coder-Next Q8_0, or Step 3.5 Flash/MiniMax M2.5 at aggressive 2-bit quantization (experimental). |
| 128 GB | Step 3.5 Flash Q4_K_M, MiniMax M2.5 Q3_K_L, or Qwen3-Coder-Next Q8_0 with full 256K context. |

NVIDIA GPUs

NVIDIA GPUs with CUDA are the fastest option for local inference. The model must fit in VRAM for best performance (partial offloading to system RAM is possible but much slower).
| GPU | VRAM | What Fits |
|---|---|---|
| RTX 3090 / 4090 | 24 GB | Too small for the recommended models at useful quantization levels. Consider smaller models (7B–14B) for these GPUs. |
| RTX A6000 / RTX 6000 Ada | 48 GB | Qwen3-Coder-Next Q4_K_M with moderate context. The sweet spot for single-GPU coding agents. |
| RTX PRO 6000 | 96 GB | Qwen3-Coder-Next at any quantization with full 256K context. Excellent performance. |
| 2x RTX 3090/4090 | 48 GB total | Qwen3-Coder-Next Q4_K_M with split-GPU inference. Requires multi-GPU support in llama.cpp. |

AMD GPUs

AMD GPUs are supported through ROCm. Support has improved significantly but remains less mature than CUDA. If you have an AMD GPU with 24+ GB VRAM, build llama.cpp with ROCm support and follow the same sizing guidelines as NVIDIA GPUs.

CPU-Only

Running models on CPU alone (system RAM) is possible but slow — expect 1–5 tokens/second depending on the model and quantization. Useful for testing and experimentation but not practical for interactive agentic coding sessions. Apple Silicon is the exception, since its unified memory architecture means “CPU-only” still benefits from the GPU’s memory bandwidth.

Approval Mode Recommendation

Strongly recommended: Enable Approval Mode (Code Execution set to Manual) in Agent Settings when using local models. This requires you to approve each code execution before it runs. This is especially critical for models outside the three recommended above.
Local models are less reliable than cloud-hosted frontier models at following tool-use contracts and respecting safety boundaries. They may:
  • Execute commands you didn’t intend
  • Misinterpret tool schemas and call tools incorrectly
  • Generate code that modifies files or system state unexpectedly
Approval Mode adds a confirmation step before any code runs, giving you the chance to review what the agent wants to execute. You can toggle this in Agent Settings → Code Execution → Manual. For the three recommended models (Qwen3-Coder-Next, Step 3.5 Flash, MiniMax M2.5) at their recommended quantizations, tool-use reliability is generally good. For other models — especially smaller or more aggressively quantized ones — Approval Mode is essential.

Compatible Servers

Workshop Desktop connects to any server that exposes an Anthropic Messages API-compatible endpoint at /v1/messages. The following servers are known to work:
| Server | Notes |
|---|---|
| llama-server (llama.cpp) | Native Anthropic Messages API support. Recommended. Supports streaming, tool use, and vision. |
| LM Studio | GUI application for running local models. Check documentation for Anthropic API compatibility settings. |
| Ollama | Requires an Anthropic adapter (e.g., local-openai2anthropic) since Ollama natively serves an OpenAI-compatible API. |
| vLLM | High-throughput inference engine with Anthropic endpoint support. Better suited for server-class hardware. |
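For OpenAI-only servers such as stock Ollama, an adapter's core job is field mapping. The sketch below (a simplified illustration with a hypothetical function name, assuming plain-string message content) shows the request-side translation; real adapters such as local-openai2anthropic also handle streaming, tool use, content blocks, and error translation:

```python
def anthropic_to_openai(body: dict) -> dict:
    """Translate an Anthropic /v1/messages request body into an
    OpenAI /v1/chat/completions request body (core fields only)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first chat message.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body.get("messages", []))
    out = {
        "model": body["model"],
        "messages": messages,
        "max_tokens": body.get("max_tokens", 1024),
    }
    # Pass through shared sampling/streaming options when present.
    for key in ("temperature", "top_p", "stream"):
        if key in body:
            out[key] = body[key]
    return out
```

A complete adapter must also translate the response in the other direction (OpenAI's choices[0].message back into Anthropic content blocks and stop_reason), which is why a maintained adapter or a natively compatible server is usually the better choice.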

Ecosystem and Resources

The local model ecosystem is evolving rapidly. New models with better quality at smaller sizes are released frequently. If the recommended models above don’t fit your hardware, check back — options for 16 GB and 32 GB machines improve with each new model release.

Troubleshooting

Test Connection fails: The local model server isn’t running, or it’s on a different port than the URL you entered. Verify the server is started and check the port number.
Connection succeeds but chat requests fail: The server is running but doesn’t have an Anthropic Messages API endpoint at /v1/messages. This usually means the server uses an OpenAI-compatible API instead. Use an adapter like local-openai2anthropic, or switch to llama-server, which has native support.
Generation is very slow: The model likely doesn’t fit entirely in your GPU VRAM or unified memory, causing parts to be offloaded to system RAM. Try a more aggressive quantization (e.g., Q3_K instead of Q4_K_M) or reduce the context window with the --ctx-size flag.
Output quality is poor: Check your sampling parameters. The recommended settings for coding models are temperature=1.0, top_p=0.95, top_k=40, min_p=0.01. Also verify you’re using a model intended for instruction following, not a base (pre-trained) model.
The server crashes or runs out of memory: The model plus context window exceeds your available memory. Reduce --ctx-size (e.g., from 32768 to 8192) or use a smaller quantization. On macOS, check Activity Monitor to see actual memory pressure.