The recommended server is llama-server from the llama.cpp project, which added native Anthropic Messages API support in early 2026. Other compatible servers include LM Studio and Ollama (with an Anthropic adapter).
Setting Up a Local Model

1. Start a local model server
   Run a model server that exposes an Anthropic Messages API-compatible endpoint. The default endpoint is http://127.0.0.1:8080. See Starting llama-server below for a step-by-step guide.
2. Open Agent Settings
   In Workshop Desktop, click the settings icon and navigate to Agent Settings. Scroll down to the Local Model Setup section.
3. Enter the base URL
   Enter the base URL of your local server (e.g., http://127.0.0.1:8080). Optionally, enter a display name for the model (e.g., “Qwen3-Coder-Next Q4”).
4. Test the connection
   Click Test Connection. Workshop sends a minimal request to the server’s /v1/messages endpoint to verify it’s running and responding. A green “Connected” indicator confirms success.

Starting llama-server
The easiest way to serve a local model is with llama-server from llama.cpp. You can ask Workshop itself to help you set it up:

Sample prompts for Workshop to help you set up
Try these prompts in a Workshop Desktop conversation:
- “Help me install llama-server and download the Qwen3-Coder-Next Q4_K_M GGUF model”
- “Set up a local model server on port 8080 for coding”
- “Download and run the Step 3.5 Flash Q4_K_M model with llama-server”
By default, the server listens at http://127.0.0.1:8080.
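If you prefer to launch the server by hand, a minimal sketch looks like the following (the model path is a placeholder; point it at whichever GGUF file you downloaded):

```shell
# Serve a local GGUF model on the default endpoint (http://127.0.0.1:8080).
# --ctx-size sets the context window; lower it to reduce memory use.
# -ngl 99 offloads all layers to the GPU if they fit.
llama-server \
  -m ~/models/qwen3-coder-next-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 32768 \
  -ngl 99
```

Once the server reports it is listening, enter its base URL in Agent Settings as described above.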
Recommended Models
The following three models are well-suited for agentic coding tasks with Workshop Desktop. All are available in GGUF format for local deployment.

Qwen3-Coder-Next (80B MoE, 3B active)

The top recommendation for local coding. Qwen3-Coder-Next is an 80-billion-parameter Mixture-of-Experts model with only 3 billion parameters active per token, which means it delivers large-model quality at small-model speed. Released February 2026 by Alibaba.
- Native context window: 256K tokens
- Architecture: MoE with selective activation — only a small fraction of the model processes each token
- Strengths: Long-horizon reasoning, complex tool use, recovery from execution failures
| Quantization | Model Size | Min. Memory | Best For |
|---|---|---|---|
| Q3_K (3-bit) | ~30 GB | ~34 GB | Machines with 36–48 GB available memory |
| Q4_K_M (4-bit) | ~42 GB | ~46 GB | Recommended default — best quality/speed tradeoff |
| Q8_0 (8-bit) | ~80 GB | ~85 GB | Maximum quality, requires high-end hardware |
Total memory usage at the recommended Q4_K_M quantization by context length:

| Context Length | Total Memory |
|---|---|
| 4K tokens | ~47 GB |
| 32K tokens | ~48 GB |
| 64K tokens | ~49 GB |
| 256K tokens (full) | ~54 GB |
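As a sketch, llama-server can also download and serve a model directly from Hugging Face with the -hf flag; the repository name below is illustrative, so substitute the actual GGUF repo you use:

```shell
# Download (on first run) and serve Qwen3-Coder-Next at Q4_K_M.
# The repo name is a placeholder. --ctx-size 32768 keeps total memory
# near ~48 GB per the table above, versus ~54 GB at the full 256K context.
llama-server \
  -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M \
  --port 8080 \
  --ctx-size 32768
```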
Step 3.5 Flash (196B MoE, 11B active)
A larger model from StepFun with frontier-level coding capability and impressive speed. Step 3.5 Flash has 196 billion total parameters with 11 billion active per token, using a sophisticated MoE architecture with 288 experts per layer.
- Native context window: 256K tokens
- Architecture: 45 transformer layers (3 dense + 42 MoE), hybrid sliding-window/full attention (3:1 ratio), Multi-Token Prediction for faster generation
- Strengths: Complex multi-step tasks, high throughput (100–350 tokens/second on server hardware), strong general-purpose coding
| Quantization | Model Size | Min. Memory | Best For |
|---|---|---|---|
| Q4_K_S (4-bit) | ~105 GB | ~112 GB | 128 GB systems with tight memory budget |
| Q4_K_M (4-bit) | ~110 GB | ~118 GB | Recommended default on 128 GB+ systems |
| Q8_0 (8-bit) | ~200 GB | ~210 GB | Multi-GPU setups or very high RAM systems |
MiniMax M2.5 (230B MoE, ~10B active)
A strong general-purpose model with excellent coding ability and a massive 1M-token native context. MiniMax M2.5 has 230 billion total parameters with approximately 10 billion active per token.
- Native context window: 1M tokens (one of the largest available for local deployment)
- Architecture: MoE with lightning attention
- Strengths: Very long context windows, broad capability across coding and reasoning tasks
| Quantization | Model Size | Min. Memory | Best For |
|---|---|---|---|
| UD-Q3_K_XL (3-bit) | ~101 GB | ~110 GB | 128 GB systems; best quality at this memory tier |
| Q3_K_L (3-bit) | ~110 GB | ~128 GB | 128 GB Apple Silicon (runs without swap) |
| Q8_0 (8-bit) | ~243 GB | ~256 GB | Multi-GPU or high-RAM server setups |
Hardware Guide
Apple Silicon (Unified Memory)
Apple Silicon Macs share memory between the CPU and GPU, making them efficient for running local models. The entire model must fit in unified memory.

| Unified Memory | What You Can Run |
|---|---|
| 32 GB | Qwen3-Coder-Next Q3_K only (tight fit, short context). Marginal for agentic coding. |
| 64 GB | Qwen3-Coder-Next Q4_K_M comfortably with 32K context. The recommended entry point for local model coding. |
| 96 GB | Qwen3-Coder-Next Q8_0, or Step 3.5 Flash/MiniMax M2.5 at aggressive 2-bit quantization (experimental). |
| 128 GB | Step 3.5 Flash Q4_K_M, MiniMax M2.5 Q3_K_L, or Qwen3-Coder-Next Q8_0 with full 256K context. |
NVIDIA GPUs
NVIDIA GPUs with CUDA are the fastest option for local inference. The model must fit in VRAM for best performance (partial offloading to system RAM is possible but much slower).

| GPU | VRAM | What Fits |
|---|---|---|
| RTX 3090 / 4090 | 24 GB | Too small for the recommended models at useful quantization levels. Consider smaller models (7B–14B) for these GPUs. |
| RTX A6000 / RTX 6000 Ada | 48 GB | Qwen3-Coder-Next Q4_K_M with moderate context. The sweet spot for single-GPU coding agents. |
| RTX PRO 6000 | 96 GB | Qwen3-Coder-Next at any quantization with full 256K context. Excellent performance. |
| 2x RTX 3090/4090 | 48 GB total | Qwen3-Coder-Next Q4_K_M with split-GPU inference. Requires multi-GPU support in llama.cpp. |
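For the dual-GPU row above, a sketch of splitting the model across two cards with llama.cpp's multi-GPU flags (assuming a CUDA build; the model path is a placeholder):

```shell
# Split layers across two GPUs. --tensor-split gives relative proportions
# per device (1,1 = even split); --split-mode layer distributes whole layers.
llama-server \
  -m ~/models/qwen3-coder-next-q4_k_m.gguf \
  --port 8080 \
  -ngl 99 \
  --split-mode layer \
  --tensor-split 1,1
```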
AMD GPUs
AMD GPUs are supported through ROCm. Support has improved significantly but remains less mature than CUDA. If you have an AMD GPU with 24+ GB of VRAM, build llama.cpp with ROCm support and follow the same sizing guidelines as for NVIDIA GPUs.

CPU-Only
Running models on CPU alone (system RAM) is possible but slow — expect 1–5 tokens/second depending on the model and quantization. Useful for testing and experimentation but not practical for interactive agentic coding sessions. Apple Silicon is the exception, since its unified memory architecture means “CPU-only” still benefits from the GPU’s memory bandwidth.

Approval Mode Recommendation
Local models are less reliable than cloud-hosted frontier models at following tool-use contracts and respecting safety boundaries. They may:
- Execute commands you didn’t intend
- Misinterpret tool schemas and call tools incorrectly
- Generate code that modifies files or system state unexpectedly

For these reasons, keep Workshop in an approval mode that asks for confirmation before commands run or files change when working with a local model, rather than auto-approving actions.
Compatible Servers
Workshop Desktop connects to any server that exposes an Anthropic Messages API-compatible endpoint at /v1/messages. The following servers are known to work:
| Server | Notes |
|---|---|
| llama-server (llama.cpp) | Native Anthropic Messages API support. Recommended. Supports streaming, tool use, and vision. |
| LM Studio | GUI application for running local models. Check documentation for Anthropic API compatibility settings. |
| Ollama | Requires an Anthropic adapter (e.g., local-openai2anthropic) since Ollama natively serves an OpenAI-compatible API. |
| vLLM | High-throughput inference engine with Anthropic endpoint support. Better suited for server-class hardware. |
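To check a server by hand, you can send a minimal Messages API request with curl. This is a sketch: the model name is a placeholder (local servers typically use whatever model they loaded and ignore authentication), and a JSON response containing a "content" field indicates the endpoint is working:

```shell
# Minimal Anthropic Messages API request against a local server.
curl -s http://127.0.0.1:8080/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "local-model",
    "max_tokens": 32,
    "messages": [{"role": "user", "content": "Say hello"}]
  }'
```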
Ecosystem and Resources
The local model ecosystem is evolving rapidly. New models with better quality at smaller sizes are released frequently. If the recommended models above don’t fit your hardware, check back — options for 16 GB and 32 GB machines improve with each new model release.

Useful links:
- Unsloth Documentation — Local Models — Guides for running Qwen3, MiniMax, and other models locally with optimized quantizations
- llama.cpp — Anthropic Messages API — Setup guide for llama-server with Anthropic API support
- HuggingFace GGUF Models — Browse trending GGUF models available for download
- LM Studio — GUI-based model runner with model discovery and download built in
- Unsloth Tool Calling Guide — Guide for tool calling with local models (relevant for agentic use)
Troubleshooting
Test Connection fails with 'Connection refused'
The local model server isn’t running, or it’s on a different port than the URL you entered. Verify the server is started and check the port number.
Test Connection fails with 'Endpoint not found'
The server is running but doesn’t have an Anthropic Messages API endpoint at /v1/messages. This usually means the server uses an OpenAI-compatible API instead. Use an adapter like local-openai2anthropic, or switch to llama-server, which has native support.

Model is very slow (< 5 tokens/second)
The model likely doesn’t fit entirely in your GPU VRAM or unified memory, causing parts to be offloaded to system RAM. Try a more aggressive quantization (e.g., Q3_K instead of Q4_K_M) or reduce the context window with the --ctx-size flag.

Model produces poor quality or nonsensical output
Check your sampling parameters. The recommended settings for coding models are: temperature=1.0, top_p=0.95, top_k=40, min_p=0.01. Also verify you’re using a model intended for instruction following, not a base (pre-trained) model.

Out of memory errors
The model plus context window exceeds your available memory. Reduce
--ctx-size (e.g., from 32768 to 8192) or use a smaller quantization. On macOS, check Activity Monitor to see actual memory pressure.
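Putting the troubleshooting fixes together, a reduced-memory launch with the recommended sampling settings might look like this sketch (the model path is a placeholder; substitute your own GGUF file):

```shell
# Smaller context window plus the recommended sampling parameters
# for coding models (temperature, top-p, top-k, min-p).
llama-server \
  -m ~/models/qwen3-coder-next-q3_k.gguf \
  --port 8080 \
  --ctx-size 8192 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.01
```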