Escaping the Agentic Token Tax: Replacing…

Jun 11

opencode + ollama for the win.

1 Comment

Hey Dan! Really cool to see you trying out local AI. I was lucky enough to see the writing on the wall about a month ago, so I’ve been tinkering with local models for a little while.

First, if I’m not mistaken about your MacBook specs, you probably have unified memory, which means you can run a larger model. Google released the Gemma 4 family a couple of months back, which includes 12B, 26B, and 31B variants. The whole family can call tools, and the models are okay at coding, though not amazing. They’re also Apache 2.0 licensed.

My second recommendation is to try a larger Qwen model, if you can. I use Qwen 3.6 35B A3B, but that may be a bit large for 24 GB of memory.

Speaking of size: quantization lets you fit larger models into the same memory. You can quantize from full-precision weights down to Q8 with almost no quality loss. Most local users quantize down to Q4 (myself included; that’s how I run the 35B A3B model on my 24 GB GPU). Google also released versions of the Gemma 4 models just this week that handle quantization a bit better, using “Quantization Aware Training”, or QAT. Worth considering.
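To make the size math concrete, here is a rough back-of-the-envelope sketch in Python. It only counts the weights (parameters times bits per weight) and adds an assumed fixed headroom for the KV cache and runtime overhead. The function names and the 4 GB headroom are my own illustrative choices, not anything from a real tool, and real Q4 formats store somewhat more than 4 bits per weight because of scaling metadata, so treat the numbers as optimistic.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float, memory_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if the weights plus an assumed headroom fit in memory_gb.

    headroom_gb is a guess covering KV cache, activations, and runtime
    overhead; the real figure depends on context length and backend.
    """
    needed = estimate_weight_gb(params_billion, bits_per_weight) + headroom_gb
    return needed <= memory_gb


if __name__ == "__main__":
    # Nominal bit widths; actual quantized files are usually a bit larger.
    for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        size = estimate_weight_gb(35, bits)
        print(f"35B at {label}: ~{size:.1f} GB of weights, "
              f"fits in 24 GB: {fits(35, bits, 24)}")
```

Note that for a mixture-of-experts model like a 35B A3B, the total parameter count is what has to fit in memory, even though only about 3B parameters are active per token.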

There are lots of other ways to approach this, depending on how in depth you want to go: a model serving backend that allows more configuration of model parameters than Ollama; a leaner coding agent framework (e.g., Pi, at pi.dev, which gives more control over the system prompt than opencode but is far less “batteries included”); context management tools; and more guardrails to get better performance out of these smaller models. The list goes on. And of course, leverage agent skills, rules, etc.
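As one tiny illustration of what context management can look like, here is a sketch that keeps the system prompt plus as many of the most recent messages as fit within a token budget. Everything here is an assumption for illustration: the 4-characters-per-token estimate is a crude stand-in for a real tokenizer, the message shape is the common role/content dictionary, and the function names are hypothetical rather than taken from opencode, Pi, or any other tool.

```python
def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep all system messages plus the newest turns that fit in budget_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # System messages are always kept and count against the budget first.
    used = sum(rough_token_count(m["content"]) for m in system)

    kept = []
    # Walk backwards from the newest message, stopping at the first that no longer fits.
    for message in reversed(rest):
        cost = rough_token_count(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost

    # Restore chronological order for the kept turns.
    return system + list(reversed(kept))
```

Smaller local models tend to have tighter usable context windows, so dropping old turns like this (or summarizing them instead) is one simple way to keep the prompt within what the model handles well.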

Definitely not saying it will match cloud models by any means; it’s good to stay realistic! Happy to talk more if you’d like.