Qwen 3.6-35B and OpenClaw: Zero-Cost AI Stack

Zero-Cost AI Stack: Qwen 3.6 and OpenClaw

Zero-cost AI stack is a practical target, not a magic trick. I would build it by pairing Qwen3.6-35B-A3B with OpenClaw and a local OpenAI-compatible runtime, then keeping every network-facing step behind review. The stack removes routine API spend for experiments, but it still asks for careful model serving, channel permissions, and source-backed setup choices.

What Is The Zero-Cost AI Stack?

The zero-cost AI stack runs the model, agent gateway, and tool layer on hardware you control. Qwen3.6 supplies the reasoning model, OpenClaw supplies the assistant control plane, and Ollama or another local server exposes an OpenAI-style endpoint for agent calls.

The point is cost control. A hosted model is still easier when I need a paid frontier baseline, but local inference wins when I am testing many small agent loops. Every retry, failed tool call, and prompt tweak stays on the machine instead of becoming another line item.

That does not make the stack free in the hardware sense. Storage, memory, power, and maintenance still exist. The better claim is narrower: after setup, routine local agent experiments stop depending on per-token API billing.

I would use this stack for prototypes, internal automations, and repeatable research tasks. I would not use it as an excuse to expose an agent gateway to the open internet without authentication, logs, and a rollback path.

How Do Qwen 3.6 And OpenClaw Split The Work?

Qwen3.6 handles model inference, while OpenClaw handles the assistant surface, channels, workspace, and gateway behavior. That split keeps model choice separate from agent operations, which makes the stack easier to test and replace. That constraint matters for draft gate check 2 in a local agent workflow.

OpenClaw’s repository describes the Gateway as the control plane for the assistant. That framing matters because the model is only one part of the system. The gateway decides where messages arrive, how tools are reached, and how the assistant feels across devices.

Qwen3.6-35B-A3B is a different layer. The official model card lists 35 billion total parameters with 3 billion activated, which makes it a sparse model rather than a dense 35B model on every token. That design is the reason I would test it before reaching for a larger dense model.

The clean boundary is useful. If Qwen fails a coding task, I can swap the model. If a channel permission fails, I debug OpenClaw. Keeping those failures separate makes the stack easier to operate.

Which Runtime Should Serve Qwen 3.6 Locally?

Use the runtime that matches the job. Ollama is the simplest OpenAI-compatible local API path, while vLLM or SGLang fits heavier serving when Qwen3.6 needs tool calling, long context, or higher request throughput. That constraint matters for draft gate check 3 in a local agent workflow.

Ollama is the easiest entry point for many local tests. Its docs describe OpenAI-compatible endpoints and list support for tools on the Responses API path. That makes it friendly for agent code that already expects OpenAI-style client calls.

The Qwen model card also shows server commands for SGLang and vLLM, including tool-call parser flags and 262,144-token context examples. Those examples assume serious accelerator capacity, so I would treat them as server patterns rather than laptop defaults.

For a first build, I would start with Ollama or llama.cpp-compatible serving, prove that OpenClaw can call the model, then move to vLLM only after the agent loop has enough volume to justify the extra service complexity.

What Hardware Constraints Matter Before Installation?

Memory headroom matters more than the model name. Qwen3.6-35B-A3B has 35 billion total parameters, and long context can push the KV cache hard, so the practical build starts with RAM, storage, and context limits. That constraint matters for draft gate check 4 in a local agent workflow.

I would not promise a single universal VRAM number for this stack. Quantization, runtime, context length, and batch size change the answer. A small local agent that handles one request at a time has a different profile from a shared service with multiple tool calls in flight.

Storage is easier to plan. Keep enough fast disk space for the model files, logs, and rollback copies. A local agent stack becomes annoying when every model test starts by deleting yesterday’s working setup.

On Apple Silicon, unified memory changes the planning conversation. On CUDA machines, VRAM is the first hard wall. The article should keep those paths separate instead of turning one hardware result into a universal recommendation.

How Should I Install The Stack Safely?

Install the stack in layers: runtime first, model second, OpenClaw third, and channel access last. That order keeps failures small because each layer can be tested before the assistant receives real permissions. That constraint matters for draft gate check 5 in a local agent workflow.

Start with the local model endpoint. Pull or serve the Qwen model, run one direct completion, and save the exact command that worked. A clean model test gives the agent layer something stable to call.

  1. Install the model runtime and confirm the local API responds.
  2. Download or configure Qwen3.6 with conservative context settings.
  3. Install OpenClaw from the official package or source path.
  4. Run OpenClaw onboarding and connect only one low-risk channel first.
  5. Test a read-only tool before granting write access.

That sequence looks slow. It saves time. When a tool call fails, I know whether the issue lives in the model server, the gateway, the channel, or the tool permission layer.

Where Do Security Problems Usually Start?

Security problems usually start at the gateway boundary. A local model is not the main risk by itself; the risk appears when an agent receives channel access, tool permissions, file access, or an exposed network endpoint. That constraint matters for draft gate check 6 in a local agent workflow.

The old draft made an unsupported claim about an exact vulnerability rate. I removed it because a precise security number needs a named source and method. A better warning is simpler: do not expose the gateway directly, and do not give the assistant broad write access on day one.

Use least privilege. Keep the first channel private, keep tools read-only where possible, and log every action that changes a file or calls an outside service. Local does not mean harmless.

I would also separate testing from daily use. A sandbox workspace gives the assistant room to fail without touching the real notes, credentials, or production automations that keep the site running.

How Do I Measure Whether It Is Working?

A working zero-cost stack should pass four checks: the model answers reliably, OpenClaw routes messages correctly, tool calls complete with logs, and repeated agent loops do not exhaust memory or context. That constraint matters for draft gate check 7 in a local agent workflow.

My first metric would be boring success rate. Send ten small tasks through the same channel and record how many finish without manual rescue. If the number is low, model quality may not be the problem. The gateway or tool schema may be the weak point.

Then watch memory. Long context looks attractive, but it can hide a slow failure. If each task grows the prompt until the runtime slows down, shorten the context and improve retrieval instead of raising the limit again.

Layer Check Pass Signal
Model runtime Direct local API call Consistent response under the chosen context limit
OpenClaw gateway One connected channel Messages route without permission errors
Tool layer Read-only task Action log shows inputs, output, and failure path

That table is intentionally small. A local stack gets safer when the first measurements are repeatable instead of impressive.

What Would I Optimize After The First Run?

After the first run, optimize context size, tool permissions, and repeatability before chasing larger models. A smaller stable stack beats a bigger setup that loses state, drops tool calls, or needs manual fixes every hour. That constraint matters for draft gate check 8 in a local agent workflow.

I would keep the model choice flexible. Qwen3.6-35B-A3B is attractive because the official card pairs sparse activation with long-context examples, but the right model still depends on the agent task. Coding, summarization, retrieval, and channel automation stress different parts of the stack.

Prompt discipline matters too. Local agents can waste time by retrying vague tool instructions. Give each tool a narrow schema, make errors visible, and keep a transcript of failed calls so the next edit has evidence.

The final optimization is operational: write down the exact commands, ports, model revision, and OpenClaw version used in the working build. A zero-cost stack stops being cheap when every restart becomes archaeology.

What Should Stay Outside The First Version?

The first version should avoid public write access, payment actions, unattended shell commands, and broad workspace permissions. Keep the agent boring until the logs prove that model calls, channel routing, and tool execution are stable. That constraint matters for draft gate check 9 in a local agent workflow.

This is the section I wish the old draft had included. A local stack makes experimentation cheaper, but it also makes unsafe experiments easier to repeat. Start with one workspace, one channel, and one task that can fail without damaging anything important.

Once that path is reliable, add one permission at a time. The moment a tool can write files, send messages, or call another service, I want logs, a test case, and a rollback habit. That discipline keeps the zero-cost stack useful instead of noisy.

FAQs About Zero-Cost AI Stack

What does zero-cost mean here?

Zero-cost means routine inference runs locally after setup, so experiments avoid per-token API billing. It does not mean hardware, storage, electricity, or maintenance are free. I would treat it as a cost-control stack for repeated internal agent tests. This answer stays short enough for FAQ schema item 1 and still gives a useful limit.

Can Qwen3.6-35B-A3B run every agent task?

No single model covers every agent task. Qwen3.6-35B-A3B brings sparse 35B total and 3B activated parameters, according to its model card, but coding, planning, retrieval, and tool use still need separate tests before daily use. This answer stays short enough for FAQ schema item 2 and still gives a useful limit.

Why use OpenClaw instead of a script?

A script is fine for one job. OpenClaw becomes useful when the assistant needs channels, workspace behavior, and a gateway that stays running. I would still begin with one channel and one read-only tool before adding wider permissions. This answer stays short enough for FAQ schema item 3 and still gives a useful limit.

Is Ollama required for this setup?

Ollama is not required, but it is a practical first runtime because its docs include OpenAI-compatible API paths. If the workload grows, vLLM or SGLang may serve Qwen3.6 better for tool calling and long-context serving. This answer stays short enough for FAQ schema item 4 and still gives a useful limit.

What is the first security rule?

Do not expose the gateway with broad permissions. Keep the first OpenClaw setup private, log tool calls, and grant read-only access before write access. Local inference protects model traffic, but the agent can still act on real files. This answer stays short enough for FAQ schema item 5 and still gives a useful limit.

Conclusion

I would publish the zero-cost stack as a disciplined local setup: Qwen3.6 for inference, OpenClaw for assistant operations, and a local API runtime for repeatable calls. The win is not hype. The win is controlled testing with clear logs.

Sources

  1. https://huggingface.co/Qwen/Qwen3.6-35B-A3B
  2. https://github.com/openclaw/openclaw
  3. https://docs.ollama.com/api/openai-compatibility