In the previous article we defined L1 (GitHub Runner): after git push you need an auditable Fact (see Runner piece · Stack language). Next, many teams install Claude Code on the same Cloud Mac and run brew install ollama — then notice day-to-day coding still goes 100% through the Claude API, while the Ollama process sits idle and holds 6–8GB of unified memory.
That is not “Ollama is useless.” The harder question is why Inference must be its own layer instead of folding into Claude Code. This is not an Ollama install guide, and it does not repeat the 16GB vs 24GB memory benchmarks or M4 vs GPU cloud cost analysis. It only nails why L2 exists, where it stops, and how it relates to L3 (Diff) (L3 detail in Claude Code workstation).
If GitHub Runner answers “did the code actually pass verification,” Ollama answers “which tokens must be inferred on your own Cloud Mac.” Remember the pair: Fact vs Inference.
Cloud Mac AI Stack · L2 in one line
Runner is the execution engine; Ollama is the inference service; Claude Code is the coding agent.
Outputs: Fact, Inference, Diff. L2 is not “which model” — it is turning inference into a long-lived, callable Inference Service (optional).
What L2 owns in the Stack
L3 answers “how to change the repo”; L1 answers “after the change, can we build, sign, and ship”; L2 answers “which inference tokens must never leave our macOS node.” It is not a Claude replacement — it is a second inference pipeline you can stack on Cloud Mac.
L2 output: what Inference is
In the Cloud Mac AI Stack we layer by output, not tool brand (full chain in Runner piece · Stack language):
Memory chain · five outputs (not call order)
Context (MCP) → Inference (Ollama, etc.) → Diff (Claude Code) → Fact (Runner) → Workflow (OpenHands)
The five-layer diagram still maps L0–L5 components; the output chain keeps Inference alongside Context/Diff/Fact/Workflow so it does not feel like a cross-cutting patch.
Inference here means: model forward passes inside macOS processes you control — prompt or embedding in, completion or vectors out, without a third-party inference API (or API only on the L3 main path while L2 handles tokens that must stay local). Ollama is the most common L2 implementation today, not the only one; Core ML and MLX can cover parts of Inference, but Ollama’s model catalog and ops path are the default story for L2.
L2 shape: Inference Service, not occasional ollama run
Saying “Inference” alone still sounds like “another local model app.” In the Stack we stress shape — each layer is a system role you can depend on long term:
| Layer | Component | Shape | Output |
|---|---|---|---|
| L1 | GitHub Runner | Execution Engine | Fact |
| L2 | Ollama (etc.) | Inference Service | Inference |
| L3 | Claude Code | Agent (coding agent) | Diff |
The gap between ollama run qwen3:8b and ollama serve plus health checks plus cron is like “SSH in and run tests once” vs “Runner listening for push”:
ollama run → one-off inference at the terminal (like a manual test run)
Inference Service → port 11434 always on, model pinned, Runner / scripts / agents call back
L2 is not about the model — it is about making inference a long-lived Service. Models can change; the interface, callers, schedule, and observability are what the Stack pins down.
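To make "long-lived Service" concrete, here is a minimal health-probe sketch that a cron job or sidecar could run. It assumes Ollama's default address `localhost:11434` and the `/api/tags` endpoint, which lists locally available models under a `models` array with a `name` field; the function names are hypothetical, not part of any Ollama client.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port cited in this article


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the parsed /api/tags payload, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body all count as "not healthy".
        return None


def model_is_present(tags, model: str) -> bool:
    """True if the pinned model tag appears in a /api/tags payload."""
    if not tags:
        return False
    return any(entry.get("name") == model for entry in tags.get("models", []))


def service_healthy(model: str, base_url: str = OLLAMA_URL) -> bool:
    """Healthy means: the port answers AND the pinned model is still installed."""
    return model_is_present(fetch_tags(base_url), model)
```

Calling `service_healthy("qwen3:8b")` from a scheduled check is one way to turn "port 11434 always on, model pinned" into something an alert can hang off. How you alert is up to your existing monitoring.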
Why Inference must be L2, not part of Claude Code
Developers ask: “Isn’t the Claude API inference?” Yes — but the Cloud Mac AI Stack layers by output, not “is there a neural net.”
Claude Code produces Diff: which files change, how, and what lands in the PR patch. Inference produces Token: the text or vectors that come out of any model forward pass. Diff is one use of tokens.
Work that is Inference but should not ride the “coding agent” main path includes:
- Log summarization, classification, review, routing
- Embedding, RAG, rerank
- Agent memory compaction, knowledge-base cleanup
- Nightly batch jobs, scheduled daily reports
So the relationship is:
Claude Code ⊂ one use case of Inference
not: Inference ⊂ Claude Code
L2 is separate not because Ollama is “lower level” than Claude, but because a Cloud Mac may run several inference pipelines at once: L3 via API for Diff; L2 on-machine for tokens that must stay private — parallel, not interchangeable. Folding Inference into Claude Code makes teams think “we installed Claude, inference layer done” — the root cause of idle Ollama.
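As an illustration of an Inference job that never touches the coding agent, the sketch below sends a log excerpt to the local `/api/generate` endpoint with `stream` set to false and reads the `response` field of the reply. The prompt wording, the 4000-character tail cut, and the function names are assumptions made for this example; `qwen3:8b` is simply the model tag used elsewhere in this article.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_generate_request(model: str, prompt: str) -> dict:
    """Body for POST /api/generate; stream=False asks for a single JSON reply."""
    return {"model": model, "prompt": prompt, "stream": False}


def post_json(url: str, body: dict, timeout: float = 120.0) -> dict:
    """POST a JSON body and parse the JSON reply (standard library only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize_log(log_text: str, model: str = "qwen3:8b", post=post_json) -> str:
    """Summarize the tail of a log on-machine; `post` is injectable for testing."""
    # Assumption: the last 4000 characters are where the failure usually shows up.
    prompt = "Summarize the failure in this CI log in three bullet points:\n\n" + log_text[-4000:]
    reply = post(f"{OLLAMA_URL}/api/generate", build_generate_request(model, prompt))
    return reply.get("response", "").strip()
```

Nothing here produces a Diff: the output is plain text that a human or a Context store can consume, which is the distinction this section draws.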
Ollama vs Claude Code: why it is not either-or
Search funnels often frame “Ollama or Claude Code.” In Stack terms the question points the wrong way — different outputs, not substitutes.
| Dimension | Claude Code (L3) | Ollama (L2) |
|---|---|---|
| Output | Diff | Inference |
| Shape | Agent (on-demand coding) | Inference Service (can run 24/7) |
| Compute path | Mostly Claude API | On-machine (localhost:11434, etc.) |
| Typical tasks | Coding, repo edits, patches | Summaries, embedding, classification, nightly batch, compliance subtasks |
| Work style | On demand: you open the agent, it infers; you leave, main path pauses | Always-on: process 24/7, cron / sidecar calls anytime |
| In the Stack | Main path (change code) | Subtasks and private pipeline (need not start before L3) |
Practical rule: coding → Claude Code; “which tokens cannot leave the machine or must run on a schedule” → Ollama. Same-host memory scheduling is in the L2-Q03 follow-up; here we only fix “not either-or.”
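The practical rule can be written down as a tiny routing function. This is only a sketch: the `Task` fields and the `"undecided"` fallback are hypothetical names for this example, and a real team would add its own task kinds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                      # e.g. "code_edit", "log_summary", "embedding"
    must_stay_local: bool = False  # tokens may not leave the machine
    scheduled: bool = False        # runs from cron / sidecar, not from a person


def route(task: Task) -> str:
    """Return "L2" (on-machine Ollama), "L3" (Claude Code), or "undecided"."""
    if task.must_stay_local or task.scheduled:
        return "L2"
    if task.kind == "code_edit":
        return "L3"
    # Neither private, scheduled, nor coding: no layer is implied by the rule.
    return "undecided"
```

For example, `route(Task("log_summary", scheduled=True))` returns `"L2"`, while `route(Task("code_edit"))` returns `"L3"`. The two branches never compete for the same task, which is the "not either-or" point in code form.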
Model toy vs private inference layer
Same command ollama run qwen3:8b, two mindsets, two architectures:
| Dimension | Model toy (personal tryout) | L2 · Inference Service |
|---|---|---|
| Goal | Try prompts, compare tok/s, post screenshots | Fixed workloads: compliance summaries, log classification, embedding, nightly batch |
| CI relationship | None | Can stagger with L1 Runner (CI by day, inference by night), shared L0 memory budget |
| vs L3 | Either-or: “local or Claude” | Parallel: coding via API, subtasks via localhost:11434 |
| Success | Model responds | SLA: latency cap, pinned model version, alerts on failure, tokens never leave machine |
| Machine | Laptop sleeps when lid closes | Cloud Mac 24/7: Inference Service observable with Runner / Agent on one stack |
In one line: toys ask “can it run”; L2 asks “what runs, when, and who depends on the output.” If nothing must land on private Inference, skip L2 — do not install Ollama just to “complete the Stack.”
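The "Success" row for L2 (latency cap, pinned model version, alerts on failure) can be checked mechanically once calls are recorded. A minimal sketch, assuming you log one record per inference call; `CallRecord` and `sla_violations` are hypothetical names, and the 30-second cap in the usage note is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    model: str        # model tag that actually served the call
    latency_s: float  # wall-clock seconds for the call
    ok: bool          # False if the call errored or timed out


def sla_violations(records, pinned_model: str, latency_cap_s: float) -> list:
    """Return human-readable violations; an empty list means the SLA held."""
    problems = []
    for i, rec in enumerate(records):
        if not rec.ok:
            problems.append(f"call {i}: failed")
        if rec.model != pinned_model:
            problems.append(f"call {i}: model {rec.model} != pinned {pinned_model}")
        if rec.latency_s > latency_cap_s:
            problems.append(f"call {i}: {rec.latency_s:.1f}s over cap {latency_cap_s:.1f}s")
    return problems
```

Feeding the night's records into `sla_violations(records, "qwen3:8b", 30.0)` and alerting on a non-empty result is one way to separate "model responds" from "service meets its SLA".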
How L2 splits from L1 and L3
Boundaries teams blur most often:
| Layer | Component | Output | Question |
|---|---|---|---|
| L1 | GitHub Runner | Fact | Does this commit build, test, and archive? |
| L2 | Ollama (etc.) | Inference (via Inference Service) | Which inference must stay on-machine and be callable by Runner / agents? |
| L3 | Claude Code | Diff (Agent shape) | How should the repo change? (most teams use Claude API) |
L2 does not produce Diff or Fact. It will not turn PR checks green or replace xcodebuild. Typical wiring: Claude Code edits → Runner verifies → a step or sidecar calls Ollama for log summaries, vulnerability pattern match, private embedding into a vector store — outputs feed Context (L4) or human review, not “build passed” by themselves.
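The wiring above implies a simple invariant: the pass/fail Fact comes only from the build's exit code, and the L2 summary is attached as advisory text. A sketch of that separation follows; `StepReport` and `build_report` are hypothetical names, and `summarize` stands for any callable that asks the local model for a summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepReport:
    passed: bool            # Fact: derived only from the build/test exit code
    summary: Optional[str]  # Inference: advisory text for review or Context (L4)


def build_report(exit_code: int, log_text: str, summarize=None) -> StepReport:
    """Combine a Runner result with an optional L2 summary without mixing them."""
    passed = exit_code == 0
    summary = None
    # Only failed runs are summarized, mirroring "only failed CI log summaries".
    if not passed and summarize is not None:
        try:
            summary = summarize(log_text)
        except Exception:
            # An L2 outage must not change or mask the Fact.
            summary = None
    return StepReport(passed=passed, summary=summary)
```

Whatever the model says, `passed` cannot flip: a glowing summary on a red build still yields `passed=False`, and an Ollama outage leaves the report with no summary rather than no result.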
L2 in the five-layer diagram: tier ≠ call order
The series-wide diagram lives in Runner piece · five-layer map. Ollama sits below Claude Code because Inference carries the “private compute” foundation — not because you must ollama serve before opening Claude Code.
Excerpt · full diagram in Runner article
Claude Code L3 · Diff
↑ parallel, not a dependency
Ollama L2 · Inference Service (optional)
↑
GitHub Runner L1 · Fact
↑
Cloud Mac L0 · infrastructure
Remember: Claude API and Ollama can coexist — API for coding Diff, Ollama for on-machine Inference. That is not “local model replaces Claude”; it is how Cloud Mac AI Stack differs from single-machine chat products.
Beyond compliance: always-on value of L2
L2 is often framed as “finance / healthcare / enterprise only.” Compliance is a strong signal, but many Cloud Mac users are indie developers, AI founders, and small teams without a compliance ticket who still need L2.
The difference is work shape:
Claude Code (L3) · on demand
you open it → inference starts
you close it → main-path inference ends
Ollama (L2) · can run 24/7
process stays up → cron / sidecar calls anytime
does not need you at the terminal
These jobs are not agent chat, not CI, not daily coding — they are real long-running inference services, for example:
- Hourly Runner / app log compaction
- Nightly embedding rebuild and knowledge-base cleanup
- Auto daily reports, anomaly buckets, routing drafts
- Updating Context (L4) data while Claude Code is offline
A laptop that sleeps when closed cannot host that as infrastructure. Staggering with L1 (Fact by day, Inference by night) is often more realistic for small teams than “buy another GPU cloud hour,” provided L2 is operated as a Service, not an occasional terminal experiment.
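"Fact by day, Inference by night" can be enforced with a small time-window guard at the top of each batch job. The sketch below is illustrative only: the 22:00 to 06:00 window is an assumed example, not a recommendation from this series, and the function names are hypothetical.

```python
from datetime import datetime, time

# Assumed example window; pick hours that match your own CI traffic.
NIGHT_START = time(22, 0)
NIGHT_END = time(6, 0)


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end); the window may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def may_run_batch(now: datetime) -> bool:
    """Gate for nightly inference jobs so they do not compete with daytime CI."""
    return in_window(now.time(), NIGHT_START, NIGHT_END)
```

A cron-launched job can call `may_run_batch(datetime.now())` and exit early outside the window, which keeps a mis-scheduled or manually triggered batch from landing on top of a daytime build.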
Local Mac can run; Cloud Mac can operate
People ask: “My Mac mini can brew install ollama — why Cloud Mac?” Fair — local Mac = can run; this article is about operating it as infrastructure.
| Dimension | Local Mac / laptop | Cloud Mac (L0) |
|---|---|---|
| Availability | Stops when lid closes, sleeps, travels | 24/7 always on, stable egress |
| Stack fit | Ollama often a lone endpoint | Runner, Claude Code on one observable, schedulable stack |
| L2 shape | Often ollama run tryouts | Inference Service: health checks, model pin, sidecar / cron |
| Dependents | Usually just you | CI, agents, cron, team scripts |
Ollama does not exist because of Cloud Mac; Cloud Mac’s value is turning Ollama from a terminal command into a 24/7, observable, schedulable Inference Service that Runner and agents can share. Otherwise readers think this is “just another Ollama article,” not why private inference belongs in the Cloud Mac AI Stack.
For L0 choice see Cloud Mac vs local Mac AI workstation; we do not repeat rental specs here — only that in Stack terms, L2 should sit on a node you can operate.
Workloads that belong on L2
Signals you should treat Ollama as L2 infrastructure, not a toy:
- Compliance / air-gap edge: code and logs may live on Cloud Mac, but inference requests must not hit public APIs; L2 runs redacted subtasks (classification, summary, PII detection).
- Embedding / rerank: CodeGraph or RAG needs stable, version-pinned local embedding models instead of API embedding cost and opaque data paths.
- High-frequency small models, batchable: e.g. nightly CI log classification; 7B–14B quants on Apple Silicon are often enough, with predictable cost (fixed node vs hourly GPU).
- Stagger with L3: Claude Code + Runner fill memory by day; run Ollama batch jobs at night instead of treating 24GB as “everything maxed all day.”
- Fast path in multi-agent setups: hard edits still via Claude; classification, routing, drafts via local 8B to cut API tokens (desktop agents like OpenHuman use a similar split; see OpenHuman notes).
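For the embedding / rerank workload, a rough sketch of the local half looks like this. It assumes Ollama's `/api/embed` endpoint, which to my understanding accepts a `model` plus an `input` list and returns an `embeddings` array; verify that against the API documentation for your installed version. The ranking helpers are plain cosine similarity and are not taken from any particular RAG library.

```python
import math


def build_embed_request(model: str, texts: list) -> dict:
    """Body for POST /api/embed; "input" carries the strings to embed."""
    return {"model": model, "input": texts}


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, doc_vecs) -> list:
    """Indices of doc_vecs ordered from most to least similar to query_vec."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

With vectors produced by one pinned embedding model, `rank` gives a stable ordering from run to run. Changing the model changes the vector space, so stored vectors would need to be rebuilt, which is why the bullet above stresses version pinning.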
Conversely, if you only ship iOS, code entirely via Claude API, and have no scheduled local inference, prioritize L1 Runner first — get Fact on push, then consider private Inference.
Series read order: Runner → Ollama → Claude Code
If you follow the Stack series by output:
- L1 · Fact: GitHub Runner execution engine. After push, does the code really build, test, and ship?
- L2 · Inference Service: this article. Which tokens must stay on Cloud Mac and be called as a service.
- L3 · Diff: Claude Code workstation. How the repo should change (mostly API).
Foundation: Cloud Mac vs local Mac AI workstation. Context (L4) and CodeGraph articles will connect to Inference outputs later.
Typical misjudgment: Ollama installed, Stack still API-only
- brew install ollama on Cloud Mac, pull two models, team celebrates “AI stack complete.”
- Daily work stays 100% Claude Code + Anthropic API; Ollama untouched for a week.
- Idle models hold memory; Claude Code and Runner fight for what's left, swap climbs (see 16GB vs 24GB).
- Lead asks: “Why do we still pay for Cloud Mac?” — because L0 is only hardware; no L2 workloads were defined.
Fix is not another GUI — name 1–2 pipelines that must use L2 (e.g. only failed CI log summaries via Ollama; only embedding via nomic-embed-text), pin model versions, monitor port 11434 health. Parallel scheduling is in L2-Q03 · memory scheduling.
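One lightweight way to "name 1–2 pipelines" is to declare them in code and reject any that do not pin a model tag. This is a sketch under stated assumptions: the `Pipeline` fields are hypothetical, and treating an explicit non-`latest` tag as a pin is a simplification, since a tag can still be re-pulled with different weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipeline:
    name: str     # what depends on L2, e.g. "failed-ci-log-summary"
    model: str    # Ollama model reference in "name:tag" form
    trigger: str  # how it is called, e.g. "cron" or "sidecar"


def pin_problems(p: Pipeline) -> list:
    """Return reasons the pipeline's model reference does not count as pinned."""
    problems = []
    name, sep, tag = p.model.partition(":")
    if not name:
        problems.append(f"{p.name}: empty model name")
    if not sep or not tag:
        problems.append(f"{p.name}: model '{p.model}' has no explicit tag")
    elif tag == "latest":
        problems.append(f"{p.name}: tag 'latest' is not a pin")
    return problems
```

A declaration such as `Pipeline("failed-ci-log-summary", "qwen3:8b", "sidecar")` passes, while a bare `nomic-embed-text` is flagged until a tag is chosen. If the list of declared pipelines is empty, that is the signal from this section that L2 has no defined workload yet.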
Who can skip L2
| Good fit for L2 (Ollama on Cloud Mac) | Can skip L2 for now |
|---|---|
| Compliance: inference must not leave machine, or air-gap subtasks | Coding and review entirely via Claude API |
| Self-hosted RAG / CodeGraph needs local embedding | No vector index, no local batch jobs |
| 7B–14B for high-frequency small tasks to cut API cost | Only occasional chat to try models |
| Memory stagger planned with L1 (CI by day / inference by night) | Machine only runs Claude Code, no Runner or pipelines |
| Indie / small team wants 24/7 logs, embedding, daily-report inference | No cron or sidecar will ever call on-machine models |
How this relates to published benchmark articles (not repeated here)
Several long “Ollama-related” pieces already exist — division of labor:
- L2-Q02 · 16GB vs 24GB — memory and swap measurements; answers “which RAM tier,” not “where Ollama sits in the Stack.”
- L2-Q04 · M4 vs GPU cloud — billing and scale boundaries; this piece does not compare price.
- L2-Q05 · Core ML — Apple-native inference path; can coexist with Ollama, different runtime.
- L3 · Claude Code workstation — coding experience; this piece adds L2 and how API coding and local Inference coexist.
Rollout order: Fact before Inference
Aligned with Runner piece · rollout order:
- L0: Cloud Mac with always-on macOS.
- L1: Runner so push → green/red is repeatable.
- L2: Ollama only for defined private Inference pipelines (this article).
- L3–L5: Claude Code, MCP, OpenHands after stable Fact + (optional) Inference.
Stack L2 before L1 and you get cheerful local summaries while xcodebuild still fails on Linux — Inference cannot replace Fact.
L2 series
This article is the L2 foundation (scope and boundaries). Same series next:
| Part | Topic | Status |
|---|---|---|
| ① · this page | Ollama as private inference layer (Inference) | Published |
| ② | AI workload scheduling on Mac mini: avoid Swap with Ollama, Claude Code, and GitHub Runner | Published |
| ③ | Model pin, health checks, CI-side Ollama calls | Planned |
Five-layer responsibility map: Cloud Mac AI Stack five-layer diagram.
FAQ
Claude API is inference too — why a separate Inference layer?
Claude Code produces Diff; L2 produces Token (summaries, embedding, batch, etc.). Claude Code ⊂ Inference use cases, not the reverse.
Ollama or Claude Code — pick one?
No. Coding via L3 API; on-machine or scheduled inference via L2 — see comparison table.
Must I install Ollama before Claude Code?
No. L2 is optional and parallels L3.
Do I need L2 without compliance?
If you need 24/7 inference (logs, embedding, daily reports), indie and small teams still benefit; skip if you only occasionally try models.
My Mac mini runs Ollama — why Cloud Mac?
Local can run; Cloud Mac can operate an Inference Service — 24/7, same stack as Runner/Agent, schedulable and observable. See local vs Cloud Mac.
Ollama vs GitHub Runner — which first?
L1 then L2. Vertical order in the diagram is responsibility tiers, not boot order.
How is this different from the 16GB vs 24GB article?
This defines Stack placement; memory benchmarks in 16GB vs 24GB, cost in M4 vs GPU cloud.
L2 series · follow-up
AI Workload Scheduling on Mac mini: How to Avoid Swap with Ollama + Claude Code + GitHub Runner
L2-Q03 · Memory Scheduling Layer: scheduling fixes for Swap and sluggish CI, including a 30-second runbook.