Ollama is the private inference layer in the Cloud Mac AI Stack — not a local model toy

Once Claude Code can edit your repo and Runner can build it, why keep Ollama running on Cloud Mac?

Cloud Mac AI Stack · L2  ·  2026.06.04  ·  ~14 min read  ·  architecture piece, no Ollama install tutorial

Ollama private inference on Apple Silicon Mac as part of the Cloud Mac AI workload stack

In the previous article we defined L1 (GitHub Runner): after git push you need an auditable Fact (see Runner piece · Stack language). Next, many teams install Claude Code on the same Cloud Mac and run brew install ollama — then notice day-to-day coding still goes 100% through the Claude API, while the Ollama process sits idle and holds 6–8GB of unified memory.

That is not “Ollama is useless.” The harder question is why Inference must be its own layer instead of folding into Claude Code. This is not an Ollama install guide, and it does not repeat the 16GB vs 24GB memory benchmarks or M4 vs GPU cloud cost analysis. It only nails why L2 exists, where it stops, and how it relates to L3 (Diff) (L3 detail in Claude Code workstation).

If GitHub Runner answers “did the code actually pass verification,” Ollama answers “which tokens must be inferred on your own Cloud Mac.” Remember the pair: Fact vs Inference.

L2: Inference Service
Optional: not before Claude Code
0 install tutorials

Cloud Mac AI Stack · L2 in one line

Runner is the execution engine; Ollama is the inference service; Claude Code is the coding agent.

Outputs: Fact, Inference, Diff. L2 is not “which model” — it is turning inference into a long-lived, callable Inference Service (optional).

What L2 owns in the Stack

L3 answers “how to change the repo”; L1 answers “after the change, can we build, sign, and ship”; L2 answers “which inference tokens must never leave our macOS node.” It is not a Claude replacement — it is a second inference pipeline you can stack on Cloud Mac.

L2 output: what Inference is

In the Cloud Mac AI Stack we layer by output, not tool brand (full chain in Runner piece · Stack language):

Memory chain · five outputs (not call order)

  Context  →  Inference       →  Diff           →  Fact      →  Workflow
  (MCP)       (Ollama, etc.)     (Claude Code)     (Runner)     (OpenHands)

The five-layer diagram still maps L0–L5 components; the output chain keeps Inference alongside Context/Diff/Fact/Workflow so it does not feel like a cross-cutting patch

Inference here means model forward passes inside macOS processes you control: a prompt or a text to embed goes in, a completion or a vector comes out, and no third-party inference API is involved. (A third-party API may still serve the L3 main path while L2 handles the tokens that must stay local.) Ollama is the most common L2 implementation today, not the only one; Core ML and MLX can cover parts of Inference, but Ollama’s model catalog and ops path are the default story for L2.
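As a minimal sketch of "prompt in, completion out" against a process you control, the snippet below builds a non-streaming request to Ollama's `/api/generate` endpoint on the default port. The helper names (`build_generate_request`, `generate`) and the timeout are illustrative assumptions, not part of this series.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port, as referenced in this article


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to the local Ollama process and return the completion text."""
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # With "stream": False the reply is a single JSON object with a "response" field.
        return json.load(resp)["response"]
```

Calling `generate("qwen3:8b", "Classify this log line: ...")` only works while an Ollama process is listening on the node; the request never leaves localhost.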

L2 shape: Inference Service, not occasional ollama run

Saying “Inference” alone still sounds like “another local model app.” In the Stack we stress shape — each layer is a system role you can depend on long term:

Layer | Component | Shape | Output
L1 | GitHub Runner | Execution Engine | Fact
L2 | Ollama (etc.) | Inference Service | Inference
L3 | Claude Code | Agent (coding agent) | Diff

The gap between ollama run qwen3:8b and ollama serve plus health checks plus cron is like “SSH in and run tests once” vs “Runner listening for push”:

ollama run          →  one-off inference at the terminal (like a manual test run)
Inference Service   →  port 11434 always on, model pinned, Runner / scripts / agents call back

L2 is not about the model — it is about making inference a long-lived Service. Models can change; the interface, callers, schedule, and observability are what the Stack pins down.
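One way to picture the "Service" shape is a health check that a cron job or Runner step could call before depending on L2. The sketch below assumes Ollama's `/api/tags` endpoint (which lists installed models); the function names and the idea of treating "pinned model is installed" as the health signal are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def pinned_model_present(tags_payload: dict, pinned: str) -> bool:
    """Return True if the pinned tag appears in a parsed /api/tags response."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return pinned in names


def check_service(pinned: str, timeout: float = 5.0) -> bool:
    """Health check: the port answers and the pinned model is installed."""
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=timeout) as resp:
            return pinned_model_present(json.load(resp), pinned)
    except (OSError, ValueError):
        # Connection refused, timeout, or an unparsable body all count as unhealthy.
        return False
```

A scheduler could run `check_service("qwen3:8b")` every few minutes and alert when it returns False, which is the difference between a pinned service and an occasional `ollama run`.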

Why Inference must be L2, not part of Claude Code

Developers ask: “Isn’t the Claude API inference?” Yes — but the Cloud Mac AI Stack layers by output, not “is there a neural net.”

Claude Code produces Diff: which files change, how, and what lands in the PR patch. Inference produces Tokens: the text or vectors that come out of any model forward pass. Diff is one use of tokens.

Work that is Inference but should not ride the “coding agent” main path includes:

  • Log summarization, classification, review, routing
  • Embedding, RAG, rerank
  • Agent memory compaction, knowledge-base cleanup
  • Nightly batch jobs, scheduled daily reports

So the relationship is:

Claude Code  ⊂  one use case of Inference

not:

Inference  ⊂  Claude Code

L2 is separate not because Ollama is “lower level” than Claude, but because a Cloud Mac may run several inference pipelines at once: L3 via API for Diff; L2 on-machine for tokens that must stay private — parallel, not interchangeable. Folding Inference into Claude Code makes teams think “we installed Claude, inference layer done” — the root cause of idle Ollama.

Ollama vs Claude Code: why it is not either-or

Search funnels often frame “Ollama or Claude Code.” In Stack terms the question points the wrong way — different outputs, not substitutes.

Dimension | Claude Code (L3) | Ollama (L2)
Output | Diff | Inference
Shape | Agent (on-demand coding) | Inference Service (can run 24/7)
Compute path | Mostly Claude API | On-machine (localhost:11434, etc.)
Typical tasks | Coding, repo edits, patches | Summaries, embedding, classification, nightly batch, compliance subtasks
Work style | On demand: you open the agent, it infers; you leave, main path pauses | Always-on: process 24/7, cron / sidecar calls anytime
In the Stack | Main path (change code) | Subtasks and private pipeline (need not start before L3)

Practical rule: coding goes to Claude Code; anything framed as “which tokens cannot leave the machine or must run on a schedule” goes to Ollama. Same-host memory scheduling is covered in the L2-Q03 follow-up; here we only establish that the choice is not either-or.
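The practical rule can be written down as a tiny routing function. This is a sketch only: the `Task` fields, the task kind strings, and the "unassigned" fallback for work the rule does not cover are all hypothetical choices, and coding is checked first to mirror the order in which the rule is stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                     # e.g. "code_edit", "log_summary", "embedding"
    must_stay_local: bool = False  # tokens may not leave the machine
    scheduled: bool = False        # triggered by cron / sidecar rather than a person


def route(task: Task) -> str:
    """Return "L3" (Claude Code), "L2" (on-machine Ollama), or "unassigned"."""
    if task.kind == "code_edit":
        return "L3"  # coding stays on the main path
    if task.must_stay_local or task.scheduled:
        return "L2"  # private or scheduled inference
    return "unassigned"  # the rule above does not say; no L2 pipeline is needed


print(route(Task("log_summary", must_stay_local=True)))
```

If every task a team can name routes to "L3" or "unassigned", that is a signal L2 can be skipped for now.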

Model toy vs private inference layer

Same command ollama run qwen3:8b, two mindsets, two architectures:

Dimension | Model toy (personal tryout) | L2 · Inference Service
Goal | Try prompts, compare tok/s, post screenshots | Fixed workloads: compliance summaries, log classification, embedding, nightly batch
CI relationship | None | Can stagger with L1 Runner (CI by day, inference by night), shared L0 memory budget
vs L3 | Either-or: “local or Claude” | Parallel: coding via API, subtasks via localhost:11434
Success | Model responds | SLA: latency cap, pinned model version, alerts on failure, tokens never leave machine
Machine | Laptop sleeps when lid closes | Cloud Mac 24/7: Inference Service observable with Runner / Agent on one stack

In one line: toys ask “can it run”; L2 asks “what runs, when, and who depends on the output.” If nothing must land on private Inference, skip L2 — do not install Ollama just to “complete the Stack.”

How L2 splits from L1 and L3

Boundaries teams blur most often:

Layer | Component | Output | Question
L1 | GitHub Runner | Fact | Does this commit build, test, and archive?
L2 | Ollama (etc.) | Inference (via Inference Service) | Which inference must stay on-machine and be callable by Runner / agents?
L3 | Claude Code | Diff (Agent shape) | How should the repo change? (most teams use Claude API)

L2 does not produce Diff or Fact. It will not turn PR checks green or replace xcodebuild. Typical wiring: Claude Code edits → Runner verifies → a step or sidecar calls Ollama for log summaries, vulnerability pattern match, private embedding into a vector store — outputs feed Context (L4) or human review, not “build passed” by themselves.
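The wiring above can be sketched as a post-build step: the log tail is turned into a summarization prompt for L2, while pass/fail is decided by the build exit code alone. The function names, the 200-line tail, and the prompt wording are assumptions for illustration.

```python
from typing import Optional


def build_summary_prompt(log_text: str, max_lines: int = 200) -> str:
    """Keep only the tail of a CI log and wrap it in a summarization prompt."""
    tail = log_text.splitlines()[-max_lines:]
    return (
        "Summarize the likely cause of this CI failure in three bullet points.\n\n"
        + "\n".join(tail)
    )


def step_result(build_exit_code: int, summary: Optional[str]) -> dict:
    """Combine L1 and L2 outputs; only the build exit code decides pass/fail."""
    return {
        "passed": build_exit_code == 0,  # Fact (L1)
        "summary": summary,              # Inference (L2), advisory only
    }


result = step_result(build_exit_code=1, summary="Signing identity not found")
print(result["passed"])
```

Keeping `summary` out of the `passed` computation is the point: an optimistic summary can never turn a failed check green.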

L2 in the five-layer diagram: tier ≠ call order

The series-wide diagram lives in Runner piece · five-layer map. Ollama sits below Claude Code because Inference carries the “private compute” foundation — not because you must ollama serve before opening Claude Code.

Excerpt · full diagram in Runner article

  Claude Code  L3 · Diff
       ↑ parallel, not a dependency
  Ollama       L2 · Inference Service (optional)
       ↑
  GitHub Runner L1 · Fact
       ↑
  Cloud Mac    L0 · infrastructure

Remember: Claude API and Ollama can coexist — API for coding Diff, Ollama for on-machine Inference. That is not “local model replaces Claude”; it is how Cloud Mac AI Stack differs from single-machine chat products.

Beyond compliance: always-on value of L2

L2 is often framed as “finance / healthcare / enterprise only.” Compliance is a strong signal, but many Cloud Mac users are indie developers, AI founders, and small teams without a compliance ticket who still need L2.

The difference is work shape:

Claude Code (L3) · on demand
  you open it → inference starts
  you close it → main-path inference ends

Ollama (L2) · can run 24/7
  process stays up → cron / sidecar calls anytime
  does not need you at the terminal

These jobs are not agent chat, not CI, not daily coding — they are real long-running inference services, for example:

  • Hourly Runner / app log compaction
  • Nightly embedding rebuild and knowledge-base cleanup
  • Auto daily reports, anomaly buckets, routing drafts
  • Updating Context (L4) data while Claude Code is offline

A laptop that sleeps when closed cannot host that as infrastructure. Staggering L2 with L1 (Fact by day, Inference by night) is often more realistic for small teams than “buy another GPU cloud hour,” provided L2 is operated as a Service, not an occasional terminal experiment.
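The "Fact by day, Inference by night" stagger comes down to a time-window gate that batch jobs check before calling L2. The 22:00 to 06:00 window below is a hypothetical default, not a recommendation from this series.

```python
from datetime import time


def in_inference_window(
    now: time,
    start: time = time(22, 0),
    end: time = time(6, 0),
) -> bool:
    """Return True if `now` falls inside the batch window, which may wrap past midnight."""
    if start <= end:
        # Window within a single day, e.g. 09:00 to 17:00.
        return start <= now < end
    # Window wraps midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end
```

A nightly job would call `in_inference_window(datetime.now().time())` and exit early outside the window, leaving daytime memory to Runner and Claude Code.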

Local Mac can run; Cloud Mac can operate

People ask: “My Mac mini can brew install ollama — why Cloud Mac?” Fair — local Mac = can run; this article is about operating it as infrastructure.

Dimension | Local Mac / laptop | Cloud Mac (L0)
Availability | Stops when lid closes, sleeps, travels | 24/7 always on, stable egress
Stack fit | Ollama often a lone endpoint | Runner, Claude Code on one observable, schedulable stack
L2 shape | Often ollama run tryouts | Inference Service: health checks, model pin, sidecar / cron
Dependents | Usually just you | CI, agents, cron, team scripts

Ollama does not exist because of Cloud Mac; Cloud Mac’s value is turning Ollama from a terminal command into a 24/7, observable, schedulable Inference Service that Runner and agents can share. That is what separates this piece from “just another Ollama article”: it explains why private inference belongs in the Cloud Mac AI Stack.

For L0 choice see Cloud Mac vs local Mac AI workstation; we do not repeat rental specs here — only that in Stack terms, L2 should sit on a node you can operate.

Workloads that belong on L2

Signals you should treat Ollama as L2 infrastructure, not a toy:

  • Compliance / air-gap edge: code and logs may live on Cloud Mac, but inference requests must not hit public APIs; L2 runs redacted subtasks (classification, summary, PII detection).
  • Embedding / rerank: CodeGraph or RAG needs stable, version-pinned local embedding models instead of API embedding cost and opaque data paths.
  • High-frequency small models, batchable: e.g. nightly CI log classification; 7B–14B quants on Apple Silicon are often enough, with predictable cost (fixed node vs hourly GPU).
  • Stagger with L3: Claude Code + Runner fill memory by day; Ollama batch jobs run at night instead of treating 24GB as “everything maxed all day.”
  • Fast path in multi-agent setups: hard edits still go via Claude; classification, routing, and drafts go via a local 8B model to cut API tokens (desktop agents like OpenHuman use a similar split; see OpenHuman notes).

Conversely, if you only ship iOS, code entirely via Claude API, and have no scheduled local inference, prioritize L1 Runner first — get Fact on push, then consider private Inference.
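For the embedding workload in the list above, a minimal sketch might call Ollama's `/api/embed` endpoint and compare vectors locally. The endpoint and the `embeddings` response field reflect the Ollama REST API as I understand it at the time of writing; the helper names are hypothetical, and a real pipeline would pin an explicit model tag.

```python
import json
import math
import urllib.request
from typing import List, Sequence

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"  # model named in this article; pin a specific tag in practice


def embed(texts: List[str], timeout: float = 60.0) -> List[List[float]]:
    """Request one embedding vector per input text from the local Ollama process."""
    body = json.dumps({"model": EMBED_MODEL, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["embeddings"]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because both the embedding call and the similarity math run on the node, neither the documents nor the query text cross a public API.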

Series read order: Runner → Ollama → Claude Code

If you follow the Stack series by output:

  1. L1 · Fact: GitHub Runner execution engine. After push, does the code really build, test, and ship?
  2. L2 · Inference Service: this article. Which tokens must stay on Cloud Mac and be called as a service?
  3. L3 · Diff: Claude Code workstation. How should the repo change (mostly via API)?

Foundation: Cloud Mac vs local Mac AI workstation. Context (L4) and CodeGraph articles will connect to Inference outputs later.

Typical misjudgment: Ollama installed, Stack still API-only

  1. brew install ollama on Cloud Mac, pull two models, team celebrates “AI stack complete.”
  2. Daily work stays 100% Claude Code + Anthropic API; Ollama untouched for a week.
  3. Idle models hold memory; Claude Code and Runner fight for what's left, swap climbs (see 16GB vs 24GB).
  4. Lead asks: “Why do we still pay for Cloud Mac?” — because L0 is only hardware; no L2 workloads were defined.

Fix is not another GUI — name 1–2 pipelines that must use L2 (e.g. only failed CI log summaries via Ollama; only embedding via nomic-embed-text), pin model versions, monitor port 11434 health. Parallel scheduling is in L2-Q03 · memory scheduling.
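Naming pipelines and pinning models can be as simple as a small manifest that is validated before anything is scheduled. The sketch below is one possible shape: the `Pipeline` fields, the pipeline names, and the example tags are hypothetical, and "pinned" here only means "an explicit tag that is not `latest`".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pipeline:
    name: str
    model: str    # Ollama model tag, e.g. "qwen3:8b"
    caller: str   # who depends on the output: "runner", "cron", ...
    trigger: str  # when it runs


# Illustrative entries only; the names and tags are examples, not recommendations.
PIPELINES = [
    Pipeline("failed-ci-log-summary", "qwen3:8b", "runner", "on CI failure"),
    Pipeline("kb-embedding", "nomic-embed-text:v1.5", "cron", "nightly"),
]


def is_pinned(model: str) -> bool:
    """Treat a tag as pinned only if it is explicit and not the floating 'latest'."""
    name, sep, tag = model.partition(":")
    return bool(name) and bool(sep) and tag not in ("", "latest")


def unpinned(pipelines: List[Pipeline]) -> List[str]:
    """Return the names of pipelines whose model tag is not pinned."""
    return [p.name for p in pipelines if not is_pinned(p.model)]


print(unpinned(PIPELINES))
```

A tag can still be re-pushed upstream, so a stricter pin would also compare the model digest reported by `/api/tags` against a recorded value.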

Who can skip L2

Good fit for L2 (Ollama on Cloud Mac) | Can skip L2 for now
Compliance: inference must not leave machine, or air-gap subtasks | Coding and review entirely via Claude API
Self-hosted RAG / CodeGraph needs local embedding | No vector index, no local batch jobs
7B–14B for high-frequency small tasks to cut API cost | Only occasional chat to try models
Memory stagger planned with L1 (CI by day / inference by night) | Machine only runs Claude Code, no Runner or pipelines
Indie / small team wants 24/7 logs, embedding, daily-report inference | No cron or sidecar will ever call on-machine models

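Several long “Ollama-related” pieces already exist, with this division of labor: this article defines Stack placement, 16GB vs 24GB covers memory benchmarks, M4 vs GPU cloud covers cost, and L2-Q03 covers memory scheduling.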

Rollout order: Fact before Inference

Aligned with Runner piece · rollout order:

  1. L0 — Cloud Mac with always-on macOS.
  2. L1 — Runner so push → green/red is repeatable.
  3. L2 — Ollama only for defined private Inference pipelines (this article).
  4. L3–L5 — Claude Code, MCP, OpenHands after stable Fact + (optional) Inference.

Put L2 in place before L1 and you get cheerful local summaries while the build still fails because xcodebuild cannot run on a Linux CI host. Inference cannot replace Fact.

L2 series

This article is the L2 foundation (scope and boundaries). Same series next:

Part | Topic | Status
this page | Ollama as private inference layer (Inference) | Published
L2-Q03 | AI workload scheduling on Mac mini: avoid Swap with Ollama, Claude Code, and GitHub Runner | Published
 | Model pin, health checks, CI-side Ollama calls | Planned

Five-layer responsibility map: Cloud Mac AI Stack five-layer diagram.

FAQ

Claude API is inference too — why a separate Inference layer?
Claude Code produces Diff; L2 produces Token (summaries, embedding, batch, etc.). Claude Code ⊂ Inference use cases, not the reverse.

Ollama or Claude Code — pick one?
No. Coding via L3 API; on-machine or scheduled inference via L2 — see comparison table.

Must I install Ollama before Claude Code?
No. L2 is optional and parallels L3.

Do I need L2 without compliance?
If you need 24/7 inference (logs, embedding, daily reports), indie and small teams still benefit; skip if you only occasionally try models.

My Mac mini runs Ollama — why Cloud Mac?
Local can run; Cloud Mac can operate an Inference Service — 24/7, same stack as Runner/Agent, schedulable and observable. See local vs Cloud Mac.

Ollama vs GitHub Runner — which first?
L1 then L2. Vertical order in the diagram is responsibility tiers, not boot order.

How is this different from the 16GB vs 24GB article?
This article defines Stack placement; memory benchmarks are in 16GB vs 24GB, and cost analysis is in M4 vs GPU cloud.

L2 series · follow-up

AI Workload Scheduling on Mac mini: How to Avoid Swap with Ollama + Claude Code + GitHub Runner

L2-Q03 · Memory Scheduling Layer: scheduling fixes for Swap and sluggish CI, including a 30-second runbook.
