Ollama Is the Private Inference Layer in the Cloud Mac AI Stack — Not a Local Model Toy

Q: Claude API is inference too — why a separate Inference layer?

Claude Code produces Diff; Inference produces Token. Diff is one use of tokens. Log summarization, embedding, RAG, and nightly batch jobs are Inference — Claude Code ⊂ Inference use cases, not Inference ⊂ Claude Code.

Q: Must I install Ollama before Claude Code?

No. L2 is optional and parallels L3 — not a dependency.

In the previous article we defined L1 (GitHub Runner): after git push you need an auditable Fact (see Runner piece · Stack language). Next, many teams install Claude Code on the same Cloud Mac and run brew install ollama — then notice day-to-day coding still goes 100% through the Claude API, while the Ollama process sits idle and holds 6–8GB of unified memory.

That is not “Ollama is useless.” The harder question is why Inference must be its own layer instead of folding into Claude Code. This is not an Ollama install guide, and it does not repeat the 16GB vs 24GB memory benchmarks or the M4 vs GPU cloud cost analysis. It only pins down why L2 exists, where it stops, and how it relates to L3 (Diff); L3 details are in Claude Code workstation.

If GitHub Runner answers “did the code actually pass verification,” Ollama answers “which tokens must be inferred on your own Cloud Mac.” Remember the pair: Fact vs Inference.

L2 at a glance: Inference Service · optional · not a prerequisite for Claude Code · not covered here: install tutorials

Cloud Mac AI Stack · L2 in one line

Runner is the execution engine; Ollama is the inference service; Claude Code is the coding agent.

Outputs: Fact, Inference, Diff. L2 is not “which model” — it is turning inference into a long-lived, callable Inference Service (optional).

What L2 owns in the Stack

L3 answers “how to change the repo”; L1 answers “after the change, can we build, sign, and ship”; L2 answers “which inference tokens must never leave our macOS node.” It is not a Claude replacement — it is a second inference pipeline you can stack on Cloud Mac.

L2 output: what Inference is

In the Cloud Mac AI Stack we layer by output, not tool brand (full chain in Runner piece · Stack language):

Mnemonic chain · five outputs (not call order)

  Context  →  Inference       →  Diff           →  Fact      →  Workflow
  (MCP)       (Ollama, etc.)     (Claude Code)     (Runner)     (OpenHands)

The five-layer diagram still maps the L0–L5 components; the output chain lists Inference alongside Context, Diff, Fact, and Workflow so that it reads as an output in its own right rather than a cross-cutting patch.

Inference here means model forward passes inside macOS processes you control: a prompt or text to embed goes in, a completion or vectors come out, and no third-party inference API is involved. A team may still use an API on the L3 main path while L2 handles the tokens that must stay local. Ollama is the most common L2 implementation today, not the only one; Core ML and MLX can cover parts of Inference, but Ollama’s model catalog and ops path are the default story for L2.

L2 shape: Inference Service, not occasional `ollama run`

Saying “Inference” alone still sounds like “another local model app.” In the Stack we stress shape — each layer is a system role you can depend on long term:

Layer	Component	Shape	Output
L1	GitHub Runner	Execution Engine	Fact
L2	Ollama (etc.)	Inference Service	Inference
L3	Claude Code	Agent (coding agent)	Diff

The gap between `ollama run qwen3:8b` and `ollama serve` plus health checks plus cron is like the gap between “SSH in and run tests once” and “Runner listening for push”:

ollama run          →  one-off inference at the terminal (like a manual test run)
Inference Service   →  port 11434 always on, model pinned, Runner / scripts / agents call back

L2 is not about the model — it is about making inference a long-lived Service. Models can change; the interface, callers, schedule, and observability are what the Stack pins down.
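What "always on, model pinned" means for a caller can be sketched as a small health probe. This is a minimal sketch, assuming Ollama is listening on its default local port and that the pinned model appears by name in the `/api/tags` listing; the function names and the three status strings are illustrative, not part of Ollama.

```python
import json
import urllib.request
from typing import Callable

# Assumption: Ollama listens on its default local address.
OLLAMA_URL = "http://localhost:11434"


def fetch_json(url: str, timeout: float = 3.0) -> dict:
    """GET a URL and decode the JSON body (raises OSError on connection errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def service_status(pinned_model: str,
                   base_url: str = OLLAMA_URL,
                   fetch: Callable[[str], dict] = fetch_json) -> str:
    """Return 'ok', 'model-missing', or 'down' (hypothetical status names)."""
    try:
        # /api/tags lists the models that are available locally.
        tags = fetch(f"{base_url}/api/tags")
    except OSError:
        # Connection refused, timeout, DNS failure: the service is not reachable.
        return "down"
    names = {m.get("name") for m in tags.get("models", [])}
    return "ok" if pinned_model in names else "model-missing"
```

A cron job or Runner step could call `service_status("qwen3:8b")` and alert on anything other than `ok`; the `fetch` parameter exists only so the probe can be exercised without a running server.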

Why Inference must be L2, not part of Claude Code

Developers ask: “Isn’t the Claude API inference?” Yes — but the Cloud Mac AI Stack layers by output, not “is there a neural net.”

Claude Code produces Diff: which files change, how, and what lands in the PR patch. Inference produces Token: the text or vectors that come out of any model forward pass. Diff is one use of tokens.

Work that is Inference but should not ride the “coding agent” main path includes:

Log summarization, classification, review, routing
Embedding, RAG, rerank
Agent memory compaction, knowledge-base cleanup
Nightly batch jobs, scheduled daily reports

So the relationship is:

Claude Code  ⊂  one use case of Inference

not:

Inference  ⊂  Claude Code

L2 is separate not because Ollama is “lower level” than Claude, but because a Cloud Mac may run several inference pipelines at once: L3 via API for Diff; L2 on-machine for tokens that must stay private — parallel, not interchangeable. Folding Inference into Claude Code makes teams think “we installed Claude, inference layer done” — the root cause of idle Ollama.

Ollama vs Claude Code: why it is not either-or

Search funnels often frame “Ollama or Claude Code.” In Stack terms the question points the wrong way — different outputs, not substitutes.

Dimension	Claude Code (L3)	Ollama (L2)
Output	Diff	Inference
Shape	Agent (on-demand coding)	Inference Service (can run 24/7)
Compute path	Mostly Claude API	On-machine (`localhost:11434`, etc.)
Typical tasks	Coding, repo edits, patches	Summaries, embedding, classification, nightly batch, compliance subtasks
Work style	On demand: you open the agent, it infers; you leave, main path pauses	Always-on: process 24/7, cron / sidecar calls anytime
In the Stack	Main path (change code)	Subtasks and private pipeline (need not start before L3)

Practical rule: coding goes to Claude Code; tokens that cannot leave the machine or must run on a schedule go to Ollama. Same-host memory scheduling is covered in the L2-Q03 follow-up; this section only establishes that the choice is not either-or.
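The practical rule can be written down as a tiny routing function. This is only an illustration: the `Task` fields, the layer labels, and the `unassigned` fallback are assumptions made for the sketch, and checking privacy before task kind is a design choice, not something the Stack prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str                      # e.g. "code_edit", "log_summary", "embedding"
    must_stay_local: bool = False  # tokens may not leave the machine
    scheduled: bool = False        # runs from cron or a sidecar, not on demand


def route(task: Task) -> str:
    """Pick a layer for a task: 'L2' (Ollama), 'L3' (Claude Code), or 'unassigned'."""
    # Private or scheduled inference goes to the on-machine service first.
    if task.must_stay_local or task.scheduled:
        return "L2"
    # Interactive repo edits stay on the coding agent.
    if task.kind == "code_edit":
        return "L3"
    # Nothing forces this task onto either layer yet.
    return "unassigned"
```

For example, `route(Task("log_summary", scheduled=True))` lands on L2, while an ordinary `Task("code_edit")` stays on L3.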

Model toy vs private inference layer

Same command, `ollama run qwen3:8b`; two mindsets, two architectures:

Dimension	Model toy (personal tryout)	L2 · Inference Service
Goal	Try prompts, compare tok/s, post screenshots	Fixed workloads: compliance summaries, log classification, embedding, nightly batch
CI relationship	None	Can stagger with L1 Runner (CI by day, inference by night), shared L0 memory budget
vs L3	Either-or: “local or Claude”	Parallel: coding via API, subtasks via `localhost:11434`
Success	Model responds	SLA: latency cap, pinned model version, alerts on failure, tokens never leave machine
Machine	Laptop sleeps when lid closes	Cloud Mac 24/7: Inference Service observable with Runner / Agent on one stack

In one line: toys ask “can it run”; L2 asks “what runs, when, and who depends on the output.” If nothing must land on private Inference, skip L2 — do not install Ollama just to “complete the Stack.”
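The "latency cap" row of the table can be made concrete with a thin wrapper around whatever function performs the inference call. A minimal sketch, assuming the cap is checked after the call returns (it measures and flags, it does not abort a slow call); `call_with_latency_cap` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Tuple


def call_with_latency_cap(infer: Callable[[str], str],
                          prompt: str,
                          cap_seconds: float,
                          clock: Callable[[], float] = time.monotonic
                          ) -> Tuple[str, float, bool]:
    """Run one inference call and report (output, elapsed seconds, within cap?)."""
    start = clock()
    output = infer(prompt)
    elapsed = clock() - start
    # The caller decides what to do on a breach: alert, retry, or fall back.
    return output, elapsed, elapsed <= cap_seconds
```

A scheduled job could log `elapsed` on every call and raise an alert when the third value is `False`, which is one way to turn "model responds" into a measurable service target.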

How L2 splits from L1 and L3

Boundaries teams blur most often:

Layer	Component	Output	Question
L1	GitHub Runner	Fact	Does this commit build, test, and archive?
L2	Ollama (etc.)	Inference (via Inference Service)	Which inference must stay on-machine and be callable by Runner / agents?
L3	Claude Code	Diff (Agent shape)	How should the repo change? (most teams use Claude API)

L2 does not produce Diff or Fact. It will not turn PR checks green or replace xcodebuild. Typical wiring: Claude Code edits → Runner verifies → a step or sidecar calls Ollama for log summaries, vulnerability pattern match, private embedding into a vector store — outputs feed Context (L4) or human review, not “build passed” by themselves.
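The "step or sidecar calls Ollama for log summaries" wiring might look roughly like the sketch below. It assumes Ollama's `/api/generate` endpoint on the default local port with streaming disabled; the prompt wording, the 8000-character tail, and the function names are illustrative choices, not a recommended configuration.

```python
import json
import urllib.request

# Assumption: Ollama listens on its default local address.
OLLAMA_URL = "http://localhost:11434"


def build_summary_request(log_text: str, model: str, max_chars: int = 8000) -> dict:
    """Build a non-streaming /api/generate payload from the tail of a CI log."""
    # Failures usually sit near the end of a log, so keep only the tail.
    tail = log_text[-max_chars:]
    prompt = "Summarize the failure in this CI log in three bullet points:\n\n" + tail
    return {"model": model, "prompt": prompt, "stream": False}


def summarize_log(log_text: str, model: str,
                  base_url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload to the local service and return the generated text."""
    body = json.dumps(build_summary_request(log_text, model)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # With "stream": False the reply is one JSON object with a "response" field.
        return json.load(resp)["response"]
```

The summary is an input for Context (L4) or a human reviewer; the build status still comes from the Runner step that produced the log.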

L2 in the five-layer diagram: tier ≠ call order

The series-wide diagram lives in Runner piece · five-layer map. Ollama sits below Claude Code because Inference carries the “private compute” foundation, not because you must run `ollama serve` before opening Claude Code.

Excerpt · full diagram in Runner article

  Claude Code  L3 · Diff
       ↑ parallel, not a dependency
  Ollama       L2 · Inference Service (optional)
       ↑
  GitHub Runner L1 · Fact
       ↑
  Cloud Mac    L0 · infrastructure

Remember: Claude API and Ollama can coexist — API for coding Diff, Ollama for on-machine Inference. That is not “local model replaces Claude”; it is how Cloud Mac AI Stack differs from single-machine chat products.

Beyond compliance: always-on value of L2

L2 is often framed as “finance / healthcare / enterprise only.” Compliance is a strong signal, but many Cloud Mac users are indie developers, AI founders, and small teams without a compliance ticket who still need L2.

The difference is work shape:

Claude Code (L3) · on demand
  you open it → inference starts
  you close it → main-path inference ends

Ollama (L2) · can run 24/7
  process stays up → cron / sidecar calls anytime
  does not need you at the terminal

These jobs are not agent chat, not CI, not daily coding — they are real long-running inference services, for example:

Hourly Runner / app log compaction
Nightly embedding rebuild and knowledge-base cleanup
Auto daily reports, anomaly buckets, routing drafts
Updating Context (L4) data while Claude Code is offline

A laptop that sleeps when closed cannot host that as infrastructure. Staggering with L1 (Fact by day, Inference by night) is often more realistic for small teams than “buy another GPU cloud hour,” provided L2 is operated as a Service and not as an occasional terminal experiment.
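"Fact by day, Inference by night" can be enforced with a simple time gate at the top of each batch job. A minimal sketch; the 22:00 to 06:00 window is an assumed example, and `in_inference_window` is a hypothetical helper, not something Ollama or the Runner provides.

```python
from datetime import datetime, time as dtime


def in_inference_window(now: datetime,
                        start: dtime = dtime(22, 0),
                        end: dtime = dtime(6, 0)) -> bool:
    """True if `now` falls inside the batch window [start, end)."""
    t = now.time()
    if start <= end:
        # Window within a single day, e.g. 08:00 to 18:00.
        return start <= t < end
    # Window that wraps past midnight, e.g. 22:00 to 06:00.
    return t >= start or t < end
```

A nightly embedding rebuild could exit early when `in_inference_window(datetime.now())` is false, leaving daytime memory to Runner and Claude Code.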

Local Mac can run; Cloud Mac can operate

People ask: “My Mac mini can brew install ollama — why Cloud Mac?” Fair — local Mac = can run; this article is about operating it as infrastructure.

Dimension	Local Mac / laptop	Cloud Mac (L0)
Availability	Stops when lid closes, sleeps, travels	24/7 always on, stable egress
Stack fit	Ollama often a lone endpoint	Runner, Claude Code on one observable, schedulable stack
L2 shape	Often `ollama run` tryouts	Inference Service: health checks, model pin, sidecar / cron
Dependents	Usually just you	CI, agents, cron, team scripts

Ollama does not exist because of Cloud Mac; Cloud Mac’s value is turning Ollama from a terminal command into a 24/7, observable, schedulable Inference Service that Runner and agents can share. Otherwise readers think this is “just another Ollama article,” not why private inference belongs in the Cloud Mac AI Stack.

For L0 choice see Cloud Mac vs local Mac AI workstation; we do not repeat rental specs here — only that in Stack terms, L2 should sit on a node you can operate.

Workloads that belong on L2

Signals you should treat Ollama as L2 infrastructure, not a toy:

Compliance / air-gap edge: code and logs may live on Cloud Mac, but inference requests must not hit public APIs; L2 runs redacted subtasks (classification, summary, PII detection).
Embedding / rerank: CodeGraph or RAG needs stable, version-pinned local embedding models instead of API embedding cost and opaque data paths.
High-frequency small models, batchable: e.g. nightly CI log classification; 7B–14B quants on Apple Silicon are often enough, with predictable cost (a fixed-price node rather than hourly GPU billing).
Stagger with L3: Claude Code + Runner fill memory by day; run Ollama batch jobs at night instead of treating 24GB as “everything maxed all day.”
Fast path in multi-agent setups: hard edits still go via Claude; classification, routing, and drafts go via a local 8B model to cut API tokens (desktop agents like OpenHuman use a similar split; see OpenHuman notes).

Conversely, if you only ship iOS, code entirely via Claude API, and have no scheduled local inference, prioritize L1 Runner first — get Fact on push, then consider private Inference.
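For the embedding signal in the list above, a pinned local model can be called as in the sketch below. It assumes a recent Ollama release that exposes `/api/embed` (batch input, `embeddings` in the reply) and uses `nomic-embed-text` as the example model; the helper names and the count check are illustrative.

```python
import json
import urllib.request
from typing import List, Sequence

# Assumptions: default local port, and an embedding model already pulled.
OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"


def build_embed_request(texts: Sequence[str], model: str = EMBED_MODEL) -> dict:
    """Build an /api/embed payload for a batch of texts."""
    return {"model": model, "input": list(texts)}


def parse_embeddings(payload: dict, expected: int) -> List[List[float]]:
    """Extract the vectors and check that one came back per input text."""
    vectors = payload.get("embeddings", [])
    if len(vectors) != expected:
        raise ValueError(f"expected {expected} vectors, got {len(vectors)}")
    return vectors


def embed(texts: Sequence[str], model: str = EMBED_MODEL,
          base_url: str = OLLAMA_URL, timeout: float = 60.0) -> List[List[float]]:
    """POST a batch to the local service and return one vector per text."""
    body = json.dumps(build_embed_request(texts, model)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_embeddings(json.load(resp), expected=len(texts))
```

Keeping the model name in one constant is a cheap form of pinning: every vector in the index comes from the same model, and changing it becomes a deliberate re-embedding step.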

Series read order: Runner → Ollama → Claude Code

If you follow the Stack series by output:

L1 · Fact — GitHub Runner execution engine: after push, does the code really build, test, and ship?
L2 · Inference Service — this article: which tokens must stay on Cloud Mac and be called as a service.
L3 · Diff — Claude Code workstation: how the repo should change (mostly API).

Foundation: Cloud Mac vs local Mac AI workstation. Context (L4) and CodeGraph articles will connect to Inference outputs later.

Typical misjudgment: Ollama installed, Stack still API-only

brew install ollama on Cloud Mac, pull two models, team celebrates “AI stack complete.”
Daily work stays 100% Claude Code + Anthropic API; Ollama untouched for a week.
Idle models hold memory; Claude Code and Runner fight for what's left, swap climbs (see 16GB vs 24GB).
Lead asks: “Why do we still pay for Cloud Mac?” — because L0 is only hardware; no L2 workloads were defined.

The fix is not another GUI. Name 1–2 pipelines that must use L2 (e.g. only failed CI log summaries via Ollama; only embedding via nomic-embed-text), pin model versions, and monitor the health of port 11434. Parallel scheduling is covered in L2-Q03 · memory scheduling.

Who can skip L2

Good fit for L2 (Ollama on Cloud Mac)	Can skip L2 for now
Compliance: inference must not leave machine, or air-gap subtasks	Coding and review entirely via Claude API
Self-hosted RAG / CodeGraph needs local embedding	No vector index, no local batch jobs
7B–14B for high-frequency small tasks to cut API cost	Only occasional chat to try models
Memory stagger planned with L1 (CI by day / inference by night)	Machine only runs Claude Code, no Runner or pipelines
Indie / small team wants 24/7 logs, embedding, daily-report inference	No cron or sidecar will ever call on-machine models

Several long “Ollama-related” pieces already exist — division of labor:

L2-Q02 · 16GB vs 24GB — memory and swap measurements; answers “which RAM tier,” not “where Ollama sits in the Stack.”
L2-Q04 · M4 vs GPU cloud — billing and scale boundaries; this piece does not compare price.
L2-Q05 · Core ML — Apple-native inference path; can coexist with Ollama, different runtime.
L3 · Claude Code workstation — coding experience; this piece adds L2 and how API coding and local Inference coexist.

Rollout order: Fact before Inference

Aligned with Runner piece · rollout order:

L0 — Cloud Mac with always-on macOS.
L1 — Runner so push → green/red is repeatable.
L2 — Ollama only for defined private Inference pipelines (this article).
L3–L5 — Claude Code, MCP, OpenHands after stable Fact + (optional) Inference.

If you stack L2 before L1, you get cheerful local summaries while the build is still unverified (xcodebuild does not run on a Linux runner). Inference cannot replace Fact.

L2 series

This article is the L2 foundation (scope and boundaries). Same series next:

Part	Topic	Status
① · this page	Ollama as private inference layer (Inference)	Published
②	AI workload scheduling on Mac mini: avoid Swap with Ollama, Claude Code, and GitHub Runner	Published
③	Model pin, health checks, CI-side Ollama calls	Planned

Five-layer responsibility map: Cloud Mac AI Stack five-layer diagram.

FAQ

Claude API is inference too — why a separate Inference layer?
Claude Code produces Diff; L2 produces Token (summaries, embedding, batch, etc.). Claude Code ⊂ Inference use cases, not the reverse.

Ollama or Claude Code — pick one?
No. Coding via L3 API; on-machine or scheduled inference via L2 — see comparison table.

Must I install Ollama before Claude Code?
No. L2 is optional and parallels L3.

Do I need L2 without compliance?
If you need 24/7 inference (logs, embedding, daily reports), indie and small teams still benefit; skip if you only occasionally try models.

My Mac mini runs Ollama — why Cloud Mac?
Local can run; Cloud Mac can operate an Inference Service — 24/7, same stack as Runner/Agent, schedulable and observable. See local vs Cloud Mac.

Ollama vs GitHub Runner — which first?
L1 then L2. Vertical order in the diagram is responsibility tiers, not boot order.

How is this different from the 16GB vs 24GB article?
This defines Stack placement; memory benchmarks in 16GB vs 24GB, cost in M4 vs GPU cloud.

L2 series · follow-up

AI Workload Scheduling on Mac mini: How to Avoid Swap with Ollama + Claude Code + GitHub Runner

L2-Q03 · Memory Scheduling Layer: scheduling fixes for Swap and sluggish CI, including a 30-second runbook.

Read L2-Q03 · AI Workload Scheduling
