When you run Ollama, Claude Code, and GitHub Actions together on a Mac mini or Cloud Mac, the most common failure mode is not “the machine is too slow” — it is Swap, a sluggish CLI, and CI builds that drag.
This piece does not cover tool installation. It explains why three AI workloads on one host cause memory churn, and how scheduling prevents it. You get a bad-schedule example, a memory budget, threshold-based rules, a 30-second runbook, and full scripts. It does not repeat the 16GB vs 24GB benchmarks.
TL;DR
- Ollama always-on + CI burst → Swap is very common on M4 (see bad schedule)
- The fix is usually not a bigger Mac — assign schedules to burst, interactive, and background workloads
- ollama stop before CI is the highest-leverage, easiest step (30-second version)
As you read, keep the AI workload scheduling model in mind: in Cloud Mac setups we observe, most Swap and OOM come from scheduling, not from running out of RAM on paper.
The real issue: usually not “not enough RAM,” but no priority
The surface story is “local model idles + CI spikes + coding agent feels slow” — and teams misread it as “we need more RAM.”
On Cloud Mac hosts we watch, most Swap and OOM trace to workload scheduling, not absolute memory shortage. Three workload classes share one Mac mini (details in the next section):
- Burst: GitHub Runner / xcodebuild, +4–8 GB on push
- Interactive: Claude Code, IDE + terminal + large-repo indexing
- Background: local inference (Ollama), 8B loaded and idle still holds 5–7 GB
By default all three compete equally for unified memory — nobody yields. The fix is to define workload priority and triggers, and treat background inference as unloadable and preemptible by CI events, not “installed, so always resident.”
Bad schedule example (common on M4, not a one-off)
This is a combination we have seen on small-team Cloud Mac hosts — treat it as a “do not onboard like this” pattern:
Bad pattern · all three workloads "always on"
- qwen3:8b always loaded → 5–7 GB (background, zero calls)
- push triggers xcodebuild → +4–8 GB peak (burst)
- Claude Code indexing large repo → +2–3 GB (interactive)
Observed on 24GB M4 Mac mini
- Swap 0 → 2.1 GB
- xcodebuild link stage latency ~+40%
- Claude Code terminal noticeably sluggish
On M4 unified memory, 8B resident + CI burst + interactive coding at once very often produces Swap. The point of the example: workload scheduling is not polish — it is a prerequisite when multiple workloads share one host.
Three workload shapes: burst / interactive / background
Scheduling by clock alone is not enough. On Cloud Mac, the three classes have very different memory curves:
| Shape | Examples | Layer | Memory profile | Schedule priority |
|---|---|---|---|---|
| Burst | xcodebuild, linking, simulators | L1 Runner | Spiky, hard to predict | Highest: Fact must not fail |
| Interactive | Claude Code, IDE, SSH sessions | L3 | Moderate, human in the loop | High: reasoning via API, not local LLM weights |
| Background | Ollama embedding, log summaries | L2 | Deferrable, unloadable | Lowest: must yield to burst |
In one line: L2 is the only Stack layer that can be kicked off the machine — not because it is unimportant, but because its jobs are mostly async and retryable.
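The priority column in the table can be written down as a small lookup. This is a minimal sketch, not part of the original runbook: the function names `class_rank` and `yield_order` are hypothetical, and the ranks only encode the table's ordering (background yields first, burst last).

```shell
#!/usr/bin/env bash
# Hypothetical helpers: encode the table's priority (burst > interactive > background).

# class_rank CLASS -> 0 yields first, 2 yields last
class_rank() {
  case "$1" in
    background)  echo 0 ;;
    interactive) echo 1 ;;
    burst)       echo 2 ;;
    *)           echo 9 ;;   # unknown classes are never asked to yield first
  esac
}

# yield_order: read "name class" lines on stdin, print names in the order they should yield
yield_order() {
  local name class
  while read -r name class; do
    printf '%s %s\n' "$(class_rank "$class")" "$name"
  done | sort -s -n -k1,1 | cut -d' ' -f2
}
```

Feeding it `xcodebuild burst`, `claude-code interactive`, and `ollama background` lists `ollama` first. In this article only the background class is actually unloaded; the order just makes explicit who is asked to give up memory first.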
Memory budget: what each workload costs
Below, Ollama / Claude Code / GitHub Runner are example components. Numbers come from M4 Mac mini measurements (steady state, not compile spikes):
| Component | Layer | Typical use | Peak notes |
|---|---|---|---|
| macOS + system cache | L0 | 3–4 GB | Relatively stable |
| Claude Code workspace | L3 | 1–3 GB | Reasoning via API — no model weights |
| GitHub Runner job | L1 | 2–6 GB (steady) | Link stage +4–8 GB instant |
| Ollama · qwen3:8b | L2 | 5–7 GB | Released with ollama stop |
| Ollama · qwen3:14b | L2 | 9–13 GB | With burst, Swap 2GB+ is easy |
| Ollama · nomic-embed-text | L2 | 0.3–0.8 GB | Light background you can keep by day |
Rough daytime math on 24GB: “coding + CI + 8B resident” ≈ 17 GB — still headroom. Add 14B resident and you blow past 22 GB fast. The budget table answers “can we”; scheduling answers “who should hold memory when.”
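The daytime math can be reproduced by summing the steady-state ranges from the budget table. The sketch below is illustrative only: `sum_range` and `verdict` are hypothetical helper names, the ranges are copied from the table, and link-stage spikes (+4–8 GB) are deliberately left out.

```shell
#!/usr/bin/env bash
# Sketch: add up per-component memory ranges (integer GB) and compare with installed RAM.
# Steady-state figures from the budget table; compile/link spikes are not included.

# sum_range "lo-hi" ... -> prints "LO HI"
sum_range() {
  local lo=0 hi=0 r
  for r in "$@"; do
    lo=$(( lo + ${r%-*} ))   # part before the dash
    hi=$(( hi + ${r#*-} ))   # part after the dash
  done
  echo "$lo $hi"
}

# verdict RAM_GB LO HI -> fits (even the high end fits) | tight | over (even the low end does not fit)
verdict() {
  local ram=$1 lo=$2 hi=$3
  if   (( hi <= ram )); then echo fits
  elif (( lo <= ram )); then echo tight
  else echo over
  fi
}

MACOS="3-4" CLAUDE="1-3" RUNNER="2-6" Q8B="5-7" Q14B="9-13"

read -r lo hi <<<"$(sum_range "$MACOS" "$CLAUDE" "$RUNNER" "$Q8B")"
echo "coding + CI + 8B on 24 GB: ${lo}-${hi} GB -> $(verdict 24 "$lo" "$hi")"
```

With these inputs the range is 11–20 GB, which brackets the ≈17 GB estimate. Adding `"$Q14B"` to the same call moves the range to 20–33 GB, which is where a 24GB host stops having headroom.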
Scheduling model: time windows to memory_pressure thresholds
Start with a day/night table (mode ② below). The more operable upgrade is threshold scheduling on memory pressure — you do not rely on someone remembering “22:00 is batch time”:
AI Workload Scheduler · L2-Q03 recommended rules
- When memory_pressure enters warn / critical (or equivalent > ~70%): → auto ollama stop qwen3:8b / qwen3:14b
- When memory_pressure is normal (< ~50%) and idle > 10 min: → auto preload nomic-embed-text (keepalive 10m)
- When a CI event fires (Runner job start): → force CI mode: stop all large Ollama models; priority L1 > L3 > L2
- When the CI job succeeds and memory recovers: → async L2 batch (log summary / embedding rebuild)
Time windows are the baseline; threshold scheduling is the upgrade. Small teams can copy the runbook + CI stop first; add a memory_pressure guard script once stable (see Runbook).
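The four rules can be read as one decision function. The sketch below is an assumption-laden illustration, not the author's scheduler: `decide_action` is a hypothetical name, memory use is passed in as a free-memory percentage (so "> ~70% used" becomes "< 30% free"), and the CI state is supplied by the caller rather than detected.

```shell
#!/usr/bin/env bash
# Sketch of the recommended rules as a pure function (no Ollama calls here).

# decide_action FREE_PCT CI_STATE IDLE_MIN -> action keyword
#   CI_STATE: running | succeeded | none
decide_action() {
  local free_pct=$1 ci_state=$2 idle_min=$3
  local used_pct=$(( 100 - free_pct ))
  if [ "$ci_state" = "running" ]; then
    echo stop-large-models       # CI event: L1 > L3 > L2
  elif (( used_pct > 70 )); then
    echo stop-large-models       # warn / critical equivalent
  elif [ "$ci_state" = "succeeded" ] && (( used_pct < 50 )); then
    echo run-l2-batch            # job done and memory recovered
  elif (( used_pct < 50 && idle_min > 10 )); then
    echo preload-embedding
  else
    echo no-op
  fi
}
```

For example, `decide_action 60 none 15` prints `preload-embedding`, while `decide_action 20 succeeded 0` still prints `stop-large-models` because memory has not recovered. Checking the CI state first keeps the burst workload ahead of every other rule.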
Three baseline schedule modes
Use these with threshold rules:
| Mode | Approach | Best for |
|---|---|---|
| ① Light coexistence | Daytime: only nomic-embed-text resident; load 8B/14B on demand | 16GB Mac mini, coding-first |
| ② Time split | 09–18 coding+CI / 22–06 nightly batch | 24GB, scheduled embedding / log jobs |
| ③ CI yield | ollama stop before job; async L2 after | Frequent push, high xcodebuild peaks |
Recommended combo: ① + ③ for everyone by default; ② for nightly batch; threshold guard as safety net.
Pipeline split: what runs on L2 vs L3
Pin the pipeline before schedules — otherwise Ollama still idles (see L2-Q01 · typical misjudgment):
| Task | Layer | Scheduling note |
|---|---|---|
| Edit repo, generate patches | L3 Claude Code (API) | No on-machine large model |
| Build, test, archive | L1 Runner | Burst highest priority; stop L2 before peak |
| CI failure log summary | L2 qwen3:8b | Async after job, or nightly batch |
| CodeGraph / RAG embedding | L2 nomic-embed-text | Can stay resident by day (<1GB) |
Runner peaks and CI staggering
L1 Runner produces Fact — inference cannot replace build results. Minimum CI-side change:
    # .github/workflows/ios.yml · self-hosted macOS runner
    - name: Enter CI mode - free memory for xcodebuild
      run: |
        ollama ps || true   # informational only; must not fail the step if the server is down
        ollama stop qwen3:8b 2>/dev/null || true
        ollama stop qwen3:14b 2>/dev/null || true
        sleep 30  # wait for memory reclaim; do not start xcodebuild same second
    - name: Build
      run: xcodebuild ...
Measured: on 24GB M4, ollama stop qwen3:8b frees 5–7 GB in about 5–15 seconds. If Swap already happened, full reclaim can take minutes — so CI stop must run at least 30 seconds before the build.
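Because reclaim time varies, a fixed wait can be followed by a short poll that checks whether memory has actually recovered. This is a sketch of that alternative, not the measured setup: `free_pct` and `wait_for_free` are hypothetical names, the 40% target in the usage note is an arbitrary assumption, and the parser assumes `memory_pressure` prints a line like "System-wide memory free percentage: 76%".

```shell
#!/usr/bin/env bash
# Sketch: poll free memory after `ollama stop` instead of trusting a fixed sleep alone.

# free_pct: free-memory percentage as reported by macOS memory_pressure.
# Assumes a summary line of the form "System-wide memory free percentage: NN%".
free_pct() {
  memory_pressure 2>/dev/null |
    awk -F': ' '/free percentage/ { gsub(/%/, "", $2); print $2 }'
}

# wait_for_free MIN_PCT TRIES INTERVAL_S [PROBE]
# Returns 0 as soon as PROBE reports >= MIN_PCT, 1 if it never does within TRIES polls.
wait_for_free() {
  local min_pct=$1 tries=$2 interval=$3 probe=${4:-free_pct} pct i
  for (( i = 0; i < tries; i++ )); do
    pct=$("$probe")
    if [ -n "$pct" ] && (( pct >= min_pct )); then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}
```

In a CI step this could follow the existing wait, for example `sleep 30; wait_for_free 40 12 10 || echo "memory still tight"`. The probe is a parameter so the loop can be exercised off a Mac with a stub function.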
Runbook: 30-second version and full scripts
30-second version (most people only need these three blocks)
Skip the full script? Copy the three blocks below — they cover ~80% of cases:
① Before CI (highest leverage) — in GitHub Actions or a Runner hook:
    ollama stop qwen3:8b
    ollama stop qwen3:14b
    sleep 30
② Daytime — keep only light embedding, no large model weights:
    ollama run nomic-embed-text --keepalive 30m
③ Night — load 8B in the batch window:
    ollama run qwen3:8b
    # then run your log summary / embedding rebuild script
Full version (production runbook)
For LaunchAgent / cron / multi-environment reuse, save as ~/bin/cloud-mac-stack-runbook.sh:
Full runbook · subcommands
day-start · ci-pre · ci-post · night-batch
    #!/usr/bin/env bash
    # cloud-mac-stack-runbook.sh: L2-Q03 standard runbook
    set -euo pipefail

    OLLAMA_HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
    # Read by `ollama serve`, so it only affects a server started from this script
    export OLLAMA_MAX_LOADED_MODELS=1

    ensure_ollama() {
      # Start the server only when the API does not answer, then give it a moment
      if ! curl -sf "http://${OLLAMA_HOST}/api/tags" >/dev/null; then
        ollama serve >/dev/null 2>&1 &
        sleep 2
      fi
    }

    ci_pre() {
      # Before CI: force CI mode, L1 wins
      ollama ps || true
      ollama stop qwen3:8b 2>/dev/null || true
      ollama stop qwen3:14b 2>/dev/null || true
      sleep 30
    }

    ci_post() {
      # After CI: restore light embedding (lowest background tier)
      ensure_ollama
      ollama run nomic-embed-text --keepalive 10m
    }

    day_start() {
      # Day login / boot: embedding only or full stop
      ensure_ollama
      ollama stop qwen3:8b 2>/dev/null || true
      ollama stop qwen3:14b 2>/dev/null || true
      ollama run nomic-embed-text --keepalive 30m
    }

    night_batch() {
      # Nightly batch (cron: 0 22 * * *)
      ensure_ollama
      ollama run qwen3:8b --keepalive 6h
      # ./your-log-summary-or-embed-rebuild.sh
    }

    case "${1:-}" in
      day-start)   day_start ;;
      ci-pre)      ci_pre ;;
      ci-post)     ci_post ;;
      night-batch) night_batch ;;
      *) echo "Usage: $0 {day-start|ci-pre|ci-post|night-batch}"; exit 1 ;;
    esac
Memory guard (cron every 5 minutes, or LaunchAgent): when system memory pressure rises, unload large models automatically — minimal threshold scheduling.
    #!/usr/bin/env bash
    # memory-guard.sh: memory_pressure threshold guard
    # Assumes memory_pressure ends with "System-wide memory free percentage: NN%".
    # "Used > ~70%" from the rules corresponds to free < 30%.
    FREE_PCT=$(memory_pressure 2>/dev/null |
      awk -F': ' '/free percentage/ { gsub(/%/, "", $2); print $2 }')

    if [ -n "$FREE_PCT" ] && [ "$FREE_PCT" -lt 30 ]; then
      logger -t cloud-mac-stack "memory guard: stopping Ollama 8B/14B (free ${FREE_PCT}%)"
      ollama stop qwen3:8b 2>/dev/null || true
      ollama stop qwen3:14b 2>/dev/null || true
    fi

    # Optional: when pressure is normal and no Runner job is active, restore embedding
    # if [ "${FREE_PCT:-0}" -gt 50 ]; then ... fi
Wiring examples:
- GitHub Actions: first step ci-pre, last step ci-post
- LaunchAgent: day-start on login; cron night-batch at 22:00
- cron: */5 * * * * /path/memory-guard.sh
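The guard's optional restore step needs to know whether a Runner job is active. One way to share that state between the CI steps and the cron guard is a flag file. This is a sketch under stated assumptions: the flag path and the names `ci_enter`, `ci_leave`, `ci_active`, and `may_restore_embedding` are hypothetical and do not appear in the scripts above.

```shell
#!/usr/bin/env bash
# Sketch: a flag file marks "CI mode" so the memory guard does not reload models mid-build.

CI_FLAG="${CI_FLAG:-${TMPDIR:-/tmp}/cloud-mac-ci.flag}"   # hypothetical location

ci_enter()  { : > "$CI_FLAG"; }      # call at the start of ci-pre
ci_leave()  { rm -f "$CI_FLAG"; }    # call at the end of ci-post
ci_active() { [ -e "$CI_FLAG" ]; }

# may_restore_embedding FREE_PCT -> status 0 only when no job is active
# and free memory is above 50% (the "normal" threshold from the rules)
may_restore_embedding() {
  ! ci_active && (( $1 > 50 ))
}
```

A job that crashes before ci-post would leave the flag behind, so a production version would likely also check the flag's age or the Runner process. That part is left out here.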
How to schedule a 16GB Mac mini
- No 14B resident; 8B only in night-batch
- Daytime runbook: day-start only (embedding or full stop)
- Every CI run must call ci-pre, no exceptions
- Default “desktop + 8B + Claude Code all online” → choose 24GB; scheduling cannot fix the hardware floor
16GB rule of thumb: no large models by day, one model one job at night, always ci-pre before CI. 24GB rule of thumb: embedding OK by day, stagger 8B with CI, 14B nights only.
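The two rules of thumb can be condensed into a lookup that a wrapper script consults before loading a model. This is one possible reading of the rules, not a definitive policy: `allowed_models` is a hypothetical function, hosts with 24 GB or more are treated as the 24GB tier, and the list means "may be loaded in this phase", not "all resident at once".

```shell
#!/usr/bin/env bash
# Sketch: which Ollama models may be loaded, by installed RAM and phase.

# allowed_models RAM_GB PHASE -> space-separated model names (PHASE: day | ci | night)
allowed_models() {
  local ram=$1 phase=$2
  case "$phase" in
    ci)
      echo "nomic-embed-text" ;;          # ci-pre stops 8B/14B; light embedding may stay
    day)
      if (( ram >= 24 )); then
        echo "nomic-embed-text qwen3:8b"  # 8B on demand, staggered with CI
      else
        echo "nomic-embed-text"           # 16GB: no large models by day
      fi ;;
    night)
      if (( ram >= 24 )); then
        echo "qwen3:8b qwen3:14b"         # 14B nights only
      else
        echo "qwen3:8b"                   # 16GB: one model, one job
      fi ;;
    *)
      return 1 ;;                         # unknown phase
  esac
}
```

For example, `allowed_models 16 night` prints `qwen3:8b`, and a wrapper could refuse any `ollama run` whose model is not in the returned list.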
Decision table: which strategy fits you
| Your situation | Recommendation |
|---|---|
| 24GB · few pushes · mainly Claude Code | day-start + embedding; no nightly batch |
| 24GB · daily CI · want log summaries | ci-pre/post + night-batch + memory guard |
| 16GB · must run local 8B | night-batch only; Claude API by day |
| Ollama still zero calls for a week | Back to L2-Q01 and define the pipeline first |
Series placement · Cloud Mac AI Stack
L2-Q03 · Memory Scheduling Layer: for outside readers it answers “how do I avoid Swap on Mac mini”; within the series it continues the L2-Q01 private inference layer with same-host scheduling:
- L2-Q01 — what Inference is (placement)
- L2-Q03 · this article — Memory Scheduling Layer (same-machine scheduling)
- Planned — model pin, port 11434 health checks, CI-side Ollama calls
- Downstream — L4 Context + MCP scheduling (tomorrow L4-Q03)
Not an Ollama tutorial or pure CI tuning piece — it is the first layer of an AI workload scheduler on Apple Silicon.
How this relates to published articles
- L2-Q01 · private inference layer — L2 placement; this article is the scheduling follow-up.
- L2-Q02 · 16GB vs 24GB — Swap numbers source; benchmarks not repeated here.
- L1-Q01 · Runner — burst has highest priority.
- L3 · Claude Code — interactive main path.
FAQ
Will Ollama, Claude Code, and GitHub Actions together cause Swap on Mac mini?
Yes, if burst, interactive, and background workloads have no priority. See the real issue and bad schedule.
Does Ollama need to run all the time?
No. Treat local inference as schedulable: keep only nomic-embed-text by day; load large models on demand or in nightly batch.
Should I stop Ollama during CI builds?
Yes, recommended. See 30-second runbook · before CI, or full ci-pre.
Can Claude Code and Ollama run at the same time?
Yes. Coding uses the API; on-machine contention comes from local model weights and CI peaks. Use stop-before-CI or time windows.
On Cloud Mac, is OOM just “not enough RAM”?
In setups we observe, most OOM traces to scheduling, not absolute shortage. See L2-Q02 measurements for the 16GB hard floor.
Time windows vs memory_pressure thresholds—which should I use?
Start with the 30-second runbook; add memory-guard.sh once stable (full runbook above).
How is this different from L2-Q01?
Q01 defines the Inference Service role; Q03 (Memory Scheduling Layer) explains how to schedule AI workloads on one machine.
Cloud Mac AI Stack · coming tomorrow
Claude Code + MCP: wiring GitHub, local files, and APIs into one chain
Once L2 scheduling is pinned, the next layer is L4 Context: how MCP connects GitHub, repo files, and APIs into the Claude Code workflow.