Is there a command set I can copy in 30 seconds?

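Yes. Before CI: `ollama stop qwen3:8b; ollama stop qwen3:14b; sleep 30` (use `;` rather than `&&` so the sleep still runs even if one stop command fails). Daytime: keep only nomic-embed-text loaded, with a 30m keepalive. Night: `ollama run qwen3:8b ""`, where the empty prompt loads the model without opening an interactive session.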

How is this article different from L2-Q01?

L2-Q01 defines the Inference Service role; L2-Q03 (Memory Scheduling Layer) explains how to schedule AI workloads on one machine.

AI Workload Scheduling on Mac mini: How to Avoid Swap from Ollama + Claude Code + GitHub Runner

Q: Will Ollama, Claude Code, and GitHub Actions together cause Swap on Mac mini?

Yes, if burst, interactive, and background workloads all run at full tilt with no priority. Most Swap can be avoided with ollama stop before CI and memory_pressure threshold scheduling.

Q: Does Ollama need to run all the time?

No. Treat Ollama as a schedulable resource: keep only a light embedding model during the day; load larger models on demand or in a nightly batch window.

Q: Should I stop Ollama during CI builds?

Yes, recommended. xcodebuild peaks plus a loaded 8B model are a common Swap trigger on M4; run ollama stop before the job and sleep 30 seconds.

Q: Can Claude Code and Ollama run at the same time?

Yes. Claude Code’s main path uses the API; on-machine contention is Ollama weights and CI compile peaks. Use time windows or ollama stop before CI.

Q: On Cloud Mac, is OOM just “not enough RAM”?

In our Cloud Mac observations, most OOM cases trace to workload scheduling, not absolute memory shortage.

Q: Time windows vs memory_pressure thresholds—which should I use?

Start with the 30-second runbook; once stable, add memory-guard.sh (see full runbook below).

When you run Ollama, Claude Code, and GitHub Actions together on a Mac mini or Cloud Mac, the most common failure mode is not “the machine is too slow” — it is Swap, a sluggish CLI, and CI builds that drag.

This piece does not cover tool installation. It explains why three AI workloads on one host cause memory churn, and how scheduling prevents it. You get a bad-schedule example, a memory budget, threshold-based rules, a 30-second runbook, and full scripts. It does not repeat the 16GB vs 24GB benchmarks.

TL;DR

Ollama always-on + CI burst → Swap is very common on M4 (see bad schedule)
The fix is usually not a bigger Mac — assign schedules to burst, interactive, and background workloads
ollama stop before CI is the highest-leverage, easiest step (30-second version)

As you read, keep the AI workload scheduling model in mind: in Cloud Mac setups we observe, most Swap and OOM come from scheduling, not from running out of RAM on paper.

The real issue: usually not “not enough RAM,” but no priority

The surface story is “local model idles + CI spikes + coding agent feels slow” — and teams misread it as “we need more RAM.”

On Cloud Mac hosts we watch, most Swap and OOM trace to workload scheduling, not absolute memory shortage. Three workload classes share one Mac mini (details in the next section):

Burst — GitHub Runner / xcodebuild: +4–8 GB on push
Interactive — Claude Code: IDE + terminal + large-repo indexing
Background — local inference (Ollama): 8B loaded and idle still holds 5–7 GB

By default all three compete equally for unified memory — nobody yields. The fix is to define workload priority and triggers, and treat background inference as unloadable and preemptible by CI events, not “installed, so always resident.”

Bad schedule example (common on M4, not a one-off)

This is a combination we have seen on small-team Cloud Mac hosts — treat it as a “do not onboard like this” pattern:

Bad pattern · all three workloads "always on"

  qwen3:8b  always loaded          →  5–7 GB (background, zero calls)
  push triggers xcodebuild         →  +4–8 GB peak (burst)
  Claude Code indexing large repo  →  +2–3 GB (interactive)

Observed on 24GB M4 Mac mini
  Swap  0 → 2.1 GB
  xcodebuild link stage latency  ~+40%
  Claude Code terminal noticeably sluggish

On M4 unified memory, 8B resident + CI burst + interactive coding at once very often produces Swap. The point of the example: workload scheduling is not polish — it is a prerequisite when multiple workloads share one host.

Three workload shapes: burst / interactive / background

Scheduling by clock alone is not enough. On Cloud Mac, the three classes have very different memory curves:

Shape	Examples	Layer	Memory profile	Schedule priority
Burst	`xcodebuild`, linking, simulators	L1 Runner	Spiky, hard to predict	Highest — Fact must not fail
Interactive	Claude Code, IDE, SSH sessions	L3	Moderate, human in the loop	High — reasoning via API, not local LLM weights
Background	Ollama embedding, log summaries	L2	Deferrable, unloadable	Lowest — must yield to burst

In one line: L2 is the only Stack layer that can be kicked off the machine — not because it is unimportant, but because its jobs are mostly async and retryable.

Memory budget: what each workload costs

Below, Ollama / Claude Code / GitHub Runner are example components. Numbers come from M4 Mac mini measurements (steady state, not compile spikes):

Component	Layer	Typical use	Peak notes
macOS + system cache	L0	3–4 GB	Relatively stable
Claude Code workspace	L3	1–3 GB	Reasoning via API — no model weights
GitHub Runner job	L1	2–6 GB (steady)	Link stage +4–8 GB instant
Ollama · qwen3:8b	L2	5–7 GB	Released with `ollama stop`
Ollama · qwen3:14b	L2	9–13 GB	With burst, Swap 2GB+ is easy
Ollama · nomic-embed-text	L2	0.3–0.8 GB	Light background you can keep by day

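Rough daytime math on 24GB: the steady-state rows for macOS, Claude Code, a Runner job, and a resident 8B model sum to 11–20 GB, typically ≈ 17 GB. That still leaves headroom until a link-stage burst adds 4–8 GB on top. Make the resident model 14B instead and the typical total climbs past 22 GB, leaving almost nothing for a CI burst. The budget table answers “can we”; scheduling answers “who should hold memory when.”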
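The arithmetic behind that estimate can be checked with a few lines of shell. This is a minimal sketch: `budget_range` is a hypothetical helper, the inputs are the steady-state ranges from the table above (integers only, so the sub-1 GB embedding row is left out), and link-stage bursts are deliberately excluded.

```shell
#!/usr/bin/env bash
# budget_range LOW-HIGH [LOW-HIGH ...]
# Sums the low ends and the high ends of steady-state ranges given in GB.
# The values passed below are the table's figures, not new measurements.
budget_range() {
  local low=0 high=0 pair
  for pair in "$@"; do
    low=$(( low + ${pair%-*} ))     # text before the dash
    high=$(( high + ${pair#*-} ))   # text after the dash
  done
  echo "${low}-${high}"
}

# macOS 3-4, Claude Code 1-3, Runner steady 2-6, qwen3:8b 5-7
budget_range 3-4 1-3 2-6 5-7     # prints 11-20
# Same stack with qwen3:14b (9-13) resident instead of the 8B model
budget_range 3-4 1-3 2-6 9-13    # prints 15-26
```

The ≈ 17 GB figure sits inside the 11–20 GB range. With 14B resident, the upper end already exceeds 24 GB before any compile spike.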

Scheduling model: time windows to memory_pressure thresholds

Start with a day/night table (mode ② below). The more operable upgrade is threshold scheduling on memory pressure — you do not rely on someone remembering “22:00 is batch time”:

AI Workload Scheduler · L2-Q03 recommended rules

When memory_pressure reports warn / critical (or memory in use exceeds ~70%):
  → auto ollama stop qwen3:8b / qwen3:14b

When memory pressure is normal (memory in use below ~50%) and the host has been idle > 10 min:
  → auto preload nomic-embed-text (keepalive 10m)

When CI event trigger (Runner job start):
  → force CI mode: stop all large Ollama models; priority L1 > L3 > L2

When CI job succeeds and memory recovers:
  → async L2 batch (log summary / embedding rebuild)

Time windows are the baseline; threshold scheduling is the upgrade. Small teams can copy the runbook + CI stop first; add a memory_pressure guard script once stable (see Runbook).
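The first three rules can be written as one pure decision function, which makes the thresholds easy to test before wiring them to real commands. This is only a sketch: `decide_action` and its action names are hypothetical, the 70% / 50% / 10-minute values are the approximate ones from the rules above, and the fourth rule (async L2 batch after a successful job) is left to the CI side.

```shell
#!/usr/bin/env bash
# decide_action USED_PCT IDLE_MIN CI_ACTIVE
#   USED_PCT   memory in use, integer percent
#   IDLE_MIN   minutes since the last interactive or CI activity
#   CI_ACTIVE  1 while a Runner job is running, else 0
# Prints one of: ci-mode, stop-large, preload-embed, none
decide_action() {
  local used_pct=$1 idle_min=$2 ci_active=$3
  if [ "$ci_active" -eq 1 ]; then
    echo "ci-mode"          # CI event wins: L1 > L3 > L2
  elif [ "$used_pct" -gt 70 ]; then
    echo "stop-large"       # unload qwen3:8b / qwen3:14b
  elif [ "$used_pct" -lt 50 ] && [ "$idle_min" -gt 10 ]; then
    echo "preload-embed"    # quiet host: bring back nomic-embed-text
  else
    echo "none"
  fi
}
```

Checking the CI flag first encodes the priority order: a running job forces CI mode even when memory currently looks healthy.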

Three baseline schedule modes

Use these with threshold rules:

Mode	Approach	Best for
① Light coexistence	Daytime: only `nomic-embed-text` resident; load 8B/14B on demand	16GB Mac mini, coding-first
② Time split	09–18 coding+CI / 22–06 nightly batch	24GB, scheduled embedding / log jobs
③ CI yield	`ollama stop` before job; async L2 after	Frequent push, high xcodebuild peaks

Recommended combo: ① + ③ for everyone by default; ② for nightly batch; threshold guard as safety net.
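For mode ②, the window lookup itself is small enough to sketch. `window_for_hour` is a hypothetical helper. The day and night boundaries are the 09–18 and 22–06 windows from the table, and calling the uncovered hours `idle` is an assumption of this sketch, not something the table defines.

```shell
#!/usr/bin/env bash
# window_for_hour HOUR   (HOUR is 0-23, zero-padded values such as "09" are accepted)
# Prints: day | night | idle
window_for_hour() {
  local h=$((10#$1))   # force base 10 so "08" and "09" are not parsed as octal
  if [ "$h" -ge 9 ] && [ "$h" -lt 18 ]; then
    echo day           # coding + CI window
  elif [ "$h" -ge 22 ] || [ "$h" -lt 6 ]; then
    echo night         # nightly batch window
  else
    echo idle          # 06-09 and 18-22: not assigned by the table
  fi
}

# Typical call from a cron or LaunchAgent script:
#   window_for_hour "$(date +%H)"
```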

Pipeline split: what runs on L2 vs L3

Define the pipeline before the schedule; otherwise Ollama still sits loaded with nothing calling it (see L2-Q01 · typical misjudgment):

Task	Layer	Scheduling note
Edit repo, generate patches	L3 Claude Code (API)	No on-machine large model
Build, test, archive	L1 Runner	Burst highest priority; stop L2 before peak
CI failure log summary	L2 `qwen3:8b`	Async after job, or nightly batch
CodeGraph / RAG embedding	L2 `nomic-embed-text`	Can stay resident by day (<1GB)
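As an illustration of the "CI failure log summary" row, the following sketch feeds the tail of a build log to the local model after the job has finished. `summarize_ci_log`, the 200-line tail, and the prompt wording are all assumptions. It relies only on the one-shot form `ollama run MODEL "prompt"`.

```shell
#!/usr/bin/env bash
# summarize_ci_log LOGFILE [MODEL]
# Sends the last lines of a CI log to a local model and prints its reply.
# Meant to run after the job (async) or in the nightly batch, never during a build.
summarize_ci_log() {
  local logfile=$1 model=${2:-qwen3:8b}
  if [ ! -r "$logfile" ]; then
    echo "cannot read $logfile" >&2
    return 1
  fi
  # Failures usually surface near the end of the log; the tail also bounds prompt size.
  local excerpt
  excerpt=$(tail -n 200 "$logfile")
  ollama run "$model" "Summarize the likely root cause of this CI failure in five bullet points:
$excerpt"
}

# Example (path is a placeholder):
#   summarize_ci_log /path/to/xcodebuild.log > summary.txt
```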

Runner peaks and CI staggering

The L1 Runner produces Fact, meaning the actual build and test results, and inference cannot replace them. The minimum CI-side change:

# .github/workflows/ios.yml · self-hosted macOS runner
- name: Enter CI mode — free memory for xcodebuild
  run: |
    ollama ps
    ollama stop qwen3:8b 2>/dev/null || true
    ollama stop qwen3:14b 2>/dev/null || true
    sleep 30   # wait for memory reclaim; do not start xcodebuild same second

- name: Build
  run: xcodebuild ...

Measured: on 24GB M4, ollama stop qwen3:8b frees 5–7 GB in about 5–15 seconds. If Swap already happened, full reclaim can take minutes — so CI stop must run at least 30 seconds before the build.
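A fixed sleep can be complemented by polling `ollama ps` until the model is gone, then keeping a short settle delay for memory reclaim. This is a sketch under assumptions: `wait_unloaded` is a hypothetical helper, and a model's absence from `ollama ps` only shows that it was unloaded, not that Swap has been fully reclaimed.

```shell
#!/usr/bin/env bash
# wait_unloaded MODEL [TIMEOUT_S]
# Returns 0 once MODEL no longer appears in `ollama ps`, 1 if it is still listed
# after TIMEOUT_S seconds (default 60).
wait_unloaded() {
  local model=$1 timeout=${2:-60} waited=0
  while ollama ps 2>/dev/null | grep -qF "$model"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible use in the CI step, keeping the settle delay recommended above:
#   ollama stop qwen3:8b 2>/dev/null || true
#   wait_unloaded qwen3:8b 60 && sleep 30
```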

Runbook: 30-second version and full scripts

30-second version (most people only need these three blocks)

Skip the full script? Copy the three blocks below — they cover ~80% of cases:

① Before CI (highest leverage) — in GitHub Actions or a Runner hook:

ollama stop qwen3:8b
ollama stop qwen3:14b
sleep 30

② Daytime — keep only light embedding, no large model weights:

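# Preload through the API: `ollama run` on an embedding-only model can fail on some Ollama versions
curl -sf http://127.0.0.1:11434/api/embed -d '{"model":"nomic-embed-text","input":"warmup","keep_alive":"30m"}' >/dev/null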

③ Night — load 8B in the batch window:

ollama run qwen3:8b ""   # empty prompt: load the model and return instead of opening the interactive prompt
# then run your log summary / embedding rebuild script

Full version (production runbook)

For LaunchAgent / cron / multi-environment reuse, save as ~/bin/cloud-mac-stack-runbook.sh:

Full runbook · subcommands

day-start · ci-pre · ci-post · night-batch

#!/usr/bin/env bash
# cloud-mac-stack-runbook.sh — L2-Q03 standard runbook
set -euo pipefail

OLLAMA_HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
# Read by `ollama serve`, so this only takes effect when this script starts the server
export OLLAMA_MAX_LOADED_MODELS=1

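ensure_ollama() {
  # Start the server only if the API does not answer, then give it a moment to bind
  if ! curl -sf "http://${OLLAMA_HOST}/api/tags" >/dev/null; then
    ollama serve >/dev/null 2>&1 &
    sleep 2
  fi
}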

ci_pre() {
  # Before CI: force CI mode, L1 wins
  ollama ps || true
  ollama stop qwen3:8b 2>/dev/null || true
  ollama stop qwen3:14b 2>/dev/null || true
  sleep 30
}

ci_post() {
  # After CI: restore light embedding (lowest background tier)
  ensure_ollama
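  # Preload through the API; `ollama run` on an embedding-only model can fail on some versions
  curl -sf "http://${OLLAMA_HOST}/api/embed" -d '{"model":"nomic-embed-text","input":"warmup","keep_alive":"10m"}' >/dev/null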
}

day_start() {
  # Day login / boot: embedding only or full stop
  ensure_ollama
  ollama stop qwen3:8b 2>/dev/null || true
  ollama stop qwen3:14b 2>/dev/null || true
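  # Preload through the API; `ollama run` on an embedding-only model can fail on some versions
  curl -sf "http://${OLLAMA_HOST}/api/embed" -d '{"model":"nomic-embed-text","input":"warmup","keep_alive":"30m"}' >/dev/null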
}

night_batch() {
  # Nightly batch (cron: 0 22 * * *)
  ensure_ollama
  ollama run qwen3:8b --keepalive 6h ""   # empty prompt preloads the model without an interactive session
  # ./your-log-summary-or-embed-rebuild.sh
}

case "${1:-}" in
  day-start)   day_start ;;
  ci-pre)      ci_pre ;;
  ci-post)     ci_post ;;
  night-batch) night_batch ;;
  *) echo "Usage: $0 {day-start|ci-pre|ci-post|night-batch}"; exit 1 ;;
esac

Memory guard (cron every 5 minutes, or LaunchAgent): when system memory pressure rises, unload large models automatically — minimal threshold scheduling.

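#!/usr/bin/env bash
# memory-guard.sh: memory_pressure threshold guard
# The first line of `memory_pressure` output is a page-count summary and never
# contains warn/critical, so parse the "free percentage" line and also read the
# kernel pressure level (assumed mapping: 1 normal, 2 warn, 4 critical).
FREE=$(memory_pressure 2>/dev/null | awk '/free percentage/ {gsub(/%/, "", $NF); print $NF}')
LEVEL=$(sysctl -n kern.memorystatus_vm_pressure_level 2>/dev/null || echo 1)

# Trigger on warn/critical, or when less than ~30% is free (roughly > 70% in use)
if [ "${LEVEL:-1}" -ge 2 ] || [ "${FREE:-100}" -lt 30 ]; then
  logger -t cloud-mac-stack "memory guard: stopping Ollama 8B/14B (level=${LEVEL} free=${FREE}%)"
  ollama stop qwen3:8b 2>/dev/null || true
  ollama stop qwen3:14b 2>/dev/null || true
fi

# Optional: when pressure normal and no Runner job, restore embedding
# if [ "${LEVEL:-1}" -eq 1 ]; then ... fi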
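The optional restore branch needs two inputs: a memory reading and a check that no Runner job is active. A possible sketch follows. `parse_free_pct`, `runner_busy`, and `can_restore_embedding` are hypothetical helpers, and the sketch assumes that `memory_pressure` prints a "System-wide memory free percentage" line and that the self-hosted runner spawns a process named `Runner.Worker` for each job.

```shell
#!/usr/bin/env bash
# parse_free_pct: reads `memory_pressure` output on stdin, prints the free percentage.
parse_free_pct() {
  awk '/System-wide memory free percentage/ { gsub(/%/, "", $NF); print $NF }'
}

# runner_busy: succeeds while a job is executing on this host.
# Assumption: the runner starts a "Runner.Worker" process for each job.
runner_busy() {
  pgrep -f "Runner.Worker" >/dev/null 2>&1
}

# can_restore_embedding FREE_PCT
# Succeeds when more than 50% of memory is free and no Runner job is active.
can_restore_embedding() {
  local free_pct=$1
  [ "$free_pct" -gt 50 ] && ! runner_busy
}

# Possible use at the end of memory-guard.sh:
#   free=$(memory_pressure | parse_free_pct)
#   if can_restore_embedding "${free:-0}"; then ...preload nomic-embed-text...; fi
```

Defaulting a missing reading to 0 keeps the guard conservative: if the parse fails, nothing is preloaded.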

Wiring examples:

GitHub Actions — first step ci-pre, last step ci-post
LaunchAgent — day-start on login; cron night-batch at 22:00
cron — */5 * * * * /path/memory-guard.sh
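For the Runner-hook route, self-hosted runners can run a script at job start and job end through the `ACTIONS_RUNNER_HOOK_JOB_STARTED` and `ACTIONS_RUNNER_HOOK_JOB_COMPLETED` variables in the runner's `.env` file. The dispatcher below is a sketch: `run_ci_phase` is a hypothetical helper, the default runbook path is the one suggested above, and treating ci-post as best effort is a design assumption, made because a failing hook can fail the job.

```shell
#!/usr/bin/env bash
# run_ci_phase PHASE [RUNBOOK]
# Calls the runbook with ci-pre or ci-post.
run_ci_phase() {
  local phase=$1 runbook=${2:-$HOME/bin/cloud-mac-stack-runbook.sh}
  case "$phase" in
    ci-pre)  bash "$runbook" ci-pre ;;           # a failure here should stop the job
    ci-post) bash "$runbook" ci-post || true ;;  # restoring embedding is best effort
    *) echo "unknown phase: $phase" >&2; return 2 ;;
  esac
}

# Two thin hook scripts could then source this file and call, respectively:
#   run_ci_phase ci-pre     # script referenced by ACTIONS_RUNNER_HOOK_JOB_STARTED
#   run_ci_phase ci-post    # script referenced by ACTIONS_RUNNER_HOOK_JOB_COMPLETED
```

With hooks in place, individual workflows no longer need their own ci-pre step, which makes the "no exceptions" rule easier to enforce.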

How to schedule a 16GB Mac mini

No 14B resident; 8B only in night-batch
Daytime runbook: day-start only (embedding or full stop)
Every CI run must call ci-pre, no exceptions
If your default is “desktop + 8B + Claude Code all online”, choose 24GB; scheduling cannot fix the hardware floor

16GB rule of thumb: no large models by day, one model one job at night, always ci-pre before CI. 24GB rule of thumb: embedding OK by day, stagger 8B with CI, 14B nights only.
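The two rules of thumb can be encoded as a lookup that scripts consult before loading anything. This is one reading of the rules, not a definitive policy: `allowed_models` is a hypothetical helper, and listing qwen3:8b for 24GB daytime assumes it is still unloaded by ci-pre before every build.

```shell
#!/usr/bin/env bash
# allowed_models RAM_GB WINDOW
#   RAM_GB  16 or 24
#   WINDOW  day or night
# Prints the models that may be loaded, separated by spaces.
allowed_models() {
  local ram=$1 window=$2
  case "${ram}:${window}" in
    16:day)   echo "nomic-embed-text" ;;            # no large models by day
    16:night) echo "qwen3:8b" ;;                    # one model, one job
    24:day)   echo "nomic-embed-text qwen3:8b" ;;   # 8B staggered with CI
    24:night) echo "qwen3:8b qwen3:14b" ;;          # 14B nights only
    *) echo "unsupported: ${ram}GB ${window}" >&2; return 1 ;;
  esac
}

# Example guard before a load:
#   case " $(allowed_models 16 day) " in *" qwen3:14b "*) echo ok ;; *) echo "not allowed" ;; esac
```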

Decision table: which strategy fits you

Your situation	Recommendation
24GB · few pushes · mainly Claude Code	`day-start` + embedding; no nightly batch
24GB · daily CI · want log summaries	`ci-pre/post` + `night-batch` + memory guard
16GB · must run local 8B	`night-batch` only; Claude API by day
Ollama has had zero calls for a week	Back to L2-Q01 and define the pipeline first

Series placement · Cloud Mac AI Stack

L2-Q03 · Memory Scheduling Layer: for outside readers it answers “how do I avoid Swap on Mac mini”; within the series it extends the L2-Q01 private inference layer with same-host scheduling:

L2-Q01 — what Inference is (placement)
L2-Q03 · this article — Memory Scheduling Layer (same-machine scheduling)
Planned — model pin, port 11434 health checks, CI-side Ollama calls
Downstream — L4-Q03 · MCP triple-connect Hub

Not an Ollama tutorial or pure CI tuning piece — it is the first layer of an AI workload scheduler on Apple Silicon.

L2-Q01 · private inference layer — L2 placement; this article is the scheduling follow-up.
L2-Q02 · 16GB vs 24GB — Swap numbers source; benchmarks not repeated here.
L1-Q01 · Runner — burst has highest priority.
L3 · Claude Code — interactive main path.

FAQ

Will Ollama, Claude Code, and GitHub Actions together cause Swap on Mac mini?
Yes, if burst, interactive, and background workloads have no priority. See the real issue and bad schedule.

Does Ollama need to run all the time?
No. Treat local inference as schedulable: keep only nomic-embed-text by day; load large models on demand or in nightly batch.

Should I stop Ollama during CI builds?
Yes, recommended. See 30-second runbook · before CI, or full ci-pre.

Can Claude Code and Ollama run at the same time?
Yes. Coding uses the API; on-machine contention is model weights and CI peaks. Use stop-before-CI or time windows.

On Cloud Mac, is OOM just “not enough RAM”?
In setups we observe, most OOM traces to scheduling, not absolute shortage. See L2-Q02 measurements for the 16GB hard floor.

Time windows vs memory_pressure thresholds—which should I use?
Start with the 30-second runbook; add memory-guard.sh once stable (full runbook above).

How is this different from L2-Q01?
Q01 defines the Inference Service role; Q03 (Memory Scheduling Layer) explains how to schedule AI workloads on one machine.
