AI Workload Scheduling on Mac mini: How to Avoid Swap from Ollama + Claude Code + GitHub Runner

L2-Q03 · Memory Scheduling Layer

2026.06.04  ·  ~14 min read  ·  ops runbook, not an install tutorial

Unified memory on Mac mini and AI workload scheduling: Ollama, Claude Code, and GitHub Actions scheduled to avoid Swap

When you run Ollama, Claude Code, and GitHub Actions together on a Mac mini or Cloud Mac, the most common failure mode is not “the machine is too slow” — it is Swap, a sluggish CLI, and CI builds that drag.

This piece does not cover tool installation. It explains why three AI workloads on one host cause memory churn, and how scheduling prevents it. You get a bad-schedule example, a memory budget, threshold-based rules, a 30-second runbook, and full scripts. It does not repeat the 16GB vs 24GB benchmarks.

TL;DR

  • Ollama always-on + CI burst → Swap is very common on M4 (see bad schedule)
  • The fix is usually not a bigger Mac — assign schedules to burst, interactive, and background workloads
  • ollama stop before CI is the highest-leverage, easiest step (30-second version)

As you read, keep the AI workload scheduling model in mind: in Cloud Mac setups we observe, most Swap and OOM come from scheduling, not from running out of RAM on paper.

AI workload scheduling · 30s minimum runbook · 0 brew install steps

The real issue: usually not “not enough RAM,” but no priority

The surface story is “local model idles + CI spikes + coding agent feels slow” — and teams misread it as “we need more RAM.”

On Cloud Mac hosts we watch, most Swap and OOM trace to workload scheduling, not absolute memory shortage. Three workload classes share one Mac mini (details in the next section):

  • Burst: GitHub Runner / xcodebuild, +4–8 GB on push
  • Interactive: Claude Code, with IDE + terminal + large-repo indexing
  • Background: local inference (Ollama); an 8B model that is loaded and idle still holds 5–7 GB

By default all three compete equally for unified memory — nobody yields. The fix is to define workload priority and triggers, and treat background inference as unloadable and preemptible by CI events, not “installed, so always resident.”

Bad schedule example (common on M4, not a one-off)

This is a combination we have seen on small-team Cloud Mac hosts — treat it as a “do not onboard like this” pattern:

Bad pattern · all three workloads "always on"

  qwen3:8b  always loaded          →  5–7 GB (background, zero calls)
  push triggers xcodebuild         →  +4–8 GB peak (burst)
  Claude Code indexing large repo  →  +2–3 GB (interactive)

Observed on 24GB M4 Mac mini
  Swap  0 → 2.1 GB
  xcodebuild link stage latency  ~+40%
  Claude Code terminal noticeably sluggish

On M4 unified memory, 8B resident + CI burst + interactive coding at once very often produces Swap. The point of the example: workload scheduling is not polish — it is a prerequisite when multiple workloads share one host.

Three workload shapes: burst / interactive / background

Scheduling by clock alone is not enough. On Cloud Mac, the three classes have very different memory curves:

Shape | Examples | Layer | Memory profile | Schedule priority
Burst | xcodebuild, linking, simulators | L1 Runner | Spiky, hard to predict | Highest: Fact must not fail
Interactive | Claude Code, IDE, SSH sessions | L3 | Moderate, human in the loop | High: reasoning via API, not local LLM weights
Background | Ollama embedding, log summaries | L2 | Deferrable, unloadable | Lowest: must yield to burst

In one line: L2 is the only Stack layer that can be kicked off the machine — not because it is unimportant, but because its jobs are mostly async and retryable.

Memory budget: what each workload costs

Below, Ollama / Claude Code / GitHub Runner are example components. Numbers come from M4 Mac mini measurements (steady state, not compile spikes):

Component | Layer | Typical use | Peak notes
macOS + system cache | L0 | 3–4 GB | Relatively stable
Claude Code workspace | L3 | 1–3 GB | Reasoning via API, no model weights
GitHub Runner job | L1 | 2–6 GB (steady) | Link stage +4–8 GB instant
Ollama · qwen3:8b | L2 | 5–7 GB | Released with ollama stop
Ollama · qwen3:14b | L2 | 9–13 GB | With burst, Swap 2GB+ is easy
Ollama · nomic-embed-text | L2 | 0.3–0.8 GB | Light background you can keep by day

Rough daytime math on 24GB: “coding + CI + 8B resident” ≈ 17 GB at steady state, which still leaves headroom until a link-stage spike (+4–8 GB) lands on top of it. Add 14B resident and you blow past 22 GB fast. The budget table answers “can we”; scheduling answers “who should hold memory when.”
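The arithmetic behind that paragraph can be checked with a few lines of shell. This is a minimal sketch, not the author's calculation: `budget_range` and `verdict` are hypothetical helper names, ranges are written with an ASCII hyphen and whole GB (so the sub-1 GB embedding model is left out), and the figures are the low/high ends from the budget table.

```shell
#!/usr/bin/env bash
# Sum low/high memory estimates (whole GB) for co-resident workloads.

# budget_range "LOW-HIGH" ... -> prints "LOW_TOTAL HIGH_TOTAL"
budget_range() {
  local low=0 high=0 range
  for range in "$@"; do
    low=$(( low + ${range%-*} ))     # text before the hyphen
    high=$(( high + ${range#*-} ))   # text after the hyphen
  done
  echo "$low $high"
}

# verdict RAM_GB LOW HIGH -> fits | tight | over
#   fits  : even the high estimate stays within RAM
#   tight : only the low estimate stays within RAM
#   over  : even the low estimate exceeds RAM
verdict() {
  local ram=$1 low=$2 high=$3
  if   [ "$high" -le "$ram" ]; then echo fits
  elif [ "$low"  -le "$ram" ]; then echo tight
  else echo over
  fi
}
```

With the table's numbers, `budget_range 3-4 1-3 2-6 5-7` (macOS, Claude Code, Runner steady state, 8B) gives `11 20`, which `verdict 24 11 20` reports as `fits`. Adding the link-stage spike (`4-8`) gives `15 28`, and `verdict 24 15 28` reports `tight`: the steady state fits, the burst is what pushes the host toward Swap.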

Scheduling model: time windows to memory_pressure thresholds

Start with a day/night table (mode ② below). The more operable upgrade is threshold scheduling on memory pressure — you do not rely on someone remembering “22:00 is batch time”:

AI Workload Scheduler · L2-Q03 recommended rules

When memory_pressure enters warn / critical (or equivalent > ~70%):
  → auto ollama stop qwen3:8b / qwen3:14b

When memory_pressure is normal (< ~50%) and idle > 10 min:
  → auto preload nomic-embed-text (keepalive 10m)

When CI event trigger (Runner job start):
  → force CI mode: stop all large Ollama models; priority L1 > L3 > L2

When CI job succeeds and memory recovers:
  → async L2 batch (log summary / embedding rebuild)

Time windows are the baseline; threshold scheduling is the upgrade. Small teams can copy the runbook + CI stop first; add a memory_pressure guard script once stable (see Runbook).
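The rules above reduce to one small decision function. The sketch below is an illustration under stated assumptions: `decide_action` and its action names are hypothetical, the inputs (used-memory percentage, idle minutes, a CI flag) must be supplied by whatever probe you trust, and the 70 / 50 / 10 cut-offs are the approximate values from the rules, not tuned constants.

```shell
#!/usr/bin/env bash
# decide_action USED_PCT IDLE_MIN CI_ACTIVE -> prints one action name
#   USED_PCT  : memory in use, integer percent
#   IDLE_MIN  : minutes since the last interactive or CI activity
#   CI_ACTIVE : 1 while a Runner job is starting or running, else 0
decide_action() {
  local used_pct=$1 idle_min=$2 ci_active=$3
  if [ "$ci_active" = "1" ]; then
    echo stop-large-models      # CI event wins: L1 > L3 > L2
  elif [ "$used_pct" -gt 70 ]; then
    echo stop-large-models      # warn / critical territory
  elif [ "$used_pct" -lt 50 ] && [ "$idle_min" -gt 10 ]; then
    echo preload-embedding      # normal pressure and idle long enough
  else
    echo hold                   # between thresholds: change nothing
  fi
}
```

The gap between 50% and 70% is deliberate: it acts as a dead band, so a host hovering near one threshold does not load and unload models on every check.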

Three baseline schedule modes

Use these with threshold rules:

Mode | Approach | Best for
① Light coexistence | Daytime: only nomic-embed-text resident; load 8B/14B on demand | 16GB Mac mini, coding-first
② Time split | 09–18 coding+CI / 22–06 nightly batch | 24GB, scheduled embedding / log jobs
③ CI yield | ollama stop before job; async L2 after | Frequent push, high xcodebuild peaks

Recommended combo: ① + ③ for everyone by default; ② for nightly batch; threshold guard as safety net.
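For mode ②, a batch job can check which window it is in before loading a large model. A minimal sketch, assuming the 09–18 and 22–06 windows from the table; `window_for_hour` and the `shoulder` label for the hours in between are hypothetical names.

```shell
#!/usr/bin/env bash
# window_for_hour HOUR -> day | night | shoulder
# HOUR is 0-23 and may be zero-padded, as printed by `date +%H`.
window_for_hour() {
  local h=$(( 10#$1 ))   # force base 10 so "08" and "09" are not read as octal
  if   [ "$h" -ge 9 ]  && [ "$h" -lt 18 ]; then echo day      # coding + CI
  elif [ "$h" -ge 22 ] || [ "$h" -lt 6 ];  then echo night    # nightly batch
  else echo shoulder                                          # neither window
  fi
}
```

A nightly script could start with `[ "$(window_for_hour "$(date +%H)")" = night ] || exit 0`, so a manual run at 14:00 does not load 8B in the middle of the CI window.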

Pipeline split: what runs on L2 vs L3

Pin the pipeline before schedules — otherwise Ollama still idles (see L2-Q01 · typical misjudgment):

Task | Layer | Scheduling note
Edit repo, generate patches | L3 Claude Code (API) | No on-machine large model
Build, test, archive | L1 Runner | Burst highest priority; stop L2 before peak
CI failure log summary | L2 qwen3:8b | Async after job, or nightly batch
CodeGraph / RAG embedding | L2 nomic-embed-text | Can stay resident by day (<1GB)

Runner peaks and CI staggering

L1 Runner produces Fact — inference cannot replace build results. Minimum CI-side change:

# .github/workflows/ios.yml · self-hosted macOS runner
- name: Enter CI mode — free memory for xcodebuild
  run: |
    ollama ps || true   # informational only; do not fail the step if the server is down
    ollama stop qwen3:8b 2>/dev/null || true
    ollama stop qwen3:14b 2>/dev/null || true
    sleep 30   # wait for memory reclaim; do not start xcodebuild same second

- name: Build
  run: xcodebuild ...

Measured: on 24GB M4, ollama stop qwen3:8b frees 5–7 GB in about 5–15 seconds. If Swap already happened, full reclaim can take minutes — so CI stop must run at least 30 seconds before the build.
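A fixed `sleep 30` is the simplest option. If reclaim time varies on your host, one alternative is to poll until enough memory is free, with a timeout. The sketch below is an assumption-laden illustration: `parse_free_pct`, `free_pct`, and `wait_for_free` are hypothetical helpers, and the parser assumes the macOS `memory_pressure` report contains a line of the form `System-wide memory free percentage: N%`.

```shell
#!/usr/bin/env bash
# parse_free_pct: read a memory_pressure report on stdin, print N from the
# "System-wide memory free percentage: N%" line (assumed report format).
parse_free_pct() {
  awk '/free percentage/ { gsub(/%/, ""); print $NF }'
}

# free_pct: probe used by wait_for_free; override it on non-macOS hosts.
free_pct() {
  memory_pressure 2>/dev/null | parse_free_pct
}

# wait_for_free MIN_FREE_PCT TIMEOUT_S [INTERVAL_S]
# Returns 0 once free_pct reports at least MIN_FREE_PCT,
# or 1 if TIMEOUT_S seconds pass first.
wait_for_free() {
  local min_free=$1 timeout=$2 interval=${3:-5} waited=0 pct
  while :; do
    pct=$(free_pct)
    if [ -n "$pct" ] && [ "$pct" -ge "$min_free" ]; then
      return 0
    fi
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

In a workflow step this would follow the `ollama stop` lines, for example `wait_for_free 40 120 || echo "memory still tight, building anyway"`. The 40% and 120 s values are placeholders, not measurements.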

Runbook: 30-second version and full scripts

30-second version (most people only need these three blocks)

Skip the full script? Copy the three blocks below — they cover ~80% of cases:

① Before CI (highest leverage) — in GitHub Actions or a Runner hook:

ollama stop qwen3:8b
ollama stop qwen3:14b
sleep 30

② Daytime — keep only light embedding, no large model weights:

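# Preload through the embed API: `ollama run` may reject embedding-only models on some versions
curl -s http://127.0.0.1:11434/api/embed -d '{"model":"nomic-embed-text","input":"warmup","keep_alive":"30m"}' >/dev/null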

③ Night — load 8B in the batch window:

ollama run qwen3:8b ""   # empty prompt: load the model and return instead of opening a chat
# then run your log summary / embedding rebuild script

Full version (production runbook)

For LaunchAgent / cron / multi-environment reuse, save as ~/bin/cloud-mac-stack-runbook.sh:

Full runbook · subcommands

day-start · ci-pre · ci-post · night-batch

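#!/usr/bin/env bash
# cloud-mac-stack-runbook.sh: L2-Q03 standard runbook
set -euo pipefail

# Assumes OLLAMA_HOST is a bare host:port (no scheme), as in the default below.
export OLLAMA_HOST="${OLLAMA_HOST:-127.0.0.1:11434}"
# Only takes effect when this script is the one that starts `ollama serve`.
export OLLAMA_MAX_LOADED_MODELS=1

ollama_up() {
  curl -sf "http://${OLLAMA_HOST}/api/tags" >/dev/null
}

ensure_ollama() {
  # Start the server only if it is not answering, then wait until it is.
  ollama_up && return 0
  nohup ollama serve >/dev/null 2>&1 &
  for _ in 1 2 3 4 5 6 7 8 9 10; do
    ollama_up && return 0
    sleep 1
  done
  echo "ollama did not become ready on ${OLLAMA_HOST}" >&2
  return 1
}

stop_large_models() {
  ollama stop qwen3:8b 2>/dev/null || true
  ollama stop qwen3:14b 2>/dev/null || true
}

preload_embedding() {
  # $1 = keep-alive duration. Uses the embed API because `ollama run` may
  # reject embedding-only models on some versions.
  curl -sf "http://${OLLAMA_HOST}/api/embed" \
    -d "{\"model\":\"nomic-embed-text\",\"input\":\"warmup\",\"keep_alive\":\"$1\"}" >/dev/null
}

ci_pre() {
  # Before CI: force CI mode, L1 wins
  ollama ps || true
  stop_large_models
  sleep 30
}

ci_post() {
  # After CI: restore light embedding (lowest background tier)
  ensure_ollama
  preload_embedding 10m
}

day_start() {
  # Day login / boot: embedding only or full stop
  ensure_ollama
  stop_large_models
  preload_embedding 30m
}

night_batch() {
  # Nightly batch (cron: 0 22 * * *)
  ensure_ollama
  # Empty prompt: load the model and return instead of opening a chat session
  ollama run qwen3:8b "" --keepalive 6h
  # ./your-log-summary-or-embed-rebuild.sh
}

case "${1:-}" in
  day-start)   day_start ;;
  ci-pre)      ci_pre ;;
  ci-post)     ci_post ;;
  night-batch) night_batch ;;
  *) echo "Usage: $0 {day-start|ci-pre|ci-post|night-batch}"; exit 1 ;;
esac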

Memory guard (cron every 5 minutes, or LaunchAgent): when system memory pressure rises, unload large models automatically — minimal threshold scheduling.

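#!/usr/bin/env bash
# memory-guard.sh: memory_pressure threshold guard
# cron runs with a minimal PATH; the Homebrew prefixes below are assumptions.
export PATH="/opt/homebrew/bin:/usr/local/bin:/usr/sbin:/usr/bin:/bin"

# `memory_pressure` prints a statistics report, not a warn/critical keyword,
# so read the kernel pressure level (1 normal, 2 warn, 4 critical) and the
# "System-wide memory free percentage" line from the report.
LEVEL=$(sysctl -n kern.memorystatus_vm_pressure_level 2>/dev/null || echo 1)
FREE_PCT=$(memory_pressure 2>/dev/null | awk '/free percentage/ { gsub(/%/, ""); print $NF }')
FREE_PCT="${FREE_PCT:-100}"

# Free below ~30% corresponds to the "> ~70% used" rule above.
if [ "$LEVEL" -ge 2 ] || [ "$FREE_PCT" -lt 30 ]; then
  logger -t cloud-mac-stack "memory guard: stopping Ollama 8B/14B (level=$LEVEL free=${FREE_PCT}%)"
  ollama stop qwen3:8b 2>/dev/null || true
  ollama stop qwen3:14b 2>/dev/null || true
fi

# Optional: when pressure normal and no Runner job, restore embedding
# if [ "$LEVEL" -eq 1 ]; then ... fi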
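The optional restore branch needs to know whether a Runner job is in flight. One way to sketch that check, under a loud assumption: the self-hosted runner is taken to spawn a process whose command line contains `Runner.Worker` while a job executes. Verify this against `ps` on your own host before relying on it. `runner_busy` and `should_restore` are hypothetical names.

```shell
#!/usr/bin/env bash
# runner_busy: exit 0 while a job appears to be running.
# Assumption: a process matching "Runner.Worker" exists only during a job.
runner_busy() {
  pgrep -f 'Runner.Worker' >/dev/null 2>&1
}

# should_restore LEVEL -> exit 0 when it looks safe to preload the embedding model.
# LEVEL is a pressure label: normal, warn, or critical.
should_restore() {
  local level=$1
  [ "$level" = "normal" ] && ! runner_busy
}
```

The guard would then call `should_restore normal && <preload command>`, so the embedding model is never reloaded underneath a build.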

Wiring examples:

  • GitHub Actions: first step ci-pre, last step ci-post
  • LaunchAgent: day-start on login; cron: night-batch at 22:00
  • cron: */5 * * * * /path/memory-guard.sh
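The two cron entries can be generated from the script paths so they stay in sync across hosts. A minimal sketch; `print_cron_entries` is a hypothetical helper and both paths are whatever you pass in.

```shell
#!/usr/bin/env bash
# print_cron_entries RUNBOOK_PATH GUARD_PATH
# Prints two crontab lines: night-batch at 22:00 and the guard every 5 minutes.
print_cron_entries() {
  local runbook=$1 guard=$2
  printf '0 22 * * * %s night-batch\n' "$runbook"
  printf '*/5 * * * * %s\n' "$guard"
}
```

To install, append the output to the current crontab, for example `( crontab -l 2>/dev/null; print_cron_entries "$HOME/bin/cloud-mac-stack-runbook.sh" "$HOME/bin/memory-guard.sh" ) | crontab -`. Running that twice adds the lines twice, so check `crontab -l` first.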

How to schedule a 16GB Mac mini

  • No 14B resident; 8B only in night-batch
  • Daytime runbook: day-start only (embedding or full stop)
  • Every CI run must call ci-pre, no exceptions
  • If your default is “desktop + 8B + Claude Code all online,” choose 24GB; scheduling cannot fix the hardware floor

16GB rule of thumb: no large models by day, one model one job at night, always ci-pre before CI. 24GB rule of thumb: embedding OK by day, stagger 8B with CI, 14B nights only.

Decision table: which strategy fits you

Your situation | Recommendation
24GB · few pushes · mainly Claude Code | day-start + embedding; no nightly batch
24GB · daily CI · want log summaries | ci-pre/post + night-batch + memory guard
16GB · must run local 8B | night-batch only; Claude API by day
Ollama still zero calls for a week | Back to L2-Q01 and define the pipeline first

Series placement · Cloud Mac AI Stack

L2-Q03 · Memory Scheduling Layer: for outside readers it answers “how do I avoid Swap on Mac mini”; within the series it extends the L2-Q01 private inference layer with same-host scheduling:

  • L2-Q01 — what Inference is (placement)
  • L2-Q03 · this article — Memory Scheduling Layer (same-machine scheduling)
  • Planned — model pin, port 11434 health checks, CI-side Ollama calls
  • Downstream — L4 Context + MCP scheduling (tomorrow L4-Q03)

Not an Ollama tutorial or pure CI tuning piece — it is the first layer of an AI workload scheduler on Apple Silicon.

FAQ

Will Ollama, Claude Code, and GitHub Actions together cause Swap on Mac mini?
Yes, if burst, interactive, and background workloads have no priority. See the real issue and bad schedule.

Does Ollama need to run all the time?
No. Treat local inference as schedulable: keep only nomic-embed-text by day; load large models on demand or in nightly batch.

Should I stop Ollama during CI builds?
Yes, recommended. See 30-second runbook · before CI, or full ci-pre.

Can Claude Code and Ollama run at the same time?
Yes. Coding uses the API; on-machine contention is model weights and CI peaks. Use stop-before-CI or time windows.

On Cloud Mac, is OOM just “not enough RAM”?
In setups we observe, most OOM traces to scheduling, not absolute shortage. See L2-Q02 measurements for the 16GB hard floor.

Time windows vs memory_pressure thresholds—which should I use?
Start with the 30-second runbook; add memory-guard.sh once stable (full runbook above).

How is this different from L2-Q01?
Q01 defines the Inference Service role; Q03 (Memory Scheduling Layer) explains how to schedule AI workloads on one machine.

Cloud Mac AI Stack · coming tomorrow

Claude Code + MCP: wiring GitHub, local files, and APIs into one chain

Once L2 scheduling is pinned, the next layer is L4 Context: how MCP connects GitHub, repo files, and APIs into the Claude Code workflow.
