M4 / M5 Apple Silicon Is Becoming an AI Compute Platform—Not Just a Faster Chip

Apple Silicon is moving from "personal computer" to "schedulable AI node." The M4/M5 shift is not about Geekbench—it is about how workloads stack. On one Mac mini, Ollama, Claude Code, and a GitHub Runner compete for the same unified memory pool; whoever fills it first decides whether the machine feels fast or stuck.

On M4 we keep seeing the same pattern: once memory starts swapping, Ollama drops from ~37 to ~34 tok/s and a self-hosted xcodebuild test run drifts from 12 to 19 minutes, often with the CPU not maxed out but the memory pressure bar already yellow. Below are three sizing questions and a simple pressure estimate to help you choose between upgrading to M4, waiting for M5, or renting a Cloud Mac. Deeper benchmarks and rental scenarios are covered in dedicated posts.

One diagram: how AI workloads crush unified memory

Human input (commit, Run, open PR)
  ↓ triggers
Interaction: IDE / Claude Code on the local Mac; memory spikes
  ↓ stacks
Execution: Runner / CI; xcodebuild burst, +4–8GB
  ↓ stacks
LLM background: Ollama resident 7B–14B; embeddings stay loaded
  ↓ feeds into
Unified memory: one pool shared by CPU / GPU / NPU; the bottleneck is here
  ↓ exceeds headroom
Swap: the degradation signal; not a "too slow CPU" but failed memory scheduling
  ↓ shows up as
tok/s ↓ · CI wall time ↑ (e.g. 37→34 tok/s · 12→19 min)

Healthy path (scheduling / split nodes)

You → IDE coding
Runner on Cloud Mac for CI
Ollama overnight or on another machine
Headroom left → OK

Degradation path (all three layers online)

Resident LLM ↑
Runner burst ↑
Memory use ↑
Swap ↑
Slower CI · slower generation

Core idea: performance pain is often a memory scheduling problem—not raw compute. Every step pours into the same pool.

Left: how events push unified memory past the cliff; right: scheduled vs unscheduled. The upgrade pressure formula below measures whether this chain has entered swap.

The three sizing questions below are really asking: has this chain already started swapping?

Start from your question

This post covers how to choose. Jump directly if you already know your bottleneck:

You are asking…	Read this
M4/M5 generational change, upgrade timing, workload split	This post
How fast is Ollama 7B/14B? How much does swap hurt tok/s?	M4 Mac mini Ollama benchmark · 16GB vs 24GB
Ollama + Runner together feel sluggish—how to schedule?	AI workload scheduling runbook
Rent Cloud Mac to validate, or wait for M5 / buy hardware?	Cloud Mac vs waiting for M5 (6/9 release) · Cloud Mac vs local Mac

34→37 tok/s (16GB with swap vs 24GB with zero swap)

12→19 minutes (Runner slowed by swap)

1.1GB swap (qwen3:8b resident · 16GB)

What M4 changed: not a faster Mac—a node that can run AI tasks all day

M4 is not "CPU a bit faster"—it is the first Mac mini that can keep local inference loaded in a normal dev desktop. ~38 TOPS Neural Engine shares unified memory with CPU/GPU—Chrome + VS Code + resident qwen3:8b is already daily life (see 16GB vs 24GB numbers).

You can verify this in the OS: memory_pressure, the Activity Monitor swap curve, and the Ollama footprint. Together they answer whether this machine can hold CI peaks and a resident LLM at once, not which "M" badge it wears.

For engineers the practical question moved from "does IDE lag?" to three measurable signals: Ollama tok/s, swap appearing or not, CI wall time drifting.

Three sizing questions (skip Geekbench alone)

Treating M1→M5 as a benchmark ladder buys the wrong box. Each question maps to the causal chain—and each asks: has swap shown up yet?

Axis	Question	Verify on M4
Compute	Is tok/s enough?	16GB with swap ~34 tok/s; 24GB zero swap ~37 (same model/script)
Memory	Does swap trigger?	16GB resident 8B: 1.1GB swap, yellow pressure; 24GB: 0 swap, green
Parallelism	Can Runner and LLM run together?	`xcodebuild` burst +4–8GB; stacked with Ollama → swap (see scheduling runbook)

The generational gap between M4 and M5 is really about when swap appears, not about an abstract "faster." Enough tok/s with frequent swap still feels like a slow machine; enough RAM plus sane scheduling turns the same chip into a stable AI node.
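To make the Parallelism row concrete, here is a minimal Python sketch of the stacking arithmetic. Only the +4–8GB xcodebuild burst comes from the table above; the baseline and model footprints are hypothetical placeholders, to be replaced with your own numbers from Activity Monitor and ollama ps.

```python
def headroom_gb(total_gb, baseline_gb, model_gb, runner_burst_gb):
    """Unified memory left once all three layers are stacked.

    A negative result means the stack exceeds physical RAM, so swap is expected.
    """
    return total_gb - (baseline_gb + model_gb + runner_burst_gb)


# Hypothetical footprints (assumptions, not measurements from this post).
BASELINE_GB = 7.0  # macOS + IDE + browser
MODEL_GB = 6.0     # one resident 8B model

for total in (16, 24):
    # The table above gives the xcodebuild burst as +4 to +8 GB.
    best = headroom_gb(total, BASELINE_GB, MODEL_GB, runner_burst_gb=4)
    worst = headroom_gb(total, BASELINE_GB, MODEL_GB, runner_burst_gb=8)
    print(f"{total}GB: headroom {worst:+.1f} to {best:+.1f} GB")
```

Treat the result as an optimistic bound: macOS can begin swapping before headroom reaches zero, so the measured swap and pressure color remain the deciding signals.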

Should you upgrade? A simple pressure estimate

Plug in your measured values (rating impact on a rough 1–5 scale is fine):

upgrade pressure ≈

  ( how often swap appears × impact on CI slowdown )
+ ( resident models at once × memory each uses )
− ( headroom you still have )

What it means: the formula targets the bottom of the causal chain—once unified memory is swap-bound, every layer slows. Clearly > 0 means RAM and scheduling, not another 16GB M4 or waiting for M5, will fix Runner drift.

How to read the result:

Clearly > 0 — add headroom first: 24GB RAM, stop Ollama before CI, or add Cloud Mac to split Runner and inference.
Near 0 — hold steady; log numbers in team docs; re-check in a few weeks.
< 0 but tok/s still low — likely pure compute bound; watch M5 numbers—but do not assume "next gen fixes it" while swap is still nonzero.

Our site data: 16GB with 1.1GB swap, Runner 12→19 min → pressure clearly > 0; another M4 16GB is not enough—need 24GB or scheduling (ollama stop before CI). 24GB same scene: 37 tok/s, 0 swap → near 0 unless you add a second 14B resident model.
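The estimate can be written as a few lines of Python. This is a rough sketch, not a calibrated model: it assumes every term is rated on the same rough 0–5 scale (0 meaning "never" or "none"), and the band used for "near 0" is an arbitrary choice. The ratings for the two scenes are hypothetical illustrations of the site data, not measured values.

```python
def upgrade_pressure(swap_frequency, ci_impact, resident_models,
                     memory_per_model, headroom):
    """(swap frequency x CI impact) + (resident models x memory each) - headroom."""
    return swap_frequency * ci_impact + resident_models * memory_per_model - headroom


def read_result(score, band=2):
    """Map a score to a reading; `band` is an assumed width for "near 0"."""
    if score > band:
        return "add headroom first"
    if score >= -band:
        return "hold steady"
    return "below zero: if tok/s is still low, suspect compute"


# Hypothetical 0-5 ratings for the 16GB and 24GB scenes.
scene_16gb = upgrade_pressure(swap_frequency=4, ci_impact=4,
                              resident_models=1, memory_per_model=3, headroom=1)
scene_24gb = upgrade_pressure(swap_frequency=0, ci_impact=0,
                              resident_models=1, memory_per_model=3, headroom=3)
print(scene_16gb, read_result(scene_16gb))
print(scene_24gb, read_result(scene_24gb))
```

Adding a second resident model to the 24GB scene (resident_models=2) pushes the score back above the band, which matches the caveat about a second 14B model.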

How to split local Mac and Cloud Mac

Cloud Mac is not remote desktop—it is a macOS node for 24/7 builds and inference. Match the healthy path on the diagram: interaction stays local; execution and LLM background move to cloud or off-peak slots:

Where	Runs on	Typical tasks
Local Mac	Laptop / desktop	Code, review, Claude Code human-in-the-loop
Cloud Mac	Dedicated Mac mini, 24/7	Self-hosted GitHub Runner, Xcode build, signing, TestFlight
Cloud Mac or off-peak	Night / dedicated node	Ollama / MLX inference, embedding batches

Local owns "human in the loop"; cloud owns "keeps running after the lid closes." Rental vs purchase: Mac mini for AI dev: why Cloud Mac beats waiting for M5 (6/9 release); Ollama as a long-running service: Ollama on Cloud Mac.

30-second self-check on your Mac

Run on the machine you are evaluating—record output so the team aligns on facts, not vibes:

# Chip and unified memory
sysctl -n machdep.cpu.brand_string
system_profiler SPHardwareDataType | grep "Memory:"

# Swap and Ollama footprint
ollama ps
memory_pressure
sysctl vm.swapusage        # swap currently in use
vm_stat | grep "Pageouts"  # cumulative pageouts since boot

# Runner latency (CI log or local timer)
# xcodebuild test wall time: 12 min before swap → 19 min after (same repo)

Optional tok/s baseline (same script as 16GB vs 24GB post):

python3 -m mlx_lm.generate \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Summarize Apple Silicon unified memory in 3 bullets." \
  --max-tokens 128
# Record: tok/s, Memory Used, Swap Used

If Pageouts climb while Ollama stays resident and Runner wall time drifts >30%, fix scheduling and RAM tier before blaming chip generation.
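The rule above can be checked mechanically. The sketch below assumes you capture vm_stat output before and after a CI run and note both wall times; the function names and the sample counters are illustrative, not taken from a real log.

```python
import re


def parse_pageouts(vm_stat_output):
    """Return the cumulative Pageouts counter from `vm_stat` text output."""
    match = re.search(r"^Pageouts:\s+(\d+)", vm_stat_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no Pageouts line found")
    return int(match.group(1))


def wall_time_drift_pct(baseline_min, current_min):
    """Percentage increase of the current wall time over the baseline."""
    return (current_min - baseline_min) / baseline_min * 100.0


def should_fix_scheduling(pageouts_before, pageouts_after,
                          baseline_min, current_min, threshold_pct=30.0):
    """True when pageouts climbed and CI wall time drifted past the threshold."""
    climbing = pageouts_after > pageouts_before
    return climbing and wall_time_drift_pct(baseline_min, current_min) > threshold_pct


# Illustrative vm_stat excerpts (made-up counters).
before = "Pageins:                                  100.\nPageouts:                                1200.\n"
after = "Pageins:                                  180.\nPageouts:                                9800.\n"

print(should_fix_scheduling(parse_pageouts(before), parse_pageouts(after),
                            baseline_min=12, current_min=19))
```

With the 12→19 minute drift from this post the wall-time term is about 58%, well past the 30% threshold, so the check hinges on whether pageouts rose during the run.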

Is M5 worth waiting for?

M5 is not in mainstream stock yet, so do not treat it as "buy once, solved forever." The realistic bet is that the industry is moving toward larger unified memory and higher bandwidth, which may delay swap. But default RAM tiers and any gains on your own scripts still have to be confirmed by re-running the same commands on shipping hardware.

Until M5 numbers exist, keep deciding on M4 tok/s, swap, and Runner time. CI boxes often lag laptops by 1–2 chip generations—renting or buying M4 for AI dev in 2026–2027 remains pragmatic (cost vs GPU cloud: M4 vs GPU cloud).

Pitfall: enough speed, bad scheduling

A SaaS team ran Claude Code + self-hosted Runner on M2 16GB, upgraded to M4 16GB expecting "one generation = stable." When nightly Ollama embedding jobs kicked in, xcodebuild test drifted 12→19 minutes—Activity Monitor showed yellow memory pressure + sustained swap while CPU stayed moderate.

Remember this

The chip was not too slow—the workloads were not scheduled. M4 does not auto-stop Ollama before CI or invent extra RAM.

Root cause: treating "upgrade to M4" as "got more memory." Fix: 24GB, or split Ollama and Runner across machines / time slots (parallel scheduling post). When M5 ships, teams that only compare CPU generations without scheduling will hit the same wall.
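One way to put the time-slot fix into practice is a small pre-CI step that unloads whatever Ollama has resident before the test run starts. The sketch below only builds the command list and runs nothing; it assumes the first column of ollama ps output is the model name, and the sample output and the App scheme are hypothetical.

```python
def loaded_models(ollama_ps_output):
    """Model names from `ollama ps` text: first column of every row after the header."""
    rows = [line.split() for line in ollama_ps_output.strip().splitlines()[1:]]
    return [row[0] for row in rows if row]


def pre_ci_plan(ollama_ps_output, ci_command):
    """Commands to run in order: stop each loaded model, then run CI."""
    plan = [["ollama", "stop", name] for name in loaded_models(ollama_ps_output)]
    plan.append(list(ci_command))
    return plan


# Illustrative `ollama ps` output (made-up ID and size).
SAMPLE_PS = """NAME        ID              SIZE      PROCESSOR    UNTIL
qwen3:8b    0123456789ab    6.0 GB    100% GPU     4 minutes from now
"""

# "App" is a hypothetical scheme name.
plan = pre_ci_plan(SAMPLE_PS, ["xcodebuild", "test", "-scheme", "App"])
for command in plan:
    print(" ".join(command))
```

In a real Runner job each entry could be passed to subprocess.run(command, check=True); keeping the plan as data makes it easy to log what was stopped before the build.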

FAQ

Upgrade to M4 or wait for M5? Check swap and Runner first. Frequent swap or CI drift → 24GB, scheduling, or Cloud Mac. Zero swap but generation still slow → watch M5 hardware. Do not substitute Geekbench for these checks.

Is Mac mini good for AI dev? Yes for 7B–14B local inference, Core ML, agents + CI. 70B training still belongs on GPU cloud.

Cloud Mac vs buying hardware? Hardware for daily coding; Cloud Mac for 24/7 Runner, nightly batch inference, and "run the pipeline before choosing RAM tier"—daily rental often beats waiting for a keynote.

M4 Ollama: 7B / 14B tok/s and swap
16GB vs 24GB one-week test
Ollama + Runner parallel scheduling
AI dev: Cloud Mac vs waiting for M5 (6/9 release)

ZavCloud

Measure swap and CI time before upgrading or renting

Dedicated Mac mini M4, native macOS, static IPv4—run the same self-check locally or in the cloud, then choose hardware or daily rental.
