M4 Mac mini 16GB or 24GB? One week on Ollama — I picked the wrong RAM

AI Notes  ·  2026.06.01  ·  ~15 min read  ·  includes one-week benchmark data

M4 Mac mini desktop running Ollama and Qwen3, comparing 16GB vs 24GB unified memory for local AI

I thought 16GB plus an external SSD would be enough for Ollama on a budget. Then I used the Mac mini as a real daily driver — Chrome with about twenty tabs, VS Code, Slack or Teams, and qwen3:8b always loaded. Within a week, Activity Monitor showed yellow memory pressure and swap parked around 1GB. A colleague’s 24GB box, same model, same scripts: green pressure bar.

This is not a spec-sheet repost. It is a seven-day A/B on two M4 Mac minis (16GB vs 24GB): what we measured, how we measured it, and where the numbers came from. I also map Qwen3, DeepSeek R1, and Gemma 3 for 2026, and what happens when you stack OpenHuman, Claude Code, or MLX on top of a “normal” dev desktop.

13.2
GB used (16GB · 8B)
0
GB swap (24GB · 8B)
34→37
tok/s (low vs no swap)

One-week test: same scenario with qwen3:8b

Hardware: Mac mini M4 (10-core CPU / 10-core GPU), one 16GB unit and one 24GB unit, matched OS and app versions. Window: 2026.05.26–06.01, two hours per day of mixed “dev + chat” load; each metric is the median of three consecutive readings.

Shared environment (real desktop, not a bare Ollama bench):

  • macOS 16 (26.x beta channel, same build on both machines)
  • Ollama 0.12.3 (ollama --version)
  • Google Chrome: 20 tabs (Notion, GitHub, Gmail)
  • Visual Studio Code
  • Slack or Microsoft Teams client left open

Inference setup (read numbers after 5 minutes steady state):

Primary test commands
ollama pull qwen3:8b
ollama run qwen3:8b
# In another terminal: feed a ~512-token prompt continuously; read Memory / Swap after 2 min generation

M4 Mac mini 16GB — measured

Metric Value Notes
Memory Used 13.2 GB Activity Monitor → Memory → Used
Swap Used 1.1 GB Yellow memory pressure; fan spikes occasionally
Generation speed 34 tok/s See ollama run --verbose below
Subjective feel Chrome scrolling stutters when switching back Correlates with high swap, not constant
Activity Monitor memory panel on 16GB M4 Mac mini under qwen3:8b load
Figure 1: 16GB machine memory panel (2026.05.29, qwen3:8b steady state + Chrome / VS Code / Slack). 13.2GB used, 1.1GB swap, yellow pressure. Screenshot is illustrative — match your own Activity Monitor readout.

M4 Mac mini 24GB — measured

Metric Value Notes
Memory Used 16.4 GB Same scenario, same model
Swap Used 0 GB Green memory pressure
Generation speed 37 tok/s Similar compute; gap mostly from swap
Headroom ~7.6 GB free Room for nomic-embed or a second 3B sidecar
Activity Monitor memory panel on 24GB M4 Mac mini under qwen3:8b load
Figure 2: 24GB, same scenario. 16.4GB used, 0 swap, green pressure. Raw speed is only ~9% higher, but zero swap keeps multitasking responsive. Illustrative screenshot — verify on your machine.

What “I picked wrong” actually means

16GB can run Ollama. My mistake was assuming my load was a lab single-task bench. If you mostly forward APIs and occasionally ollama run, 16GB is still fair. If your default desktop is browser + IDE + local Qwen3/DeepSeek + an agent, 24GB is insurance.

How we measured: not another AI summary article

Every table number below follows this pipeline. You can reproduce it in an afternoon on one machine:

  1. Weight sizeollama show qwen3:8b --modelfile plus *.gguf file size on disk (disk ≠ resident RAM, but sets a floor).
  2. Resident memory — After the model loads, Activity Monitor → Memory tab: record Memory Used / Swap Used / Memory Pressure (yellow vs green).
  3. Generation speed — Fixed ~512-token input prompt; read eval rate from --verbose:
tok/s measurement (source of 34 / 37 in this article)
ollama run qwen3:8b --verbose \
  "Write about 400 words explaining Apple unified memory. Use three bullet points for pros and cons."
# Average eval rate (tokens/s) from three consecutive runs

vm_stat | awk '/swap/ {print}'
memory_pressure

Not modeled here: different Ollama builds and quant tags (Q4_K_M vs Q5) shift footprint by ~0.5–1.5GB. If you swap in Gemma 3 or DeepSeek R1 tags, pull locally and re-measure. The 14B rows below use the same method.

2026 models on M4 Mac mini — footprint reference

Legend: comfortable with daily multitasking; ⚠️ runs but swap-prone / close apps; not recommended as a daily driver. Assumes Chrome + IDE load from above, not a clean bench.

Model (Ollama tag) 16GB 24GB One-week notes
Qwen3 8B qwen3:8b 16GB: ~1.1GB swap; 24GB: zero swap
DeepSeek R1 8B deepseek-r1:8b Weights ~5.2GB; curve close to Qwen3 8B
Qwen3 14B qwen3:14b ⚠️ 16GB: 2.3GB+ swap — see next section
Gemma 3 27B (quant) ⚠️ 24GB experiment only; longer context → OOM
Llama 3.2 3B Good embed / RAG sidecar

If you run the same weights through MLX, average footprint is similar to Ollama, but peaks are sharper — watch a 5-minute steady state in Activity Monitor, not just parameter count on the model card.

More benchmarks: 14B and DeepSeek (same environment)

Chrome, VS Code, and Slack/Teams stayed open; only the Ollama model changed:

Model RAM Memory used Swap tok/s
deepseek-r1:8b 16GB 13.5 GB 1.0 GB 33
deepseek-r1:8b 24GB 16.6 GB 0 36
qwen3:14b 16GB 15.8 GB 2.3 GB 18
qwen3:14b 24GB 19.1 GB 0 28

Bottom line: if Qwen3 or DeepSeek 14B is your daily model, 16GB fights swap constantly; 24GB keeps the browser and IDE open without a “close everything” ritual.

Theory vs measurement: where “14B Q4 ≈ 8–10GB” comes from

Blog posts that only cite “8–10GB” without sources read like LLM summaries. Here is the stack, aligned to qwen3:14b above:

  • Weight files — After ollama pull qwen3:14b, local GGUF is about 8.4–9.2GB depending on quant tag. Loaded resident RAM is usually slightly below file size (mmap + shared pages), not an order of magnitude smaller.
  • KV cache — At 8k context and batch=1, expect another 1–3GB; raising OLLAMA_CONTEXT_LENGTH pushes 16GB machines over the edge faster.
  • Measured total — Same desktop, qwen3:14b on 24GB: 19.1GB used = model-related footprint + KV + Chrome/IDE/system (~5–6GB). Model-only back-of-envelope is ~13–14GB, matching “8–10GB weights + a few GB runtime.”

So 14B Q4 can run, but not alongside a full dev desktop by default — upgrade to 24GB, or close tabs, shrink context, and move RAG embed jobs elsewhere.

Stacking Claude Code and OpenHuman: budget extra RAM

During the week I also tried two common 2026 “dev + local model” combos:

  • Claude Code + local Ollama — Terminal agent on Anthropic API, Ollama for offline drafts or sensitive snippets. VS Code / Cursor adds 1–2GB plus Chrome; 16GB leaves almost no 14B headroom.
  • OpenHuman + qwen3:8b — Desktop agent and Memory Tree sync hold ~1GB+ in the background; see our OpenHuman install guide. On 24GB I could keep OAuth sync and 8B without swap.
  • MLX — With Xcode / Core ML pipelines, compile peaks can briefly max memory; good fit for a dedicated Cloud Mac node while a 16GB laptop stays the dev UI.

Large repos with CodeGraph + Claude Code MCP do not fill RAM by themselves, but you will resist closing Chrome — which nudges the machine toward 24GB (see CodeGraph local setup).

Buying decision (from measurements, not the spec sheet)

  • Choose 16GB — Local 8B class only (Qwen3 / DeepSeek R1), OK with swap and occasional tab closing; or cloud Claude/GPT APIs with Ollama as backup.
  • Choose 24GB — Default 14B + browser + IDE + agent; headroom for Gemma 3 27B experiments; planning to keep one machine 3–5 years without RAM as the first bottleneck.

This pairs with GPU cloud cost comparison: RAM tier answers “is one Mac comfortable,” cloud answers “24/7 and static IP.”

M4 Mac mini vs RTX 5060 desktop — how to choose

Most “local AI” searches are really comparing Apple unified memory (Mac mini / Mac Studio) vs NVIDIA VRAM (RTX 5060 build). ZavCloud sells Cloud Mac — dedicated macOS on M4 Mac mini — so this is not “never buy a GPU.” It is where each platform wins.

Mac Studio raises the RAM ceiling (64GB+); AWS / cloud GPU fits 70B full weights, training, and Stable Diffusion batches. The tables assume the same Chrome + IDE desktop as above.

Where M4 Mac mini wins

Scenario M4 Mac mini Why
iOS / macOS development Xcode, TestFlight, device debugging — no RTX substitute
Claude Code / Cursor Terminal agent + local Ollama drafts; unified memory avoids “VRAM OOM”
Local AI (8B–14B text) Qwen3 / DeepSeek R1 8B–14B; 24GB can run 14B with zero swap
OpenHuman / MLX / Core ML Apple-stack inference; see Core ML Cloud Mac
AAA gaming / CUDA training Not the design target

Where RTX 5060 wins

Scenario RTX 5060 PC Why
PC gaming Discrete GPU + Windows ecosystem
Stable Diffusion / ComfyUI CUDA plugins and model hub; Mac works but ecosystem is thinner
70B class models (quant) 12GB VRAM + system RAM stacking; 24GB Mac mini only ⚠️ for 27B trials
Multi-GPU / training Upgrade path to 5070 Ti, dual GPU, or cloud GPU; no CUDA on Mac
App Store shipping Still need a Mac; common pattern: RTX for images + Cloud Mac for signing

Hybrid setup (what we see most among Cloud Mac customers)

Local or Cloud Mac (24GB) for Ollama / Claude Code / iOS; RTX or cloud GPU for SD and 70B. If a 16GB Mac mini is swap-bound, move 14B to a 24GB Cloud Mac for a week before buying more hardware. Billing context: M4 inference vs GPU cloud.

Try before you buy: run your load for a week

If you are still torn between 16GB and 24GB, this sequence worked better for me than guessing from Apple’s configurator:

  1. List software you will not close (tab count, IDE, Slack/Teams, agents).
  2. Run Qwen3 8B and 14B for 30 minutes each with the commands above; log swap and tok/s.
  3. If 14B keeps swap above 1GB, drop 16GB from the short list.

No hardware yet? Rent the same Apple Silicon config in the cloud, deploy your Ollama workflow, knowledge base, and agents, and watch memory curves for a week — usually cheaper than guessing RAM at checkout.

We operate M4 Mac mini Cloud Mac hosts daily; a frequent pattern is 16GB laptop for dev + 24GB cloud for Ollama / OpenHuman always-on. ZavCloud offers dedicated M4 Mac mini instances (native macOS, static IP) for pre-purchase stress tests, not to replace your judgment. Details: Cloud Mac plans.

FAQ

Answers match the measurement tables above so you can reproduce them side by side.

How much RAM does Qwen3 14B need?

With the same desktop load as this article (~20 Chrome tabs, VS Code, Slack/Teams), qwen3:14b steady state is ~19.1GB used on 24GB; on 16GB you see ~15.8GB used plus 2.3GB swap. Model-only: GGUF weights ~8.4–9.2GB plus KV and OS overhead — practical minimum is 24GB unified memory; 16GB only works if you close the browser and lower OLLAMA_CONTEXT_LENGTH.

Can DeepSeek R1 14B run?

deepseek-r1:8b is fine on 16GB (swap ~1GB, 33 tok/s), same class as Qwen3 8B. 14B tier tracks qwen3:14b: 24GB → zero swap (~28 tok/s); 16GB swaps hard and drops toward ~18 tok/s. It runs, but 14B as a daily driver wants 24GB.

Is 16GB becoming obsolete?

Not overnight. In 2026 the default load is browser + IDE + resident 8B + an agent, not occasional ollama run. 16GB still fits 8B inference, API-first workflows, willing tab discipline. If you live in Claude Code / OpenHuman without “close everything” days, 16GB feels like entry tier, not comfort tier.

How many years will 24GB last?

At today’s 8B–14B main models (27B for experiments), ~3–5 years is still the sweet spot for desktop + local agent: room for 14B, an embed model, and dev tools. Beyond that (32B+ resident) plan Mac Studio or cloud GPU — Mac mini RAM is not upgradeable after purchase.

Ollama vs MLX — which uses less memory?

Daily chat: Ollama is easier to reason about. GGUF + resident process footprint matches Activity Monitor in this article. MLX spikes higher when loading, compiling, or running beside Xcode — average may be similar, but swap spikes hurt more. Default Ollama for chat/RAG; MLX for batch jobs or a dedicated node.

Mac mini vs RTX 5060 for local AI?

Text 8B–14B, Claude Code, iOS: M4 Mac mini (24GB recommended). Unified memory and macOS tooling are the moat. Stable Diffusion, 70B quant, gaming: RTX 5060. CUDA ecosystem is not replaceable on Mac mini. Most teams end up with both, not either/or.

Is 16GB enough for Qwen3 8B?

Yes. We measured 13.2GB used, ~1.1GB swap, 34 tok/s with yellow pressure. Accept occasional stutter → 16GB; want zero-swap desktop → 24GB (16.4GB used, 37 tok/s).

How much faster is 24GB for Ollama?

Same qwen3:8b: only ~9% (34 vs 37 tok/s). The win is zero swap, stacking 14B, and a second small model — not 2× raw speed.

Can Gemma 3 27B run on 24GB Mac mini?

⚠️ Heavy quant experiments only; longer context tends to OOM. 8B–14B is the 24GB comfort zone; 27B as a daily driver belongs on RTX or cloud GPU.

Is Mac Studio worth it for AI?

For 8B–14B plus development, M4 Mac mini 24GB is better value. Move to Mac Studio when you need 64GB unified memory, multiple 14B/32B streams, or heavier MLX pipelines — or validate curves on Cloud Mac first.

Not sure which RAM tier to buy?

Run 8B and 14B for 30 minutes each with the commands above and watch swap. No machine yet? Rent a matching M4 Mac mini Cloud Mac (24GB), deploy Ollama / OpenHuman, observe a week, then buy metal — usually cheaper than guessing or buying RTX first.

Can I reproduce your numbers?

Yes. On Ollama 0.12.x with the same Chrome/IDE/Slack scenario, absolute values may differ ±10%, but 16GB swaps, 24GB does not, 14B wants 24GB should hold.

Field notes

Not sure on RAM? Run your load for a week first

Mirror this article: Chrome, IDE, qwen3:8b / qwen3:14b, log swap and memory pressure. For a production-like dedicated macOS environment, ZavCloud M4 Mac mini Cloud Mac is built for pre-purchase validation.

View Cloud Mac plans
Cloud Mac Mac cloud hosting