I Ran Two Local LLMs on a Mini PC at Once — Benchmarks Show Why It’s Pointless

2026-05-28T00:00:00+00:00

If you’ve been following my stuff, you know I’m all about squeezing maximum value out of minimal hardware. Mini PCs, home labs, self-hosted everything. So naturally, when I got my hands on a UM790Pro with 96 GB of DDR5, my first thought was: “Can I run two LLMs simultaneously?”

The answer is yes. But the better question is: should you?

Spoiler: no. And I have the benchmarks to prove it.

The Setup

The UM790Pro is a beast for its size. Here’s what we’re working with:

CPU: AMD Ryzen 9 7940HS
GPU: AMD 780M iGPU (integrated – shares system memory)
RAM: 96 GB DDR5-5600
VRAM Pool: 2 GB dedicated + 46 GB GTT = 48 GB total GPU-accessible memory
Memory Bandwidth: ~80 GB/s (shared between CPU and iGPU)

That last point is the key to everything that follows. On a discrete GPU, the CPU and GPU have their own separate memory buses. On an APU like the 7940HS, the CPU and iGPU drink from the same straw. DDR5-5600 gives you roughly 80 GB/s, and both the CPU cores and the GPU compute units fight over every byte of it.
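For reference, the ceiling behind that ~80 GB/s figure can be worked out from the memory speed. This is a back-of-envelope sketch that assumes a dual-channel layout with a 64-bit bus per channel; the channel layout is an assumption, since the spec list above doesn't state it.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float,
                        bus_width_bits: int = 64,
                        channels: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s for a DDR memory configuration."""
    bytes_per_transfer = bus_width_bits / 8  # 64-bit channel moves 8 bytes per transfer
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# DDR5-5600 means 5600 mega-transfers per second per channel.
peak = peak_bandwidth_gb_s(5600)
print(f"theoretical peak: {peak:.1f} GB/s")  # prints "theoretical peak: 89.6 GB/s"
```

Under those assumptions the theoretical peak is 89.6 GB/s, so the ~80 GB/s quoted above sits a little below the ceiling, as you'd expect from real-world overhead.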

I’m running Ollama as my inference server. Four models in the ring:

Model	Params	Size on Disk	Max Context	Type
qwen3.6:35b	36B total (MoE – 256 experts, 8 active)	23.9 GB	262K	Mixture of Experts
gemma4-e2b-abliterated	4.6B	3.4 GB	131K	Dense
qwen3:4b-instruct	4B	2.5 GB	256K	Dense
qwen2.5:1.5b-cpu	1.5B	1.0 GB	32K	Dense

The 35B MoE model is the big gun – my daily driver for coding and complex reasoning. The smaller models were candidates for a “sidecar” role: handling quick tasks like summarization or classification while the big model crunches harder problems.

Baseline: One Model at a Time

First, I benchmarked each model running alone to get clean numbers.

Model	GPU (tok/s)	CPU (tok/s)
qwen3.6:35b	17.8	–
gemma4-e2b-abliterated	42.9	28.7
qwen3:4b-instruct	26.2	19.6
qwen2.5:1.5b-cpu	–	53.4

The 35B model at 17.8 tok/s on an iGPU is genuinely impressive. That’s usable for interactive chat. The small models are blazing fast – Gemma at 42.9 tok/s on GPU is practically instant for short responses.

Looking at these numbers, I thought: what if I keep the 35B on GPU and run a small model on CPU simultaneously? Best of both worlds, right?

The Dual-Model Experiments

I ran four combinations, firing both models at the same time with identical prompts and measuring throughput.
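If you want to reproduce this kind of test, here is a minimal sketch of a harness. It is not the exact script behind the numbers below. It assumes Ollama is listening on its default local port and uses the `eval_count` and `eval_duration` fields from the `/api/generate` response to compute tok/s. The helper names (`generate`, `run_pair`) are made up for illustration.

```python
import json
import threading
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: default local Ollama endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


def generate(model: str, prompt: str, force_cpu: bool = False) -> dict:
    """Send one non-streaming request; num_gpu=0 keeps every layer on the CPU."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if force_cpu:
        payload["options"] = {"num_gpu": 0}
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())


def run_pair(jobs: list[tuple[str, bool]], prompt: str) -> dict[str, float]:
    """Start one request per (model, force_cpu) job at the same time; return tok/s per model."""
    results: dict[str, float] = {}

    def worker(model: str, force_cpu: bool) -> None:
        results[model] = tokens_per_second(generate(model, prompt, force_cpu))

    threads = [threading.Thread(target=worker, args=job) for job in jobs]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A call like `run_pair([("qwen3.6:35b", False), ("qwen2.5:1.5b-cpu", True)], prompt)` would correspond to the GPU + CPU pairing. Because `eval_duration` only covers token generation, model load time doesn't pollute the result.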

Test 1: Both Models on GPU

qwen3.6:35b (GPU) + gemma4-e2b (GPU)

Model	Alone	Simultaneous	Drop
qwen3.6:35b	17.8 tok/s	13.1 tok/s	-26%
gemma4-e2b	42.9 tok/s	25.3 tok/s	-41%

Both models fighting for the same GPU compute units and the same memory bus. The 35B model drops to 13.1 tok/s. Painful but expected.

Test 2: Big Model GPU + Tiny Model CPU (The “Best” Result)

qwen3.6:35b (GPU) + qwen2.5:1.5b (CPU)

Model	Alone	Simultaneous	Drop
qwen3.6:35b	17.8 tok/s	14.9 tok/s	-16%
qwen2.5:1.5b	53.4 tok/s	26.2 tok/s	-51%

This was the best result. The 1.5B model is tiny enough that its CPU inference doesn’t hammer memory bandwidth too hard. The big model only drops 16%. But the small model gets cut in half.

Test 3: Big Model GPU + Medium Model CPU-Forced

qwen3.6:35b (GPU) + gemma4-e2b (CPU, num_gpu=0)

Model	Alone	Simultaneous	Drop
qwen3.6:35b	17.8 tok/s	13.0 tok/s	-27%
gemma4-e2b	28.7 tok/s	13.4 tok/s	-53%

Forcing Gemma to CPU didn’t help: the 35B model lost 27% here versus 26% with both models on the GPU. The 4.6B model doing CPU inference generates enough memory traffic to compete with the GPU’s reads. Both models suffer.

Test 4: The Worst Case – KV Cache Explosion

qwen3.6:35b (GPU) + qwen3:4b-instruct (CPU, num_gpu=0)

Model	Alone	Simultaneous	Drop
qwen3.6:35b	17.8 tok/s	11.6 tok/s	-35%
qwen3:4b-instruct	19.6 tok/s	11.1 tok/s	-43%

This was the disaster scenario. The 4B instruct model supports a 256K context window, and its KV cache ballooned to 24.2 GB at full context. Add the 35B model’s 32 GB VRAM allocation and more than 56 GB of weights and cache were resident at once, all served by the same ~80 GB/s memory bus. Both models crawled.
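The Drop column in each of the four tables is just the relative change from the solo run. A quick sketch that recomputes it from the tok/s figures above:

```python
def drop_percent(alone: float, simultaneous: float) -> int:
    """Relative throughput change versus the solo run, rounded to a whole percent."""
    return round((simultaneous - alone) / alone * 100)


# (model, tok/s alone, tok/s simultaneous) taken from the four test tables
results = {
    "Test 1": [("qwen3.6:35b", 17.8, 13.1), ("gemma4-e2b", 42.9, 25.3)],
    "Test 2": [("qwen3.6:35b", 17.8, 14.9), ("qwen2.5:1.5b", 53.4, 26.2)],
    "Test 3": [("qwen3.6:35b", 17.8, 13.0), ("gemma4-e2b", 28.7, 13.4)],
    "Test 4": [("qwen3.6:35b", 17.8, 11.6), ("qwen3:4b-instruct", 19.6, 11.1)],
}

for test, rows in results.items():
    for model, alone, together in rows:
        print(f"{test}: {model} {drop_percent(alone, together)}%")
```

The small model loses a larger share of its throughput than the 35B model in every pairing.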

The Memory Architecture Problem

Here’s a diagram of what’s actually happening inside this machine:

+--------------------------------------------------+
|                  DDR5-5600 (96 GB)                |
|            ~80 GB/s shared bandwidth              |
+--------------------------------------------------+
        |                           |
   +---------+              +-----------+
   | CPU     |              | 780M iGPU |
   | Cores   |              | 12 CUs    |
   | (Zen 4) |              |           |
   +---------+              +-----------+
        |                        |
   CPU inference          GPU inference
   (model weights          (model weights
    in system RAM)          in VRAM/GTT)
        |                        |
        +------- SAME BUS ------+

The VRAM pool breaks down like this:

2 GB dedicated VRAM – physically reserved for the iGPU
46 GB GTT (Graphics Translation Table) – system RAM mapped into GPU address space
48 GB total GPU-accessible memory
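On Linux you can check these pool sizes yourself. The sketch below assumes the amdgpu kernel driver, which exposes `mem_info_vram_total` and `mem_info_gtt_total` in sysfs. The `card0` path is an assumption too; on some systems the iGPU shows up as `card1`.

```python
from pathlib import Path


def gpu_memory_gib(device_dir: str = "/sys/class/drm/card0/device") -> dict[str, float]:
    """Read the dedicated VRAM and GTT pool sizes (reported in bytes) and convert to GiB."""
    base = Path(device_dir)
    sizes = {}
    for name in ("mem_info_vram_total", "mem_info_gtt_total"):
        sizes[name] = int((base / name).read_text().strip()) / 2**30
    return sizes
```

Calling `gpu_memory_gib()` on a machine configured like this one should report roughly 2 GiB of VRAM and 46 GiB of GTT.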

When both a GPU model and a CPU model are running, they’re both streaming weights from the same DDR5 DIMMs through the same memory controller. The GPU doesn’t have its own GDDR6 with 300+ GB/s bandwidth like a discrete card. It’s sharing the same 80 GB/s pipe as everything else.

This is why the numbers get worse across the board. It’s not a compute bottleneck – it’s a memory bandwidth bottleneck.

The Real-World Conclusion

I was testing this because I wanted to run an agent framework (think: a planning model + an execution model working together). The idea was the big 35B model handles complex reasoning while a small model handles quick tool-calling or classification.

But here’s the thing: a planner/executor agent loop like this runs its steps sequentially, not in parallel.

The planner thinks, then the executor acts, then the planner thinks again. They’re taking turns. Which means at any given moment, only one model is actually generating tokens. The other is just sitting there, loaded in memory, doing nothing – but still occupying VRAM or RAM that could be used for bigger context windows or just… not being wasted.
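To make the turn-taking concrete, here is a minimal sketch of a planner/executor loop. It isn't taken from any particular agent framework. The `generate` callable and the role names are hypothetical stand-ins for whatever actually calls the model.

```python
from typing import Callable


def run_agent(task: str,
              generate: Callable[[str, str], str],
              max_steps: int = 3) -> list[str]:
    """Alternate planner and executor turns; only one model call is in flight at a time."""
    transcript: list[str] = []
    context = task
    for _ in range(max_steps):
        plan = generate("planner", context)    # the executor sits idle during this call
        result = generate("executor", plan)    # the planner sits idle during this call
        transcript += [plan, result]
        context = f"{task}\nLast result: {result}"
    return transcript
```

Each call blocks until the previous one returns, so a second loaded model never generates tokens at the same time as the first.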

So the dual-model setup gives you:

Worse throughput on the big model (11.6 to 14.9 tok/s across the four tests vs 17.8 tok/s alone)
No parallelism benefit in sequential agent workflows
Wasted memory keeping a second model loaded
Risk of OOM crashes (Ollama’s iGPU memory reporting has a known bug – issue #14953 – that can cause crashes with multiple loaded models)

The MoE Insight

Here’s the aha moment that made me feel silly for even trying this.

The qwen3.6:35b model is a Mixture of Experts architecture. It has 256 experts but only activates 8 per token. That means for any given token, it’s doing roughly the compute of a 4-5B parameter model while having the knowledge of a 36B parameter model.

Read that again. The big model already IS the small model in terms of per-token compute cost. MoE gives you the reasoning depth of 35B parameters with the inference speed of a much smaller model. Running a separate small model alongside it for “fast tasks” is solving a problem that doesn’t exist.

17.8 tok/s for 35B-class reasoning is already fast enough for everything I throw at it. Adding a second model only makes it slower.
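The same point shows up in the bandwidth numbers. This is a rough sketch that assumes generation is purely memory-bound at the ~80 GB/s quoted earlier and ignores caches and KV reads, so treat the outputs as ballpark bounds rather than measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tok/s if every token has to stream this many GB from memory."""
    return bandwidth_gb_s / gb_read_per_token


def max_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Upper bound on GB streamed per token, given an observed tok/s."""
    return bandwidth_gb_s / tokens_per_s


# If all 23.9 GB of weights were read for every token:
print(f"{max_tokens_per_second(80, 23.9):.1f} tok/s")  # prints "3.3 tok/s"

# What the measured 17.8 tok/s implies about data read per token:
print(f"{max_gb_per_token(80, 17.8):.1f} GB")  # prints "4.5 GB"
```

A dense model of this size would top out around 3 tok/s on this bus. The measured 17.8 tok/s is only possible because each token touches a small slice of the weights.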

Bonus: Ollama Storage Gotchas

While poking around, I found a few things worth mentioning:

Shared blobs save disk space. I had qwen3.6:35b, qwen3.6:latest, and qwen3.6:35b-nothink all listed as separate models. Turns out they all point to the same 23.9 GB blob on disk. Ollama uses content-addressed storage, so identical weights are stored once regardless of how many tags reference them.

Orphan blobs waste disk space. After deleting some models, I found a 12.9 GB orphan blob sitting in ~/.ollama/models/blobs/ that no tag referenced anymore. There’s no ollama prune command (yet), so I had to manually cross-reference blob hashes against manifest files and delete the orphan by hand. Check yours – you might be surprised.
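The manual cross-referencing can be scripted. The sketch below assumes the on-disk layout I'm aware of: JSON manifests under `manifests/` that list `config` and `layers` digests as `sha256:<hex>`, and blob files named `sha256-<hex>` under `blobs/`. Verify that against your own install before deleting anything. The function only lists candidates and never deletes.

```python
import json
from pathlib import Path


def find_orphan_blobs(models_dir: Path) -> list[Path]:
    """Return blob files that no manifest references. Read-only: deletes nothing."""
    referenced = set()
    for manifest in (models_dir / "manifests").rglob("*"):
        if not manifest.is_file():
            continue
        data = json.loads(manifest.read_text())
        entries = list(data.get("layers", []))
        if "config" in data:
            entries.append(data["config"])
        for entry in entries:
            # Manifests write digests as "sha256:<hex>"; blob files are named "sha256-<hex>".
            referenced.add(entry["digest"].replace(":", "-"))
    blobs_dir = models_dir / "blobs"
    return sorted(p for p in blobs_dir.iterdir()
                  if p.is_file() and p.name not in referenced)
```

Point it at your models directory, for example `find_orphan_blobs(Path.home() / ".ollama" / "models")`, and review the list by hand. Partially downloaded blobs may also show up as unreferenced.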

TL;DR

Running two LLMs simultaneously on a shared-memory APU is technically possible but practically pointless:

DDR5 bandwidth (~80 GB/s) is the bottleneck, not compute
Both models compete for the same memory bus regardless of CPU vs GPU assignment
Planner/executor agent loops run sequentially anyway, so there is no parallel benefit
MoE models like qwen3.6:35b already give you big-model smarts at small-model speeds
Just run one model. Use the freed memory for bigger context windows instead.

Save your 96 GB of RAM for what it’s actually good at: loading one big model with a massive context window. That’s where shared-memory APUs genuinely shine.

Tested on: Minisforum UM790Pro, AMD Ryzen 9 7940HS, 96 GB DDR5-5600, Ollama v0.9.x, Ubuntu Linux

Microsoft Is Killing 3D Viewer on July 1 — Here’s What I’m Using Instead

2026-05-28T00:00:00+00:00

If you work with 3D files on Windows, you need to know this: Microsoft is permanently removing 3D Viewer from the Microsoft Store on July 1, 2026. Not deprecating it quietly. Removing it. You won’t be able to download or reinstall it after that date.

This is the last domino in Microsoft’s retreat from 3D. Paint 3D was removed in November 2024. 3D Builder is gone. Windows Mixed Reality is dead. The entire “Creators Update” era from 2017 is officially over.

Why This Matters If You 3D Print

If you’re like me — working with STL files from Thingiverse, Printables, or your own designs — 3D Viewer was the default quick-look tool on Windows. Double-click an STL, see what it looks like, decide if it’s worth slicing. Simple.

Microsoft’s suggested replacement is Babylon.js Sandbox, a browser-based viewer. Problem is, it doesn’t support STL files. The most common format in 3D printing, and their official alternative can’t open it.

What I Switched To

I’ve been using GeometryViewer for the past few months. It’s a browser-based 3D viewer that handles everything I throw at it:

STL, OBJ, GLB, GLTF, 3MF, FBX, PLY, STEP — basically every format that matters
Drag and drop, no install required
Works offline (it’s a PWA — install it once and it works without internet)
Measurement tools, cross-sections, material previews
You can share models via URL — useful when someone on Discord asks “does this model look right?”

The key thing for me: it handles STL files with proper normals and gives you a realistic material preview. I can see what a print will actually look like before I slice it.

The Full Timeline of Microsoft’s 3D Exit

For context, here’s how we got here:

Product	Status
Remix 3D (model sharing)	Shut down
Windows Mixed Reality	Deprecated Dec 2023, removed in Win11 24H2
HoloLens 2	Production stopped Oct 2024, end of life 2028
Paint 3D	Removed from Store November 4, 2024
3D Builder	Removed
3D Viewer	Removed from Store July 1, 2026

Every single component of the 2017 “3D for Everyone” initiative is now either dead or has an end date.

What About Existing Installs?

If you already have 3D Viewer installed, it won’t be auto-deleted. It’ll keep working. But you won’t get security patches, and if you do a clean Windows install or get a new PC, you can’t reinstall it.

There was also a serious security vulnerability (CVE-2024-20677) in 3D Viewer’s FBX parser — remote code execution with a CVSS score of 7.8. Microsoft’s fix was to permanently disable FBX support rather than fix the bug. That should tell you how much investment this app was getting.

My Recommendation

Don’t wait for July 1. Switch now and get used to a new workflow before the deadline. For 3D printing specifically, GeometryViewer is the closest thing to what 3D Viewer did, with the added benefits that it works in any browser and supports more formats.

If you need something heavier (textures, rigging, animation), Blender is obviously the answer — but that’s overkill for just previewing an STL before printing.

The era of Microsoft caring about 3D on the desktop is over. The good news is that browser-based tools have gotten good enough that we don’t really need them to.

Josh Green