<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://joshgreen-dev.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://joshgreen-dev.github.io/" rel="alternate" type="text/html" /><updated>2026-05-29T14:40:02+00:00</updated><id>https://joshgreen-dev.github.io/feed.xml</id><title type="html">Josh Green</title><subtitle>Web developer, 3D printing, AI tinkerer</subtitle><entry><title type="html">I Ran Two Local LLMs on a Mini PC at Once — Benchmarks Show Why It’s Pointless</title><link href="https://joshgreen-dev.github.io/2026/05/28/dual-llm-benchmarks-mini-pc.html" rel="alternate" type="text/html" title="I Ran Two Local LLMs on a Mini PC at Once — Benchmarks Show Why It’s Pointless" /><published>2026-05-28T00:00:00+00:00</published><updated>2026-05-28T00:00:00+00:00</updated><id>https://joshgreen-dev.github.io/2026/05/28/dual-llm-benchmarks-mini-pc</id><content type="html" xml:base="https://joshgreen-dev.github.io/2026/05/28/dual-llm-benchmarks-mini-pc.html"><![CDATA[<hr />

<p>If you’ve been following my stuff, you know I’m all about squeezing maximum value out of minimal hardware. Mini PCs, home labs, self-hosted everything. So naturally, when I got my hands on a UM790Pro with 96 GB of DDR5, my first thought was: “Can I run two LLMs simultaneously?”</p>

<p>The answer is yes. But the better question is: <em>should</em> you?</p>

<p>Spoiler: no. And I have the benchmarks to prove it.</p>

<h2 id="the-setup">The Setup</h2>

<p>The UM790Pro is a beast for its size. Here’s what we’re working with:</p>

<ul>
  <li><strong>CPU:</strong> AMD Ryzen 9 7940HS</li>
  <li><strong>GPU:</strong> AMD Radeon 780M iGPU (integrated, shares system memory)</li>
  <li><strong>RAM:</strong> 96 GB DDR5-5600</li>
  <li><strong>VRAM Pool:</strong> 2 GB dedicated + 46 GB GTT = 48 GB total GPU-accessible memory</li>
  <li><strong>Memory Bandwidth:</strong> ~80 GB/s (shared between CPU and iGPU)</li>
</ul>

<p>That last point is the key to everything that follows. On a discrete GPU, the CPU and GPU have their own separate memory buses. On an APU like the 7940HS, the CPU and iGPU drink from the same straw. DDR5-5600 gives you roughly 80 GB/s, and both the CPU cores and the GPU compute units fight over every byte of it.</p>
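
<p>As a sanity check on that figure, here’s a back-of-the-envelope sketch. It assumes dual-channel memory with a 64-bit (8-byte) data path per channel, which is an assumption about this configuration rather than a measurement, and the function name is made up for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int = 2, bus_bytes: int = 8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    # Each transfer moves bus_bytes bytes on every channel.
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


# DDR5-5600 performs 5600 mega-transfers per second.
print(peak_bandwidth_gb_s(5600))
</code></pre></div></div>

<p>Under those assumptions the theoretical peak works out to 89.6 GB/s, so roughly 80 GB/s in practice is a plausible figure once real-world overhead is accounted for.</p>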

<p>I’m running Ollama as my inference server. Four models in the ring:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Params</th>
      <th>Size on Disk</th>
      <th>Max Context</th>
      <th>Type</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>qwen3.6:35b</td>
      <td>36B total (MoE – 256 experts, 8 active)</td>
      <td>23.9 GB</td>
      <td>262K</td>
      <td>Mixture of Experts</td>
    </tr>
    <tr>
      <td>gemma4-e2b-abliterated</td>
      <td>4.6B</td>
      <td>3.4 GB</td>
      <td>131K</td>
      <td>Dense</td>
    </tr>
    <tr>
      <td>qwen3:4b-instruct</td>
      <td>4B</td>
      <td>2.5 GB</td>
      <td>256K</td>
      <td>Dense</td>
    </tr>
    <tr>
      <td>qwen2.5:1.5b-cpu</td>
      <td>1.5B</td>
      <td>1.0 GB</td>
      <td>32K</td>
      <td>Dense</td>
    </tr>
  </tbody>
</table>

<p>The 35B MoE model is the big gun – my daily driver for coding and complex reasoning. The smaller models were candidates for a “sidecar” role: handling quick tasks like summarization or classification while the big model crunches harder problems.</p>

<h2 id="baseline-one-model-at-a-time">Baseline: One Model at a Time</h2>

<p>First, I benchmarked each model running alone to get clean numbers.</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>GPU (tok/s)</th>
      <th>CPU (tok/s)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>qwen3.6:35b</td>
      <td>17.8</td>
      <td>–</td>
    </tr>
    <tr>
      <td>gemma4-e2b-abliterated</td>
      <td>42.9</td>
      <td>28.7</td>
    </tr>
    <tr>
      <td>qwen3:4b-instruct</td>
      <td>26.2</td>
      <td>19.6</td>
    </tr>
    <tr>
      <td>qwen2.5:1.5b-cpu</td>
      <td>–</td>
      <td>53.4</td>
    </tr>
  </tbody>
</table>
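
<p>One way to collect numbers like these is to read the <code class="language-plaintext highlighter-rouge">eval_count</code> and <code class="language-plaintext highlighter-rouge">eval_duration</code> fields that Ollama returns from <code class="language-plaintext highlighter-rouge">/api/generate</code>. The following is a minimal sketch, not the exact harness behind the table: it assumes Ollama is listening on its default local port, and the helper names are placeholders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(response: dict):
    """Decode throughput from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str, force_cpu: bool = False):
    """Send one non-streaming request and return the measured tok/s."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if force_cpu:
        payload["options"] = {"num_gpu": 0}  # offload zero layers to the GPU
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">benchmark("qwen3:4b-instruct", "some prompt", force_cpu=True)</code> would give a CPU-only figure for that model. Averaging a few runs per model is a good idea, since a single run can be noisy.</p>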

<p>The 35B model at 17.8 tok/s on an iGPU is genuinely impressive. That’s usable for interactive chat. The small models are blazing fast – Gemma at 42.9 tok/s on GPU is practically instant for short responses.</p>

<p>Looking at these numbers, I thought: what if I keep the 35B on GPU and run a small model on CPU simultaneously? Best of both worlds, right?</p>

<h2 id="the-dual-model-experiments">The Dual-Model Experiments</h2>

<p>I ran four combinations, firing both models at the same time with identical prompts and measuring throughput.</p>
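
<p>“At the same time” matters here: if one request starts a few seconds early, it gets the bus to itself for a while and the numbers look better than they should. A rough sketch of one way to line the requests up, using a thread barrier. The <code class="language-plaintext highlighter-rouge">generate</code> argument is a hypothetical callable (anything that takes a model name and a prompt and returns a tok/s figure), not part of Ollama.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from concurrent.futures import ThreadPoolExecutor
from threading import Barrier


def run_simultaneously(jobs, generate):
    """Run generate(model, prompt) for every job at the same moment.

    jobs is a list of (model, prompt) pairs. Returns a dict that maps each
    model name to whatever generate returned for it.
    """
    barrier = Barrier(len(jobs))  # releases all workers together

    def worker(job):
        model, prompt = job
        barrier.wait()  # block until every worker is ready
        return model, generate(model, prompt)

    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return dict(pool.map(worker, jobs))
</code></pre></div></div>

<p>Threads are enough for this because the workers spend their time waiting on the inference server, not running Python code.</p>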

<h3 id="test-1-both-models-on-gpu">Test 1: Both Models on GPU</h3>

<p><strong>qwen3.6:35b (GPU) + gemma4-e2b (GPU)</strong></p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Alone</th>
      <th>Simultaneous</th>
      <th>Drop</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>qwen3.6:35b</td>
      <td>17.8 tok/s</td>
      <td>13.1 tok/s</td>
      <td>-26%</td>
    </tr>
    <tr>
      <td>gemma4-e2b</td>
      <td>42.9 tok/s</td>
      <td>25.3 tok/s</td>
      <td>-41%</td>
    </tr>
  </tbody>
</table>

<p>Both models fighting for the same GPU compute units and the same memory bus. The 35B model drops to 13.1 tok/s. Painful but expected.</p>
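
<p>The “Drop” column in this and the following tables is just the relative change against the solo run. A tiny sketch of the arithmetic, with a function name chosen for illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def percent_drop(alone: float, simultaneous: float):
    """Relative change in throughput, rounded to a whole percent."""
    return round((simultaneous - alone) / alone * 100)


# Figures from the Test 1 table.
print(percent_drop(17.8, 13.1))
print(percent_drop(42.9, 25.3))
</code></pre></div></div>

<p>This prints -26 and -41, matching the table.</p>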

<h3 id="test-2-big-model-gpu--tiny-model-cpu-the-best-result">Test 2: Big Model GPU + Tiny Model CPU (The “Best” Result)</h3>

<p><strong>qwen3.6:35b (GPU) + qwen2.5:1.5b (CPU)</strong></p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Alone</th>
      <th>Simultaneous</th>
      <th>Drop</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>qwen3.6:35b</td>
      <td>17.8 tok/s</td>
      <td>14.9 tok/s</td>
      <td>-16%</td>
    </tr>
    <tr>
      <td>qwen2.5:1.5b</td>
      <td>53.4 tok/s</td>
      <td>26.2 tok/s</td>
      <td>-51%</td>
    </tr>
  </tbody>
</table>

<p>This was the best result. The 1.5B model is tiny enough that its CPU inference doesn’t hammer memory bandwidth too hard. The big model only drops 16%. But the small model gets cut in half.</p>

<h3 id="test-3-big-model-gpu--medium-model-cpu-forced">Test 3: Big Model GPU + Medium Model CPU-Forced</h3>

<p><strong>qwen3.6:35b (GPU) + gemma4-e2b (CPU, num_gpu=0)</strong></p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Alone</th>
      <th>Simultaneous</th>
      <th>Drop</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>qwen3.6:35b</td>
      <td>17.8 tok/s</td>
      <td>13.0 tok/s</td>
      <td>-27%</td>
    </tr>
    <tr>
      <td>gemma4-e2b</td>
      <td>28.7 tok/s</td>
      <td>13.4 tok/s</td>
      <td>-53%</td>
    </tr>
  </tbody>
</table>

<p>Forcing Gemma to CPU didn’t help. The 4.6B model doing CPU inference generates enough memory traffic to compete with the GPU’s reads. Both models suffer.</p>

<h3 id="test-4-the-worst-case--kv-cache-explosion">Test 4: The Worst Case – KV Cache Explosion</h3>

<p><strong>qwen3.6:35b (GPU) + qwen3:4b-instruct (CPU, num_gpu=0)</strong></p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Alone</th>
      <th>Simultaneous</th>
      <th>Drop</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>qwen3.6:35b</td>
      <td>17.8 tok/s</td>
      <td>11.6 tok/s</td>
      <td>-35%</td>
    </tr>
    <tr>
      <td>qwen3:4b-instruct</td>
      <td>19.6 tok/s</td>
      <td>11.1 tok/s</td>
      <td>-43%</td>
    </tr>
  </tbody>
</table>

<p>This was the disaster scenario. The 4B instruct model supports a 256K context window, and its KV cache ballooned to 24.2 GB at full context. Combined with the 35B model’s 32 GB VRAM allocation, that put roughly 56 GB of the system’s 96 GB under two models that share a single memory bus. Both models crawled.</p>
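
<p>A long context window gets expensive because the KV cache grows linearly with context length. Here’s a sketch of the usual estimate for a transformer. The layer and head counts below are placeholders, not the real configuration of qwen3:4b-instruct, so don’t expect it to reproduce the 24.2 GB figure; it only shows how the size scales.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, context: int, bytes_per_value: int = 2):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value


# Placeholder architecture numbers, NOT the real qwen3:4b-instruct config.
for context in (8_192, 262_144):
    size_gb = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context=context) / 1e9
    print(f"{context} tokens: {size_gb:.1f} GB")
</code></pre></div></div>

<p>Whatever the real numbers are, going from an 8K to a 256K context multiplies the cache by 32, which is how a 2.5 GB model ends up with a memory footprint many times its own size.</p>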

<h2 id="the-memory-architecture-problem">The Memory Architecture Problem</h2>

<p>Here’s a diagram of what’s actually happening inside this machine:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+--------------------------------------------------+
|                  DDR5-5600 (96 GB)                |
|            ~80 GB/s shared bandwidth              |
+--------------------------------------------------+
        |                           |
   +---------+              +-----------+
   | CPU     |              | 780M iGPU |
   | Cores   |              | 12 CUs    |
   | (Zen 4) |              |           |
   +---------+              +-----------+
        |                        |
   CPU inference          GPU inference
   (model weights          (model weights
    in system RAM)          in VRAM/GTT)
        |                        |
        +------- SAME BUS ------+
</code></pre></div></div>

<p>The VRAM pool breaks down like this:</p>

<ul>
  <li><strong>2 GB dedicated VRAM</strong> – physically reserved for the iGPU</li>
  <li><strong>46 GB GTT (Graphics Translation Table)</strong> – system RAM mapped into GPU address space</li>
  <li><strong>48 GB total GPU-accessible memory</strong></li>
</ul>

<p>When both a GPU model and a CPU model are running, they’re both streaming weights from the same DDR5 DIMMs through the same memory controller. The GPU doesn’t have its own GDDR6 with 300+ GB/s bandwidth like a discrete card. It’s sharing the same 80 GB/s pipe as everything else.</p>

<p>This is why the numbers get worse across the board. It’s not a compute bottleneck – it’s a memory bandwidth bottleneck.</p>
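
<p>A common rule of thumb makes this concrete: for a dense model, every generated token has to stream the full set of weights from memory, so bandwidth divided by model size gives a ceiling on tok/s. The sketch below applies that to qwen3:4b-instruct. The even split of the bus between two models is a simplifying assumption, and the estimate ignores KV cache reads and other overhead.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float):
    """Upper bound on tok/s when each token streams weights_read_gb from memory."""
    return bandwidth_gb_s / weights_read_gb


# qwen3:4b-instruct is dense and 2.5 GB on disk, so each token reads about 2.5 GB.
alone = decode_ceiling_tok_s(80, 2.5)
# Hypothetical even split of the bus while a second model is running.
shared = decode_ceiling_tok_s(80 / 2, 2.5)
print(alone, shared)
</code></pre></div></div>

<p>That gives a ceiling of 32 tok/s alone and 16 tok/s with the bus split in half. The measured 26.2 tok/s on GPU sits just under the solo ceiling, which is what a bandwidth-bound workload looks like. The rule only holds when every weight is read for every token, so size on disk overstates the per-token traffic for an MoE model.</p>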

<h2 id="the-real-world-conclusion">The Real-World Conclusion</h2>

<p>I was testing this because I wanted to run an agent framework (think: a planning model + an execution model working together). The idea was the big 35B model handles complex reasoning while a small model handles quick tool-calling or classification.</p>

<p>But here’s the thing: <strong>agent frameworks run tasks sequentially, not in parallel.</strong></p>

<p>The planner thinks, then the executor acts, then the planner thinks again. They’re taking turns. Which means at any given moment, only one model is actually generating tokens. The other is just sitting there, loaded in memory, doing nothing – but still occupying VRAM or RAM that could be used for bigger context windows or just… not being wasted.</p>
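
<p>In code, the turn-taking looks something like the sketch below. This is a generic illustration, not any particular framework’s API: <code class="language-plaintext highlighter-rouge">plan</code> and <code class="language-plaintext highlighter-rouge">execute</code> are hypothetical callables standing in for calls to the big and small model.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def agent_loop(task: str, plan, execute, max_steps: int = 3):
    """Alternate between a planner and an executor, one call at a time."""
    transcript = []
    observation = task
    for _ in range(max_steps):
        step = plan(observation)     # planner generates, executor sits idle
        observation = execute(step)  # executor generates, planner sits idle
        transcript.append((step, observation))
    return transcript
</code></pre></div></div>

<p>Each call blocks until the previous one returns, so there is never a moment where both models are generating.</p>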

<p>So the dual-model setup gives you:</p>

<ol>
  <li><strong>Worse throughput</strong> on the big model (11.6 to 14.9 tok/s across the four tests vs 17.8 tok/s alone)</li>
  <li><strong>No parallelism benefit</strong> in sequential agent workflows</li>
  <li><strong>Wasted memory</strong> keeping a second model loaded</li>
  <li><strong>Risk of OOM crashes</strong> (Ollama’s iGPU memory reporting has a known bug – <a href="https://github.com/ollama/ollama/issues/14953">issue #14953</a> – that can cause crashes with multiple loaded models)</li>
</ol>

<h2 id="the-moe-insight">The MoE Insight</h2>

<p>Here’s the aha moment that made me feel silly for even trying this.</p>

<p>The qwen3.6:35b model is a Mixture of Experts architecture. It has 256 experts but only activates 8 per token. That means for any given token, it’s doing roughly the compute of a 4-5B parameter model while having the knowledge of a 36B parameter model.</p>

<p>Read that again. The big model already IS the small model in terms of per-token compute cost. MoE gives you the reasoning depth of 36B parameters with the inference speed of a much smaller model. Running a separate small model alongside it for “fast tasks” is solving a problem that doesn’t exist.</p>
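
<p>The routing arithmetic is simple enough to check. A small sketch, with a function name made up for illustration:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def active_expert_fraction(active: int, total: int):
    """Share of the expert weights touched for a single token."""
    return active / total


# 8 of 256 experts are routed per token.
print(active_expert_fraction(8, 256))
</code></pre></div></div>

<p>Only about 3% of the expert weights are touched per token. The attention layers, embeddings, and router run for every token regardless, so the effective active parameter count is higher than 3% of the total.</p>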

<p>17.8 tok/s for 35B-class reasoning is already fast enough for everything I throw at it. Adding a second model only makes it slower.</p>

<h2 id="bonus-ollama-storage-gotchas">Bonus: Ollama Storage Gotchas</h2>

<p>While poking around, I found a few things worth mentioning:</p>

<p><strong>Shared blobs save disk space.</strong> I had <code class="language-plaintext highlighter-rouge">qwen3.6:35b</code>, <code class="language-plaintext highlighter-rouge">qwen3.6:latest</code>, and <code class="language-plaintext highlighter-rouge">qwen3.6:35b-nothink</code> all listed as separate models. Turns out they all point to the same 23.9 GB blob on disk. Ollama uses content-addressed storage, so identical weights are stored once regardless of how many tags reference them.</p>

<p><strong>Orphan blobs waste disk space.</strong> After deleting some models, I found a 12.9 GB orphan blob sitting in <code class="language-plaintext highlighter-rouge">~/.ollama/models/blobs/</code> that no tag referenced anymore. There’s no <code class="language-plaintext highlighter-rouge">ollama prune</code> command (yet), so I had to manually cross-reference blob hashes against manifest files and delete the orphan by hand. Check yours – you might be surprised.</p>
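
<p>The cross-referencing can be scripted. The sketch below makes some assumptions about the on-disk layout that you should verify on your own install: that manifests are JSON files under <code class="language-plaintext highlighter-rouge">manifests/</code> with a <code class="language-plaintext highlighter-rouge">config.digest</code> and a list of <code class="language-plaintext highlighter-rouge">layers[].digest</code> entries in <code class="language-plaintext highlighter-rouge">sha256:…</code> form, and that blob files are named <code class="language-plaintext highlighter-rouge">sha256-…</code>. The function names are placeholders, and nothing here deletes anything.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
from pathlib import Path


def referenced_digests(manifests_dir: Path):
    """Collect every blob digest mentioned by any manifest file."""
    digests = set()
    for path in manifests_dir.rglob("*"):
        if not path.is_file():
            continue
        try:
            manifest = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip anything that is not a JSON manifest
        entries = [manifest.get("config") or {}] + manifest.get("layers", [])
        digests.update(e["digest"] for e in entries if "digest" in e)
    return digests


def orphan_blobs(models_dir: Path):
    """Return blob files that no manifest references (nothing is deleted)."""
    # Assumed naming: digest "sha256:abc" is stored as file "sha256-abc".
    used = {d.replace(":", "-") for d in referenced_digests(models_dir / "manifests")}
    blobs = (models_dir / "blobs").iterdir()
    return sorted(b for b in blobs if b.name not in used)
</code></pre></div></div>

<p>Running <code class="language-plaintext highlighter-rouge">orphan_blobs(Path.home() / ".ollama" / "models")</code> gives a list of candidates. Review it before deleting anything by hand, because a partially downloaded model would also show up as unreferenced.</p>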

<h2 id="tldr">TL;DR</h2>

<p>Running two LLMs simultaneously on a shared-memory APU is technically possible but practically pointless:</p>

<ul>
  <li>DDR5 bandwidth (~80 GB/s) is the bottleneck, not compute</li>
  <li>Both models compete for the same memory bus regardless of CPU vs GPU assignment</li>
  <li>Agent frameworks run sequentially anyway – no parallel benefit</li>
  <li>MoE models like qwen3.6:35b already give you big-model smarts at small-model speeds</li>
  <li>Just run one model. Use the freed memory for bigger context windows instead.</li>
</ul>

<p>Save your 96 GB of RAM for what it’s actually good at: loading one big model with a massive context window. That’s where shared-memory APUs genuinely shine.</p>

<hr />

<p><em>Tested on: Minisforum UM790Pro, AMD Ryzen 9 7940HS, 96 GB DDR5-5600, Ollama v0.9.x, Ubuntu Linux</em></p>]]></content><author><name></name></author><category term="ai" /><category term="llm" /><category term="minipc" /><category term="selfhosted" /><summary type="html"><![CDATA[Dual-model benchmarks on a 96GB UM790Pro. The shared DDR5 bus ruins everything, but MoE saves the day.]]></summary></entry><entry><title type="html">Microsoft Is Killing 3D Viewer on July 1 — Here’s What I’m Using Instead</title><link href="https://joshgreen-dev.github.io/2026/05/28/microsoft-kills-3d-viewer-what-now.html" rel="alternate" type="text/html" title="Microsoft Is Killing 3D Viewer on July 1 — Here’s What I’m Using Instead" /><published>2026-05-28T00:00:00+00:00</published><updated>2026-05-28T00:00:00+00:00</updated><id>https://joshgreen-dev.github.io/2026/05/28/microsoft-kills-3d-viewer-what-now</id><content type="html" xml:base="https://joshgreen-dev.github.io/2026/05/28/microsoft-kills-3d-viewer-what-now.html"><![CDATA[<p>If you work with 3D files on Windows, you need to know this: Microsoft is permanently removing 3D Viewer from the Microsoft Store on <strong>July 1, 2026</strong>. Not deprecating it quietly. Removing it. You won’t be able to download or reinstall it after that date.</p>

<p>This is the last domino in Microsoft’s retreat from 3D. Paint 3D was removed in November 2024. 3D Builder is gone. Windows Mixed Reality is dead. The entire “Creators Update” era from 2017 is officially over.</p>

<h2 id="why-this-matters-if-you-3d-print">Why This Matters If You 3D Print</h2>

<p>If you’re like me — working with STL files from Thingiverse, Printables, or your own designs — 3D Viewer was the default quick-look tool on Windows. Double-click an STL, see what it looks like, decide if it’s worth slicing. Simple.</p>

<p>Microsoft’s suggested replacement is Babylon.js Sandbox, a browser-based viewer. Problem is, <strong>it doesn’t support STL files</strong>. The most common format in 3D printing, and their official alternative can’t open it.</p>

<h2 id="what-i-switched-to">What I Switched To</h2>

<p>I’ve been using <a href="https://geometryviewer.com">GeometryViewer</a> for the past few months. It’s a browser-based 3D viewer that handles everything I throw at it:</p>

<ul>
  <li><strong>STL, OBJ, GLB, GLTF, 3MF, FBX, PLY, STEP</strong> — basically every format that matters</li>
  <li>Drag and drop, no install required</li>
  <li>Works offline (it’s a PWA — install it once and it works without internet)</li>
  <li>Measurement tools, cross-sections, material previews</li>
  <li>You can share models via URL — useful when someone on Discord asks “does this model look right?”</li>
</ul>

<p>The key thing for me: it handles STL files with proper normals and gives you a realistic material preview. I can see what a print will actually look like before I slice it.</p>

<h2 id="the-full-timeline-of-microsofts-3d-exit">The Full Timeline of Microsoft’s 3D Exit</h2>

<p>For context, here’s how we got here:</p>

<table>
  <thead>
    <tr>
      <th>Product</th>
      <th>Status</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Remix 3D (model sharing)</td>
      <td>Shut down</td>
    </tr>
    <tr>
      <td>Windows Mixed Reality</td>
      <td>Deprecated Dec 2023, removed in Win11 24H2</td>
    </tr>
    <tr>
      <td>HoloLens 2</td>
      <td>Production stopped Oct 2024, end of life 2028</td>
    </tr>
    <tr>
      <td>Paint 3D</td>
      <td>Removed from Store November 4, 2024</td>
    </tr>
    <tr>
      <td>3D Builder</td>
      <td>Removed</td>
    </tr>
    <tr>
      <td><strong>3D Viewer</strong></td>
      <td><strong>Removed from Store July 1, 2026</strong></td>
    </tr>
  </tbody>
</table>

<p>Every single component of the 2017 “3D for Everyone” initiative is now dead.</p>

<h2 id="what-about-existing-installs">What About Existing Installs?</h2>

<p>If you already have 3D Viewer installed, it won’t be auto-deleted. It’ll keep working. But you won’t get security patches, and if you do a clean Windows install or get a new PC, you can’t reinstall it.</p>

<p>There was also a serious security vulnerability (CVE-2024-20677) in 3D Viewer’s FBX parser — remote code execution with a CVSS score of 7.8. Microsoft’s fix was to permanently disable FBX support rather than fix the bug. That should tell you how much investment this app was getting.</p>

<h2 id="my-recommendation">My Recommendation</h2>

<p>Don’t wait for July 1. Switch now and get used to a new workflow before the deadline. For 3D printing specifically, <a href="https://geometryviewer.com">GeometryViewer</a> is the closest thing to what 3D Viewer did, with the added benefits that it works in any browser and supports more formats.</p>

<p>If you need something heavier (textures, rigging, animation), Blender is obviously the answer — but that’s overkill for just previewing an STL before printing.</p>

<p>The era of Microsoft caring about 3D on the desktop is over. The good news is that browser-based tools have gotten good enough that we don’t really need them to.</p>]]></content><author><name></name></author><category term="3dprinting" /><category term="windows" /><category term="tools" /><summary type="html"><![CDATA[Microsoft is permanently removing 3D Viewer from the Microsoft Store on July 1, 2026. As someone who works with STL and OBJ files daily, here's what I switched to.]]></summary></entry></feed>