Ollama Memory-Leak Watchdog and Hot-Swap Wrapper for Local LLM Servers That Auto-Recovers Before VRAM Runs Out and Queues Requests During Model Reload

dev tool • weekend hack • multiple requests

Search interest in 'Ollama VRAM leak 2026' has spiked, and the community workaround is a systemctl or cron job that restarts Ollama daily. LM Studio has no headless mode. Local-LLM users running 24/7 inference on consumer GPUs (12-16GB) keep getting OOM-killed mid-session. The gap is a small production wrapper that does three things. It monitors the slope of VRAM growth and restarts Ollama at a safe threshold rather than on a wall clock. It holds inbound requests during the 30-second restart window and, if Ollama is not back in time, answers with a graceful 503 + Retry-After rather than a connection error. It supports a request-side model swap that warms the new model on a second GPU before tearing down the first. Aimed at solo developers running their own LLM endpoints.
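One way to restart on a memory signal instead of a wall clock is to fit a line through recent VRAM readings and restart once the projection crosses a limit. The Go sketch below is only an illustration of that idea: the names Sample, SlopeMiBPerMin and ShouldRestart, the 11000 MiB limit, and the 10-minute horizon are all assumptions made for the example, not part of Ollama or any existing tool.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one VRAM reading. A real watchdog would fill these from a
// poller; here the caller supplies them.
type Sample struct {
	At      time.Time
	UsedMiB float64
}

// SlopeMiBPerMin returns the least-squares slope of used VRAM over time,
// in MiB per minute. It returns 0 when there are fewer than two samples
// or when all samples share one timestamp.
func SlopeMiBPerMin(samples []Sample) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	t0 := samples[0].At
	var sumX, sumY, sumXY, sumXX float64
	for _, s := range samples {
		x := s.At.Sub(t0).Minutes()
		sumX += x
		sumY += s.UsedMiB
		sumXY += x * s.UsedMiB
		sumXX += x * x
	}
	den := n*sumXX - sumX*sumX
	if den == 0 {
		return 0
	}
	return (n*sumXY - sumX*sumY) / den
}

// ShouldRestart reports whether the latest reading is already at or past
// limitMiB, or is projected to reach it within horizon at the current slope.
func ShouldRestart(samples []Sample, limitMiB float64, horizon time.Duration) bool {
	if len(samples) == 0 {
		return false
	}
	last := samples[len(samples)-1].UsedMiB
	if last >= limitMiB {
		return true
	}
	slope := SlopeMiBPerMin(samples)
	if slope <= 0 {
		return false // flat or shrinking usage: nothing to project
	}
	return last+slope*horizon.Minutes() >= limitMiB
}

func main() {
	// Synthetic readings: one per minute, growing by 100 MiB each time.
	start := time.Now()
	var samples []Sample
	for i := 0; i < 6; i++ {
		samples = append(samples, Sample{
			At:      start.Add(time.Duration(i) * time.Minute),
			UsedMiB: 10000 + 100*float64(i),
		})
	}
	fmt.Printf("slope: %.1f MiB/min\n", SlopeMiBPerMin(samples))
	fmt.Println("restart:", ShouldRestart(samples, 11000, 10*time.Minute))
}
```

A sliding window of the last few minutes of samples is probably enough here. A longer window smooths out the normal jump when a model loads but reacts more slowly to a fast leak.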

builder note

Don't fork Ollama; wrap it. Sit in front as a thin reverse proxy, watch /api/ps and nvidia-smi, hold requests for up to 30s during recovery, and respond with a 503 and a Retry-After header if recovery takes longer. Ship as a 50-line Go binary or Docker sidecar. The Ollama team has signaled they won't fix the leak themselves, which means this wrapper has a long shelf life.
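The hold-then-503 behavior described above can be sketched as a small http.Handler that wraps whatever sits upstream. Everything named here (Gate, BeginRecovery, EndRecovery, probe, stubUpstream, the 127.0.0.1:8080 listen address) is hypothetical and chosen for the example. The only Ollama-specific assumption is its default address, http://127.0.0.1:11434, passed as an argument. Run without arguments, the program exercises the gate against an in-memory stub so it works without Ollama installed.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"os"
	"strconv"
	"sync"
	"time"
)

// Gate sits in front of the upstream and holds requests while it restarts.
type Gate struct {
	upstream      http.Handler
	holdFor       time.Duration // longest a request waits for recovery
	retryAfterSec int           // value sent in the Retry-After header
	mu            sync.Mutex
	ready         chan struct{} // closed while the upstream is healthy
}

func NewGate(upstream http.Handler, holdFor time.Duration, retryAfterSec int) *Gate {
	ready := make(chan struct{})
	close(ready) // start in the healthy state
	return &Gate{upstream: upstream, holdFor: holdFor, retryAfterSec: retryAfterSec, ready: ready}
}

// BeginRecovery makes new requests wait; call it just before a restart.
func (g *Gate) BeginRecovery() {
	g.mu.Lock()
	defer g.mu.Unlock()
	select {
	case <-g.ready: // healthy: swap in an open channel so requests block
		g.ready = make(chan struct{})
	default: // already recovering
	}
}

// EndRecovery releases every held request.
func (g *Gate) EndRecovery() {
	g.mu.Lock()
	defer g.mu.Unlock()
	select {
	case <-g.ready: // already healthy
	default:
		close(g.ready)
	}
}

func (g *Gate) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	g.mu.Lock()
	ready := g.ready
	g.mu.Unlock()

	timer := time.NewTimer(g.holdFor)
	defer timer.Stop()
	select {
	case <-ready: // healthy, or recovered while this request was held
		g.upstream.ServeHTTP(w, r)
	case <-timer.C: // still down after holdFor: tell the caller to retry
		w.Header().Set("Retry-After", strconv.Itoa(g.retryAfterSec))
		http.Error(w, "model server restarting", http.StatusServiceUnavailable)
	case <-r.Context().Done(): // caller hung up while waiting
	}
}

// stubUpstream stands in for Ollama so the demo runs anywhere.
var stubUpstream = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "ok")
})

// probe sends one in-memory request and returns status and Retry-After.
func probe(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/ps", nil))
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	if len(os.Args) > 1 { // e.g. go run . http://127.0.0.1:11434
		target, err := url.Parse(os.Args[1])
		if err != nil {
			log.Fatal(err)
		}
		proxy := httputil.NewSingleHostReverseProxy(target)
		log.Fatal(http.ListenAndServe("127.0.0.1:8080", NewGate(proxy, 30*time.Second, 30)))
	}
	gate := NewGate(stubUpstream, 50*time.Millisecond, 30)
	fmt.Println(probe(gate)) // healthy: passes through
	gate.BeginRecovery()
	fmt.Println(probe(gate)) // held for 50ms, then 503 with Retry-After
	gate.EndRecovery()
	fmt.Println(probe(gate)) // healthy again
}
```

Using a channel that is closed while healthy means held requests wake up the moment recovery finishes, without polling. The sketch does not cover requests that were already streaming from Ollama when the restart began; those would need to be drained or failed separately.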

landscape (4 existing solutions)

Production LLM serving exists (vLLM) and consumer LLM exploration exists (LM Studio, Ollama). Nothing fills the 'always-on personal LLM endpoint on a single consumer GPU' niche with production-grade reliability.

Ollama itself: No built-in watchdog. Memory leaks at long uptimes are a known issue, but the project's stance is 'restart it'. No request queue during restart.

vLLM: Production-grade serving, but built for data-center hardware. The setup curve is too steep for solo devs running a single 16GB RTX card on their desktop.

systemctl restart cron: What everyone is doing today. Drops in-flight requests, has no queue, and times restarts by wall clock rather than by a memory signal, so you restart either too often (cold-start tax) or too late (OOM).

LM Studio: Excellent GUI but explicitly not a server. No headless mode; requires the app to be running interactively. Wrong product for the 'I want my AI sidecar always-on' use case.
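The memory signal that the cron approach lacks can be read from nvidia-smi. The sketch below shells out to nvidia-smi with its CSV query flags and parses one "used, total" line per GPU. The type and function names (GPUMem, parseSMI, readGPUs) are made up for the example, and it assumes nvidia-smi is on PATH and reports plain MiB numbers for both fields.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// GPUMem is one GPU's memory reading in MiB.
type GPUMem struct {
	UsedMiB, TotalMiB int
}

// Frac returns used/total in the range 0..1, or 0 if total is unknown.
func (m GPUMem) Frac() float64 {
	if m.TotalMiB == 0 {
		return 0
	}
	return float64(m.UsedMiB) / float64(m.TotalMiB)
}

// parseSMI parses lines of the form "10240, 16384", one line per GPU.
func parseSMI(out string) ([]GPUMem, error) {
	var gpus []GPUMem
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		fields := strings.Split(line, ",")
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected nvidia-smi line %q", line)
		}
		used, err := strconv.Atoi(strings.TrimSpace(fields[0]))
		if err != nil {
			return nil, fmt.Errorf("used memory in %q: %w", line, err)
		}
		total, err := strconv.Atoi(strings.TrimSpace(fields[1]))
		if err != nil {
			return nil, fmt.Errorf("total memory in %q: %w", line, err)
		}
		gpus = append(gpus, GPUMem{UsedMiB: used, TotalMiB: total})
	}
	return gpus, nil
}

// readGPUs runs nvidia-smi and returns one reading per GPU.
func readGPUs() ([]GPUMem, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.used,memory.total",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, fmt.Errorf("running nvidia-smi: %w", err)
	}
	return parseSMI(string(out))
}

func main() {
	gpus, err := readGPUs()
	if err != nil {
		fmt.Println("no reading:", err)
		return
	}
	for i, g := range gpus {
		fmt.Printf("GPU %d: %d/%d MiB (%.0f%%)\n", i, g.UsedMiB, g.TotalMiB, 100*g.Frac())
	}
}
```

Polling this every few seconds gives the watchdog a per-GPU time series. Note that it measures total GPU memory in use, so other processes on the same card (a desktop compositor, a game) would show up in the numbers too.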

sources (3)

other https://www.glukhov.org/llm-hosting/comparisons/hosting-llms... "Search for 'Ollama VRAM leak 2026' has spiked, with workarounds including scheduling daily restarts via systemctl or cron job." 2026-04-12

other https://open-techstack.com/blog/ollama-vs-lm-studio-2026/ "LM Studio has no headless mode, which is a significant limitation for server deployments." 2026-03-28

other https://localllm.in/blog/complete-guide-ollama-alternatives "The Complete Guide to Ollama Alternatives: 8 Best Local LLM Tools for 2026." 2026-04-18

local-llm, ollama, reliability, vram, watchdog