Ollama Memory-Leak Watchdog And Hot-Swap Wrapper For Local LLM Servers That Auto-Recovers Before VRAM Hits The OOM And Queues Requests During Model Reload
Search interest in 'Ollama VRAM leak 2026' has spiked, and the common community workaround is a systemctl/cron job that restarts Ollama daily. LM Studio is built around a desktop app rather than an unattended service. Local-LLM users running 24/7 inference on consumer GPUs (12-16GB) keep getting OOM-killed mid-session. The gap is a small production wrapper that does four things: it monitors the slope of VRAM growth; restarts Ollama at a safe threshold (not on a wall clock); holds inbound requests during the roughly 30-second restart window, returning a graceful 503 + Retry-After to any caller it cannot serve in time rather than a connection error; and supports a request-triggered model swap that warms the new model on a second GPU before tearing down the first. Aimed at solo developers running their own LLM endpoints.
Don't fork Ollama; wrap it. Sit in front as a thin reverse proxy, watch /api/ps and nvidia-smi, hold requests for up to 30s during recovery, and answer anything still waiting after that with a 503 and a Retry-After header. Ship as a 50-line Go binary or a Docker sidecar. The Ollama team has signaled they won't fix the leak themselves, which means this wrapper has a long shelf life.
landscape (4 existing solutions)
Production LLM serving exists (vLLM) and consumer LLM exploration exists (LM Studio, Ollama). Nothing fills the 'always-on personal LLM endpoint on a single consumer GPU' niche with production-grade reliability.