oLLM: Running Huge-Context LLMs on Consumer GPUs
2025-09-23
oLLM is a lightweight Python library for running inference with large-context LLMs such as gpt-oss-20B and qwen3-next-80B on consumer GPUs with 8 GB of VRAM (e.g., a $200 Nvidia 3060 Ti), handling contexts of up to 100k tokens. It achieves this without quantization by offloading layer weights and the KV cache to SSD and by using techniques such as FlashAttention-2 and chunked MLP. Supporting a range of LLMs, oLLM offers a user-friendly API for large-scale text-processing tasks such as analyzing contracts, summarizing medical literature, and processing massive log files.
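To make the weight-offloading idea concrete, here is a minimal PyTorch sketch of the general pattern: each transformer layer's weights live on SSD and are streamed into VRAM only while that layer runs, so a single layer's memory footprint is all the GPU ever holds. This is an illustration of the technique, not oLLM's actual code or API; the class and file names below (`DiskOffloadedStack`, `layer_*.pt`) are made up for the example.

```python
# Illustrative sketch (not oLLM's implementation): run a decoder stack
# layer by layer, loading each layer's weights from SSD just before it is
# needed and freeing them afterwards, so only one layer sits in VRAM at a time.
import torch
import torch.nn as nn


class DiskOffloadedStack(nn.Module):
    def __init__(self, layer_paths, d_model):
        super().__init__()
        self.layer_paths = layer_paths  # one weight file per layer, stored on SSD
        self.d_model = d_model

    def _load_layer(self, path, device):
        # Rebuild the layer skeleton, then copy weights SSD -> RAM -> VRAM.
        layer = nn.TransformerEncoderLayer(self.d_model, nhead=8, batch_first=True)
        layer.load_state_dict(torch.load(path, map_location="cpu"))
        return layer.to(device)

    @torch.no_grad()
    def forward(self, hidden, device="cuda"):
        for path in self.layer_paths:
            layer = self._load_layer(path, device)
            hidden = layer(hidden)      # only this layer's weights occupy VRAM
            del layer
            torch.cuda.empty_cache()    # release VRAM before loading the next layer
        return hidden


# Toy usage: pre-save per-layer weights to disk, then stream them at inference time.
paths = []
for i in range(4):
    layer = nn.TransformerEncoderLayer(512, nhead=8, batch_first=True)
    torch.save(layer.state_dict(), f"layer_{i}.pt")
    paths.append(f"layer_{i}.pt")

model = DiskOffloadedStack(paths, d_model=512)
out = model(torch.randn(1, 16, 512, device="cuda"), device="cuda")
```

The trade-off is obvious but worth stating: every forward pass re-reads the layer weights from disk, so throughput is bounded by SSD bandwidth rather than VRAM capacity, which is what lets an 8 GB card handle models far larger than its memory.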
Development
Low-Resource