
GPU and CPU Utilization While Running Open-Source LLMs Locally using Ollama

Running Open-Source Large Language Models Locally with Ollama

Large Language Models (LLMs) have revolutionized natural language processing, but running them locally can be daunting because of the significant hardware resources they require. Many users opt for open-source models over closed-source ones because of their accessibility and cost-effectiveness. In this article, we look at how open-source LLMs use GPU and CPU resources when run locally, using two DeepSeek models as examples.

Getting Started with Ollama

To simplify the process of running and managing LLMs locally, you can use Ollama. Here’s how you can get started:

  1. Download and install Ollama from the official website: https://ollama.com
  2. Alternatively, you can install Ollama via the command line using the following command:

curl -fsSL https://ollama.com/install.sh | sh

Downloading and Running Models Locally

Once Ollama is installed, you can easily download and run LLMs using simple command line commands. For instance:

Download and run DeepSeek-R1 7B:

ollama run deepseek-r1:7b

Download and run DeepSeek-R1 32B:

ollama run deepseek-r1:32b

Executing either of the above commands will download the model (if it is not already present) and start an interactive inference session.
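Beyond the interactive CLI, a running Ollama instance also exposes a local HTTP API (by default on port 11434). Below is a minimal sketch of calling its /api/generate endpoint from Python; it assumes the Ollama server is running and the model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    # stream=False asks for one complete JSON response instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a single non-streaming generation request to a running Ollama server."""
    payload = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires `ollama serve` and a pulled model):
# print(generate("deepseek-r1:7b", "Why is the sky blue?"))
```

This is handy when you want to script experiments (for example, timing inference while logging GPU utilization) rather than typing into the interactive prompt.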

Experiment Setup and Hardware Used

In this study, we utilized Ollama to run two different DeepSeek models:

  1. DeepSeek-R1 7B (small model)
  2. DeepSeek-R1 32B (large model)

The hardware configuration included:

  • GPU: NVIDIA RTX A4000 (16GB VRAM)
  • CPU: Intel Core i7-13700
  • RAM: 32GB

Insights into Model Storage and Execution

DeepSeek-R1 7B

Running the smaller model required 4GB of disk storage and ran entirely on the GPU, fitting comfortably within the 16GB VRAM. During inference, the model expanded in memory due to internal computations but stayed within the VRAM limits, ensuring GPU-only execution.

DeepSeek-R1 32B

The larger model demanded 20GB of disk storage, and its memory footprint grew to roughly 48GB during inference, far beyond the GPU's 16GB of VRAM. Ollama therefore offloaded part of the model to system RAM and ran in a hybrid CPU-GPU mode, balancing the workload to keep inference running smoothly.
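The disk footprints above are consistent with back-of-the-envelope math: a quantized model occupies roughly parameters × bits-per-weight / 8 bytes. A sketch, assuming about 4.5 effective bits per weight (a typical figure for 4-bit quantization formats; the exact value depends on the build Ollama ships):

```python
def quantized_size_gb(n_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough on-disk size of a quantized model: parameters × bits / 8, in GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Illustrative estimates (4.5 bits/weight is an assumption, not a measured value):
print(f"7B  ≈ {quantized_size_gb(7):.1f} GB")   # in the ballpark of the ~4GB observed
print(f"32B ≈ {quantized_size_gb(32):.1f} GB")  # in the ballpark of the ~20GB observed
```

The same arithmetic explains why the 32B model cannot fit in 16GB of VRAM once runtime overhead is added on top of the weights.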

Understanding VRAM Usage Expansion

The increase in memory usage during inference comes from the model's internal state, not just its weights. Transformer-based LLMs like DeepSeek maintain a key-value (KV) cache for every attention layer and head, and this cache grows linearly with the length of the context. Activation tensors produced during the forward pass add further overhead on top of the static weights.
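The key-value memory growth described above can be estimated with simple arithmetic: the cache stores one key and one value vector per layer, per head, per token. A minimal sketch, using hypothetical architecture values for illustration (not DeepSeek's actual configuration):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Memory for the KV cache: a key and a value tensor (factor of 2)
    per layer, per head, per token, at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class config (illustrative values), fp16 cache, 4096-token context:
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096) / 1e9
print(f"KV cache at 4096 tokens: {gb:.2f} GB")  # → KV cache at 4096 tokens: 2.15 GB
```

Because the cache scales with context length, a long conversation can add gigabytes of memory on top of the weights, which is exactly the expansion observed during inference.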

Monitoring Performance

Real-time monitoring of GPU and CPU utilization revealed key insights:

  • Smaller models ran efficiently on the GPU, providing fast inference.
  • Larger models automatically fell back to hybrid CPU-GPU execution once the VRAM limit was exceeded.
  • Resource utilization monitoring aids in optimizing model selection based on available hardware.
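On NVIDIA hardware, one way to do this monitoring programmatically is to poll nvidia-smi. The sketch below is our own illustrative helper, not part of Ollama; the live query assumes nvidia-smi is on the PATH:

```python
import subprocess

def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    util, mem = (field.strip() for field in csv_line.split(","))
    return {"gpu_util_pct": int(util), "vram_used_mib": int(mem)}

def sample_gpu_stats() -> dict:
    """Query the first GPU once; requires the NVIDIA driver's nvidia-smi tool."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(out.splitlines()[0])

# Example of the CSV shape nvidia-smi emits for the query above:
print(parse_gpu_stats("87, 14321"))  # {'gpu_util_pct': 87, 'vram_used_mib': 14321}
```

Calling sample_gpu_stats() in a loop while a model generates text makes the GPU-only versus hybrid execution patterns described above directly visible.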

Conclusion

Running open-source LLMs locally using tools like Ollama offers a cost-effective alternative to cloud-based solutions. DeepSeek models with Ollama provide a seamless experience, dynamically managing hardware limitations. Understanding the GPU-CPU balance is crucial for efficient LLM deployment.

For more insightful articles, stay tuned!

Published via Towards AI
