
GPU and CPU Utilization While Running Open-Source LLMs Locally using Ollama

Running Open-Source Large Language Models Locally with Ollama

Large Language Models (LLMs) have revolutionized natural language processing, but running them locally can be daunting because of the significant hardware resources they require. Many users opt for open-source models over closed-source ones because of their accessibility and cost-effectiveness. In this article, we look at how open-source LLMs use GPU and CPU resources when run locally, using two DeepSeek models as examples.

Getting Started with Ollama

To simplify the process of running and managing LLMs locally, you can use Ollama. Here’s how you can get started:

  1. Download and install Ollama from the official website: https://ollama.com
  2. Alternatively, you can install Ollama via the command line using the following command:

curl -fsSL https://ollama.com/install.sh | sh

Downloading and Running Models Locally

Once Ollama is installed, you can easily download and run LLMs using simple command line commands. For instance:

Download and run DeepSeek-R1 7B:

ollama run deepseek-r1:7b

Download and run DeepSeek-R1 32B:

ollama run deepseek-r1:32b

Executing either of the above commands will download the model (if it is not already present) and start an interactive inference session.
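Beyond the interactive CLI, a running Ollama instance also exposes a local HTTP API (by default on port 11434). Below is a minimal sketch of calling its /api/generate endpoint from Python; it assumes the Ollama server is running and the model has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    # stream=False asks for one complete JSON response instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a single non-streaming generation request to a running Ollama server."""
    payload = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires `ollama serve` and a pulled model):
# print(generate("deepseek-r1:7b", "Why is the sky blue?"))
```

This is handy when you want to script experiments (for example, timing inference while logging GPU utilization) rather than typing into the interactive prompt.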

Experiment Setup and Hardware Used

In this study, we utilized Ollama to run two different DeepSeek models:

  1. DeepSeek-R1 7B (small model)
  2. DeepSeek-R1 32B (large model)

The hardware configuration included:

  • GPU: NVIDIA RTX A4000 (16GB VRAM)
  • CPU: Intel Core i7-13700
  • RAM: 32GB

Insights into Model Storage and Execution

DeepSeek-R1 7B

Running the smaller model required 4GB of disk storage and ran entirely on the GPU, fitting comfortably within the 16GB VRAM. During inference, the model expanded in memory due to internal computations but stayed within the VRAM limits, ensuring GPU-only execution.

DeepSeek-R1 32B

The larger model demanded 20GB of disk storage, and its memory footprint grew to roughly 48GB during inference, far beyond the GPU's 16GB of VRAM. Ollama therefore offloaded part of the model to system RAM and ran in a hybrid CPU-GPU mode, balancing the workload to keep inference running smoothly.
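The disk footprints above are consistent with back-of-the-envelope math: a quantized model occupies roughly parameters × bits-per-weight / 8 bytes. A sketch, assuming about 4.5 effective bits per weight (a typical figure for 4-bit quantization formats; the exact value depends on the build Ollama ships):

```python
def quantized_size_gb(n_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough on-disk size of a quantized model: parameters × bits / 8, in GB."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Illustrative estimates (4.5 bits/weight is an assumption, not a measured value):
print(f"7B  ≈ {quantized_size_gb(7):.1f} GB")   # in the ballpark of the ~4GB observed
print(f"32B ≈ {quantized_size_gb(32):.1f} GB")  # in the ballpark of the ~20GB observed
```

The same arithmetic explains why the 32B model cannot fit in 16GB of VRAM once runtime overhead is added on top of the weights.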

Understanding VRAM Usage Expansion

The increase in memory usage during inference comes from the model's internal state, not just its weights. Transformer-based LLMs like DeepSeek maintain a key-value (KV) cache for every attention layer and head, and this cache grows linearly with the length of the context. Activation tensors produced during the forward pass add further overhead on top of the static weights.
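The key-value memory growth described above can be estimated with simple arithmetic: the cache stores one key and one value vector per layer, per head, per token. A minimal sketch, using hypothetical architecture values for illustration (not DeepSeek's actual configuration):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Memory for the KV cache: a key and a value tensor (factor of 2)
    per layer, per head, per token, at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class config (illustrative values), fp16 cache, 4096-token context:
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096) / 1e9
print(f"KV cache at 4096 tokens: {gb:.2f} GB")  # → KV cache at 4096 tokens: 2.15 GB
```

Because the cache scales with context length, a long conversation can add gigabytes of memory on top of the weights, which is exactly the expansion observed during inference.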

Monitoring Performance

Real-time monitoring of GPU and CPU utilization revealed key insights:

  • Smaller models ran efficiently on the GPU, providing fast inference.
  • Larger models automatically fell back to hybrid CPU-GPU execution once the VRAM limit was exceeded.
  • Resource utilization monitoring aids in optimizing model selection based on available hardware.
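On NVIDIA hardware, one way to do this monitoring programmatically is to poll nvidia-smi. The sketch below is our own illustrative helper, not part of Ollama; the live query assumes nvidia-smi is on the PATH:

```python
import subprocess

def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    util, mem = (field.strip() for field in csv_line.split(","))
    return {"gpu_util_pct": int(util), "vram_used_mib": int(mem)}

def sample_gpu_stats() -> dict:
    """Query the first GPU once; requires the NVIDIA driver's nvidia-smi tool."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_stats(out.splitlines()[0])

# Example of the CSV shape nvidia-smi emits for the query above:
print(parse_gpu_stats("87, 14321"))  # {'gpu_util_pct': 87, 'vram_used_mib': 14321}
```

Calling sample_gpu_stats() in a loop while a model generates text makes the GPU-only versus hybrid execution patterns described above directly visible.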

Conclusion

Running open-source LLMs locally using tools like Ollama offers a cost-effective alternative to cloud-based solutions. DeepSeek models with Ollama provide a seamless experience, dynamically managing hardware limitations. Understanding the GPU-CPU balance is crucial for efficient LLM deployment.

For more insightful articles, stay tuned!

Published via Towards AI
