Large Language Models (LLMs) are no longer limited to expensive GPU clusters. Thanks to quantization techniques and open-source inference frameworks, developers and organizations can now run powerful models locally on CPUs, GPUs, or hybrid systems.
This article explains how LLM quantization works, how it enables CPU-bound and GPU-bound inference, and which open-source tools make deployment practical.
What Is LLM Quantization?
Quantization reduces the precision of model weights to decrease memory usage and accelerate inference.
Instead of storing parameters as:
- FP32 (32-bit floating point)
- FP16 (16-bit floating point)
Quantized models instead store weights at lower integer precision:
- INT8 (8-bit integers)
- INT4 (4-bit integers)
These are typically produced and packaged through quantization methods and file formats such as GPTQ, AWQ, and GGUF.
This significantly reduces hardware requirements.
For example, the weights alone for a 7B-parameter model take roughly:
- FP16: ~14 GB of RAM
- INT8: ~7 GB
- INT4: ~3.5 GB
(Activations and the KV cache add further overhead on top of this.)
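These figures follow directly from bytes-per-parameter arithmetic. A minimal sketch (the helper name `estimate_weight_gb` is ours, not from any library):

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory: parameters * bits / 8 bytes, in GB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B-parameter model at different precisions:
for bits in (32, 16, 8, 4):
    print(f"{bits:2d}-bit: ~{estimate_weight_gb(7e9, bits):.1f} GB")
```

Halving the bits per weight halves the memory, which is why INT4 fits a 7B model on an ordinary laptop.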
This makes local inference possible on:
- laptops
- CPU servers
- Apple Silicon
- small GPU machines
- edge devices
CPU-Bound LLM Inference
CPU inference is slower than GPU inference but extremely practical for:
- private inference
- edge deployments
- low-cost infrastructure
- compliance-sensitive workloads
- automation agents
Popular CPU Inference Frameworks
llama.cpp
The most widely used CPU inference engine.
Key features:
- Runs GGUF models
- Supports AVX/AVX2/AVX512
- Apple Metal acceleration
- optional Vulkan and CUDA backends
- Extremely low memory footprint
Example (recent llama.cpp builds name the binary llama-cli; older releases used main):
./llama-cli -m model.gguf -p "Explain quantization"
llama.cpp works especially well for:
- Mistral-7B
- LLaMA-based models
- small instruction-tuned models
GGUF Model Format
GGUF is a single-file model format, used by llama.cpp and related runtimes, optimized for fast local inference.
Benefits:
- memory-mapped loading
- faster startup
- optimized tensor layout
- portable across devices
Common quantization variants:
- Q4_K_M
- Q5_K_M
- Q8_0
These trade off file size and speed against output quality: Q4_K_M is the smallest and fastest of the three, while Q8_0 stays closest to full precision.
GPU-Bound LLM Inference
GPU inference becomes important when you need:
- higher throughput
- larger models
- concurrent users
- lower latency
- production APIs
Quantization still matters because it:
- reduces VRAM usage
- increases token throughput
- enables larger models on smaller GPUs
vLLM for GPU Inference
One of the most popular open-source GPU inference servers is vLLM.
Key features:
- PagedAttention memory system
- OpenAI-compatible API server
- tensor parallelism
- high throughput batching
- optimized CUDA kernels
Example server (the --quantization awq flag expects an AWQ-quantized checkpoint, e.g. TheBloke/Mistral-7B-Instruct-v0.2-AWQ):
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq
vLLM works well with:
- AWQ models
- GPTQ models
- FP16 models
- multi-GPU deployments
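Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can talk to it. A dependency-free sketch that builds the request body (the model name and localhost URL in the comment are assumptions for a local deployment):

```python
import json

def build_chat_request(model: str, user_prompt: str, max_tokens: int = 128) -> str:
    """Build the JSON body for a POST to /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)

payload = build_chat_request("local-model", "Explain quantization")
# Send with any HTTP client, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$payload"
print(payload)
```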
GPTQ and AWQ Quantization
Two common GPU-focused quantization methods are:
GPTQ
Post-training quantization designed for GPUs.
Advantages:
- strong quality retention
- fast inference
- widely supported
AWQ
Activation-aware quantization optimized for inference.
Advantages:
- improved accuracy vs standard INT4
- strong performance on modern GPUs
- supported in vLLM and Transformers
Hybrid CPU + GPU Inference
Some systems use hybrid execution:
- the GPU runs as many transformer layers as fit in VRAM
- the CPU runs the remaining layers and handles orchestration
- weights that fit nowhere else are offloaded to system RAM or disk
- the inference server manages batching
This allows running larger models than GPU memory alone would allow.
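A back-of-envelope way to size such a split is to divide the usable VRAM by the per-layer weight footprint (the helper and the numbers below are illustrative assumptions, not taken from any framework):

```python
def gpu_layers_that_fit(vram_gb: float, layer_gb: float, reserve_gb: float = 1.0) -> int:
    """How many transformer layers fit in VRAM, keeping a reserve for the KV cache."""
    usable = max(vram_gb - reserve_gb, 0.0)
    return int(usable // layer_gb)

# E.g. a 7B model in 4-bit: ~3.5 GB of weights spread over 32 layers,
# on an 8 GB GPU. The result can feed an n_gpu_layers-style offload option.
print(gpu_layers_that_fit(8.0, 3.5 / 32))
```

Here the answer exceeds 32, meaning the whole model fits on the GPU; with a larger model, the remainder of the layers would stay on the CPU.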
Frameworks that support hybrid workflows include:
- llama.cpp (CPU + Metal/CUDA)
- HuggingFace Transformers
- vLLM
- Ollama
Choosing the Right Setup
A simple rule of thumb:
CPU inference is best for:
- automation
- agents
- private inference
- small models
- low concurrency
GPU inference is best for:
- production APIs
- chat applications
- multi-user systems
- larger models
- high throughput
Quantization makes both practical.
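The rule of thumb above can be written down as a tiny decision helper (purely illustrative; the thresholds are our own assumptions):

```python
def pick_backend(concurrent_users: int, latency_sensitive: bool, model_gb: float) -> str:
    """Crude backend chooser following the CPU-vs-GPU rule of thumb."""
    if concurrent_users > 1 or latency_sensitive or model_gb > 8:
        return "gpu"
    return "cpu"

print(pick_backend(1, False, 3.5))   # single user, small quantized model
print(pick_backend(50, True, 14.0))  # production chat workload
```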
Example Deployment Stack
A typical open-source local inference stack might look like:
- Quantized GGUF model
- llama.cpp runtime
- FastAPI wrapper
- LiteLLM gateway
- pgvector memory store
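Wired together, the CPU stack above amounts to a thin HTTP layer in front of the runtime. A dependency-free sketch of the wrapper's core (generate_fn stands in for a real llama.cpp call; all names here are our own):

```python
import json

def make_handler(generate_fn):
    """Wrap a text-generation callable as a JSON-in/JSON-out request handler."""
    def handle(raw_body: str) -> str:
        request = json.loads(raw_body)
        completion = generate_fn(request["prompt"])
        return json.dumps({"completion": completion})
    return handle

# Stub generator standing in for a llama.cpp-backed model:
handler = make_handler(lambda prompt: f"echo: {prompt}")
print(handler('{"prompt": "Explain quantization"}'))
```

In a real deployment this handler would be mounted on a FastAPI route, with the LiteLLM gateway and pgvector store layered in front of and beside it.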
For GPU deployments:
- AWQ model
- vLLM inference server
- OpenAI-compatible API
- reverse proxy (NGINX)
- observability tools
Final Thoughts
Quantization has fundamentally changed how LLMs can be deployed. Instead of requiring massive GPU clusters, developers can now run powerful models on:
- CPUs
- consumer GPUs
- Apple Silicon
- edge hardware
- private servers
Open-source tools and formats like llama.cpp, vLLM, GGUF, GPTQ, and AWQ make local and private inference practical for real-world applications.
As models continue to improve and quantization techniques advance, CPU and GPU inference will become increasingly accessible to developers and businesses of all sizes.
If you’re building private inference infrastructure, automation agents, or AI-augmented development workflows, quantized LLMs are one of the most important technologies to understand today.