What is Intel AMX?

Intel Advanced Matrix Extensions (AMX) is an instruction set designed to improve AI inference performance on CPUs. It enhances the execution of matrix multiplication operations—a core component of many deep learning workloads—directly on Intel Xeon processors. AMX is part of Intel’s broader move to make CPUs more viable for AI inference by introducing architectural accelerations that can significantly improve throughput without relying on GPUs.

From a private infrastructure standpoint, support for AMX is relevant to organizations aiming to run AI workloads locally, without relying on public cloud GPUs. AMX enables high-throughput inference using CPUs already present in enterprise environments.

Hardware Support

Currently, 4th, 5th, and 6th generation Intel Xeon processors support the AMX instruction set. When selecting an OpenMetal server, choose a V3 or V4 server; our Medium, Large, and XL servers all support the AMX instruction set.

Supported AMX-enabled OpenMetal Processors (as of 02/21/2025):

  • Intel Xeon Gold 6526Y
  • Intel Xeon Gold 6430
  • Intel Xeon Gold 6530
  • Intel Xeon Gold 5416S
  • Intel Xeon Silver 4510

Software Support

System

Ensure that your system is running a recent Linux kernel; AMX requires version 5.19 or later. In our testing, we used Ubuntu 24.04 with kernel 6.8.0. Depending on your distribution, AMX may also require microcode updates or additional kernel configuration before it is exposed.
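
Before installing any inference stack, it is worth confirming that the kernel and CPU actually expose AMX. Below is a minimal sketch, assuming a Linux host where the relevant CPU feature flags appear in /proc/cpuinfo as amx_tile, amx_bf16, and amx_int8:

```python
# Minimal check that the kernel and CPU expose AMX (assumes a Linux host).
import platform

def has_amx() -> bool:
    """Return True if /proc/cpuinfo lists the AMX feature flags."""
    with open("/proc/cpuinfo") as f:
        flags = f.read()
    # amx_tile is the base tile feature; amx_bf16 and amx_int8 cover the supported data types.
    return all(flag in flags for flag in ("amx_tile", "amx_bf16", "amx_int8"))

print("Kernel:", platform.release())   # should be 5.19 or newer
print("AMX available:", has_amx())
```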

Inference Backend

You’ll need the right inference backend to ensure your workload actually runs with AMX. The three tools we had success with were llama.cpp, FastChat with the Intel Extension for PyTorch (IPEX), and OpenVINO.

Llama.cpp

A C++ inference framework for quantized LLMs. It supports CPU execution with high efficiency. Llama.cpp benefits from instruction-level optimizations when compiled for AMX-compatible CPUs.

  • A general-purpose inference tool that runs efficiently across various hardware.
  • Forms the backend for Ollama, a wrapper that simplifies model deployment.
  • https://github.com/ggml-org/llama.cpp 
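
If you prefer to drive llama.cpp from Python rather than its command-line tools, the community-maintained llama-cpp-python bindings wrap the same backend. The sketch below is illustrative only: the GGUF file path is a placeholder, and the AMX code path is used automatically when the underlying library was built for an AMX-capable CPU.

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is a placeholder; any GGUF quantization supported by your build works.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3.2-3b-q8_0.gguf", n_threads=32)  # match n_threads to physical cores
output = llm("Summarize what Intel AMX does in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```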

OpenVINO

Intel’s inference toolkit offering runtime optimization and quantization support. OpenVINO can deliver efficient performance on AMX-enabled CPUs through specific CPU plugins and execution providers.

  • Intel’s optimization toolkit for inference.
  • Provides the most comprehensive AI inference optimizations for Intel CPUs.
  • Requires more configuration compared to other tools but delivers the best performance when properly tuned.
  • https://github.com/openvinotoolkit/openvino.genai 
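
As a rough illustration of CPU inference with the openvino-genai package, here is a minimal sketch. It assumes you have already exported a model to OpenVINO IR format (for example with optimum-intel); the model directory name is a placeholder.

```python
# Minimal sketch with the openvino-genai package (pip install openvino-genai).
# "./llama-3.2-3b-ov" is a placeholder for a model already converted to OpenVINO IR.
import openvino_genai

pipe = openvino_genai.LLMPipeline("./llama-3.2-3b-ov", "CPU")  # the CPU plugin uses AMX when available
print(pipe.generate("Explain Intel AMX briefly.", max_new_tokens=64))
```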

FastChat with Intel Extension for PyTorch (IPEX)

IPEX enhances PyTorch’s CPU backend using Intel optimizations. With AMX support, IPEX can reduce inference time and increase tokens-per-second throughput by leveraging matrix multiplication acceleration.
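
Below is a minimal sketch of the IPEX optimization step, assuming a Hugging Face causal LM and a BF16-capable CPU; the model name is only an example, and FastChat serving sits on top of the same optimized model.

```python
# Minimal sketch: wrap a Hugging Face model with Intel Extension for PyTorch.
# The model name is an example; any causal LM you serve through FastChat works similarly.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = ipex.optimize(model, dtype=torch.bfloat16)  # the BF16 path is where AMX acceleration applies

inputs = tokenizer("Intel AMX accelerates", return_tensors="pt")
with torch.no_grad():
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```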


Performance Results

Benchmarks comparing 4th Gen and 5th Gen Intel Xeon processors show a 2x increase in throughput when AMX is enabled. Using quantized models such as Llama 3.2 3B (Q4_K_M or Q8_0 formats), we observed inference speeds of up to 57 tokens per second on AMX-enabled CPUs, compared to 28 tokens per second without AMX. These results were obtained using llama.cpp compiled with the appropriate flags.

We confirmed the results across different CPUs using llama.cpp with meta-llama/llama3.2:3B and deepseek.r1, both at Q8_0 quantization. With lower-bit quantization, you can push these results further; we achieved 80 t/s on an Intel Xeon Gold 6530 with 4-bit quantization!

Note: AMX natively supports only BF16 and INT8 data types. However, llama.cpp contributors have developed support for various other quantizations. Depending on the tool and the quantization method you choose, you may be limited in which quantization formats you can use. See this list for the AMX-supported quantization types: https://github.com/ggml-org/llama.cpp/blob/b4747/ggml/src/ggml-cpu/amx/common.h#L84

[Chart: Intel 5th Gen Xeon, AMX-enabled tokens per second]

CPU vs GPU Inference Context

While GPUs still dominate training and high-throughput inference, the landscape is shifting:

  • CPUs with AMX offer viable throughput for smaller, quantized models.
  • CPU inference is well-suited for intermittent workloads, low-latency requirements, and secure environments where GPUs are unavailable or cost-prohibitive.
  • Workload viability depends on target throughput. For instance, a CPU delivering 30 t/s can handle ~2,500 inference requests/day at 1,024 tokens per request, while a GPU like the A100 may achieve 130 t/s, supporting ~11,000 requests/day (see the quick calculation below).
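
The requests-per-day figures above come from simple arithmetic: tokens generated per day divided by tokens per request. A quick sanity check:

```python
# Rough capacity estimate: tokens generated per day divided by tokens per request.
SECONDS_PER_DAY = 86_400
TOKENS_PER_REQUEST = 1_024

for label, tps in [("CPU with AMX", 30), ("NVIDIA A100", 130)]:
    requests_per_day = tps * SECONDS_PER_DAY / TOKENS_PER_REQUEST
    print(f"{label}: ~{requests_per_day:,.0f} requests/day")
# CPU with AMX: ~2,531 requests/day
# NVIDIA A100: ~10,969 requests/day
```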

Comparison to GPUs and Virtualization Options

In OpenStack-based environments, AMX-enabled CPUs can offer an alternative to virtualized GPUs. When GPUs are used, time-slicing and Multi-Instance GPU (MIG) modes offer resource sharing options:

  • Time-Slicing: Allows multiple users to take turns using the GPU. Suitable for testing and development.
  • MIG: Creates isolated GPU instances, offering predictable performance with lower latency.

AMX-powered CPU inference complements these approaches, especially when workloads do not require full GPU performance or when resource constraints necessitate CPU-based execution.

Final Thoughts

It’s clear that AI is here to stay, and we’re excited about the progress that both software and hardware are making. AMX is another move in the right direction. The performance we saw, paired with modestly sized models, gives your team usable inference from the CPU; 40-50 tokens per second is more than enough for many workloads.

AI workloads are becoming more CPU-friendly. In the long term, it’s not practical for GPUs to be the sole source of computing power for running AI workloads. We need AI development to be more accessible to engineers. We applaud Intel for its success with this latest improvement in lower-cost inference!
