What is Intel AMX?
Intel Advanced Matrix Extensions (AMX) is an instruction set extension designed to improve AI inference performance on CPUs. It accelerates matrix multiplication operations, a core component of many deep learning workloads, directly on Intel Xeon processors. AMX is part of Intel's broader move to make CPUs more viable for AI inference by introducing architectural acceleration that can significantly improve throughput without relying on GPUs.
From a private infrastructure standpoint, support for AMX is relevant to organizations aiming to run AI workloads locally, without relying on public cloud GPUs. AMX enables high-throughput inference using CPUs already present in enterprise environments.
Hardware Support
Currently, 4th, 5th, and 6th generation Intel Xeon processors support the AMX instruction set. When selecting an OpenMetal server, you'll want to choose a V3 or V4 server. Our Medium, Large, and XL servers all support AMX.
Supported AMX-enabled OpenMetal Processors (as of 02/21/2025):
- Intel Xeon Gold 6526Y
- Intel Xeon Gold 6430
- Intel Xeon Gold 6530
- Intel Xeon Gold 5416S
- Intel Xeon Silver 4510
Software Support
System
Ensure that your system is running a recent Linux kernel; AMX support requires version 5.19 or later. In our testing, we used Ubuntu 24.04 with kernel 6.8.0. Depending on the distribution, AMX may also require microcode updates or additional kernel configuration.
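As a quick sanity check, the CPU flags amx_tile, amx_bf16, and amx_int8 appear in /proc/cpuinfo once both the processor and the kernel expose AMX. A minimal Python sketch (the helper name is our own):

```python
# Check /proc/cpuinfo for the AMX feature flags (amx_tile, amx_bf16, amx_int8).
# If none appear, either the CPU lacks AMX or the kernel is too old (< 5.19).
def detect_amx_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = line.split(":", 1)[1].split()
                return sorted(flag for flag in flags if flag.startswith("amx"))
    return []

if __name__ == "__main__":
    found = detect_amx_flags()
    print("AMX flags:", ", ".join(found) if found else "none detected")
```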
Inference Backend
You’ll need to use the right tool to ensure your inference runs with AMX. The three tools we had success with were llama.cpp, FastChat with Intel PyTorch Extension (IPEX), and OpenVINO.
Llama.cpp
A C++ inference framework for quantized LLMs. It supports CPU execution with high efficiency. Llama.cpp benefits from instruction-level optimizations when compiled for AMX-compatible CPUs.
- A general-purpose inference tool that runs efficiently across various hardware.
- Forms the backend for Ollama, a wrapper that simplifies model deployment.
- https://github.com/ggml-org/llama.cpp
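For reference, here is a minimal sketch of calling a quantized GGUF model from Python through the llama-cpp-python bindings; the model path and thread count are placeholders, and the underlying llama.cpp build must target the host CPU for the AMX code paths to be used:

```python
# Minimal llama.cpp inference sketch via the llama-cpp-python bindings.
# The GGUF path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.2-3b-instruct.Q8_0.gguf",  # placeholder model file
    n_threads=32,   # match to the physical cores on your Xeon
    n_ctx=4096,     # context window
)

output = llm("Summarize what Intel AMX does in one sentence.", max_tokens=128)
print(output["choices"][0]["text"])
```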
OpenVINO
Intel’s inference toolkit offering runtime optimization and quantization support. OpenVINO can deliver efficient performance on AMX-enabled CPUs through its CPU plugin and execution providers such as the OpenVINO Execution Provider for ONNX Runtime.
- Intel’s optimization toolkit for inference.
- Provides the most comprehensive AI inference optimizations for Intel CPUs.
- Requires more configuration compared to other tools but delivers the best performance when properly tuned.
- https://github.com/openvinotoolkit/openvino.genai
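A minimal sketch using the openvino.genai LLMPipeline; the model directory is a placeholder and assumes the model has already been converted to OpenVINO IR format (for example with optimum-intel):

```python
# Minimal OpenVINO GenAI sketch: load an LLM already exported to OpenVINO IR
# and run it on the CPU device, where AMX kernels are used on supported Xeons.
import openvino_genai

pipe = openvino_genai.LLMPipeline("./llama-3.2-3b-int8-ov", "CPU")  # placeholder model dir
print(pipe.generate("Explain Intel AMX in one sentence.", max_new_tokens=128))
```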
FastChat with Intel PyTorch Extension (IPEX)
IPEX enhances PyTorch’s CPU backend using Intel optimizations. With AMX support, IPEX can reduce inference time and increase tokens-per-second throughput by leveraging matrix multiplication acceleration.
- A platform for training and serving LLM-based chatbots
- Installing Intel’s PyTorch Extension (IPEX) enables AMX acceleration.
- https://intel.github.io/intel-extension-for-pytorch/cpu/latest/
- https://github.com/lm-sys/FastChat
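Outside of FastChat's serving stack, the underlying IPEX optimization pattern looks roughly like this; the model ID and generation settings are illustrative, not FastChat's exact code path:

```python
# Sketch of IPEX-optimized CPU inference with a Hugging Face causal LM.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# ipex.optimize applies Intel's CPU optimizations; with bfloat16 weights on an
# AMX-capable Xeon, the heavy matrix multiplications can be dispatched to AMX.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Explain Intel AMX in one sentence.", return_tensors="pt")
with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```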
Performance Results
Benchmarks comparing 4th Gen and 5th Gen Intel Xeon processors show a 2x increase in throughput when AMX is enabled. Using quantized models such as Llama 3.2 3B (Q4_K_M or Q8_0 formats), we observed inference speeds of up to 57 tokens per second on AMX-enabled CPUs, compared to 28 tokens per second without AMX. These results were obtained using llama.cpp built with AMX support enabled.
We confirmed the results across two different CPUs using llama.cpp with meta-llama/llama3.2:3B and deepseek.r1 at Q8_0 quantization. With lower-bit quantization, you can push these results further: we achieved 80 t/s on an Intel Xeon Gold 6530 with 4-bit quantization!
Note: AMX officially supports only BF16 and INT8 data types. However, llama.cpp contributors have developed support for various other quantization formats. Depending on the tool and the quantization method you choose, you may be limited in which quantized model formats you can use. See this list for the AMX-supported quantization types: https://github.com/ggml-org/llama.cpp/blob/b4747/ggml/src/ggml-cpu/amx/common.h#L84
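Tokens per second here means generated tokens divided by wall-clock time. A rough measurement sketch using the llama-cpp-python bindings (the model path and settings are placeholders; llama.cpp's own llama-bench tool is the more rigorous option):

```python
# Rough tokens-per-second measurement via the llama-cpp-python bindings.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3.2-3b.Q8_0.gguf", n_threads=32, n_ctx=2048)  # placeholder

start = time.perf_counter()
out = llm("Write a short paragraph about CPUs and AI inference.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> ~{generated / elapsed:.1f} t/s")
```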
CPU vs GPU Inference Context
While GPUs still dominate training and high-throughput inference, the landscape is shifting:
- CPUs with AMX offer viable throughput for smaller, quantized models.
- CPU inference is well-suited for intermittent workloads, low-latency requirements, and secure environments where GPUs are unavailable or cost-prohibitive.
- Workload viability depends on target throughput. For instance, a CPU delivering 30 t/s can handle roughly 2,500 inference requests per day at 1,024 tokens per request, while a GPU like the A100 at around 130 t/s supports roughly 11,000 requests per day (see the quick calculation below).
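The quick arithmetic behind those estimates, assuming sustained generation around the clock with no batching or request overhead:

```python
# Back-of-the-envelope capacity estimate for the throughput figures above.
SECONDS_PER_DAY = 24 * 60 * 60
TOKENS_PER_REQUEST = 1024

for label, tokens_per_second in [("CPU with AMX", 30), ("NVIDIA A100", 130)]:
    requests_per_day = tokens_per_second * SECONDS_PER_DAY / TOKENS_PER_REQUEST
    print(f"{label}: ~{requests_per_day:,.0f} requests/day")

# CPU with AMX: ~2,531 requests/day
# NVIDIA A100: ~10,969 requests/day
```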
Comparison to GPUs and Virtualization Options
In OpenStack-based environments, AMX-enabled CPUs can offer an alternative to virtualized GPUs. When GPUs are used, time-slicing and Multi-Instance GPU (MIG) modes offer resource sharing options:
- Time-Slicing: Allows multiple users to take turns using the GPU. Suitable for testing and development.
- MIG: Creates isolated GPU instances, offering predictable performance with lower latency.
AMX-powered CPU inference complements these approaches, especially when workloads do not require full GPU performance or when resource constraints necessitate CPU-based execution.
Final Thoughts
It’s clear that AI is here to stay, and we’re excited about the progress that both software and hardware are making. AMX is another move in the right direction. The performance we saw, paired with modestly sized models, gives your team usable inference from the CPU; 40-50 tokens per second is more than enough for many workloads.
AI workloads are becoming more CPU-friendly. In the long term, it’s not practical for GPUs to be the sole source of computing power for running AI workloads. We need AI development to be more accessible to engineers. We applaud Intel for its success with this latest improvement in lower-cost inference!