What is Intel AMX?

Intel Advanced Matrix Extensions (AMX) is an instruction set designed to improve AI inference performance on CPUs. It enhances the execution of matrix multiplication operations—a core component of many deep learning workloads—directly on Intel Xeon processors. AMX is part of Intel’s broader move to make CPUs more viable for AI inference by introducing architectural accelerations that can significantly improve throughput without relying on GPUs.

From a private infrastructure standpoint, support for AMX is relevant to organizations aiming to run AI workloads locally, without relying on public cloud GPUs. AMX enables high-throughput inference using CPUs already present in enterprise environments.

Hardware Support

Currently, 4th, 5th, and 6th generation Intel Xeon processors support the AMX instruction set. When selecting an OpenMetal server, choose a V3 or V4 server; our Medium, Large, and XL servers all support the AMX instruction set.

Supported AMX-enabled OpenMetal Processors (as of 02/21/2025):

  • Intel Xeon Gold 6526Y
  • Intel Xeon Gold 6430
  • Intel Xeon Gold 6530
  • Intel Xeon Gold 5416S
  • Intel Xeon Silver 4510

Software Support

System

Ensure that your system is running a recent Linux kernel; AMX requires version 5.19 or later. In our testing, we used Ubuntu 24.04 with kernel 6.8.0. Depending on your distribution, AMX may also require microcode updates or additional kernel configuration before it is exposed.
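
Before installing any inference stack, it is worth confirming that the kernel and CPU actually expose AMX. Below is a minimal sketch, assuming a Linux host where the relevant CPU feature flags appear in /proc/cpuinfo as amx_tile, amx_bf16, and amx_int8:

```python
# Minimal check that the kernel and CPU expose AMX (assumes a Linux host).
import platform

def has_amx() -> bool:
    """Return True if /proc/cpuinfo lists the AMX feature flags."""
    with open("/proc/cpuinfo") as f:
        flags = f.read()
    # amx_tile is the base tile feature; amx_bf16 and amx_int8 cover the supported data types.
    return all(flag in flags for flag in ("amx_tile", "amx_bf16", "amx_int8"))

print("Kernel:", platform.release())   # should be 5.19 or newer
print("AMX available:", has_amx())
```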

Inference Backend

You’ll need the right inference backend to ensure your workload actually runs with AMX. The three tools we had success with were llama.cpp, FastChat with the Intel Extension for PyTorch (IPEX), and OpenVINO.

Llama.cpp

A C++ inference framework for quantized LLMs. It supports CPU execution with high efficiency. Llama.cpp benefits from instruction-level optimizations when compiled for AMX-compatible CPUs.

  • A general-purpose inference tool that runs efficiently across various hardware.
  • Forms the backend for Ollama, a wrapper that simplifies model deployment.
  • https://github.com/ggml-org/llama.cpp 
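
If you prefer to drive llama.cpp from Python rather than its command-line tools, the community-maintained llama-cpp-python bindings wrap the same backend. The sketch below is illustrative only: the GGUF file path is a placeholder, and the AMX code path is used automatically when the underlying library was built for an AMX-capable CPU.

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is a placeholder; any GGUF quantization supported by your build works.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3.2-3b-q8_0.gguf", n_threads=32)  # match n_threads to physical cores
output = llm("Summarize what Intel AMX does in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```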

OpenVINO

Intel’s inference toolkit offering runtime optimization and quantization support. OpenVINO can deliver efficient performance on AMX-enabled CPUs through specific CPU plugins and execution providers.

  • Intel’s optimization toolkit for inference.
  • Provides the most comprehensive AI inference optimizations for Intel CPUs.
  • Requires more configuration compared to other tools but delivers the best performance when properly tuned.
  • https://github.com/openvinotoolkit/openvino.genai 
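
As a rough illustration of CPU inference with the openvino-genai package, here is a minimal sketch. It assumes you have already exported a model to OpenVINO IR format (for example with optimum-intel); the model directory name is a placeholder.

```python
# Minimal sketch with the openvino-genai package (pip install openvino-genai).
# "./llama-3.2-3b-ov" is a placeholder for a model already converted to OpenVINO IR.
import openvino_genai

pipe = openvino_genai.LLMPipeline("./llama-3.2-3b-ov", "CPU")  # the CPU plugin uses AMX when available
print(pipe.generate("Explain Intel AMX briefly.", max_new_tokens=64))
```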

FastChat with Intel Extension for PyTorch (IPEX)

IPEX enhances PyTorch’s CPU backend using Intel optimizations. With AMX support, IPEX can reduce inference time and increase tokens-per-second throughput by leveraging matrix multiplication acceleration.
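
Below is a minimal sketch of the IPEX optimization step, assuming a Hugging Face causal LM and a BF16-capable CPU; the model name is only an example, and FastChat serving sits on top of the same optimized model.

```python
# Minimal sketch: wrap a Hugging Face model with Intel Extension for PyTorch.
# The model name is an example; any causal LM you serve through FastChat works similarly.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = ipex.optimize(model, dtype=torch.bfloat16)  # the BF16 path is where AMX acceleration applies

inputs = tokenizer("Intel AMX accelerates", return_tensors="pt")
with torch.no_grad():
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```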


Performance Results

Benchmarks comparing 4th Gen and 5th Gen Intel Xeon processors show a 2x increase in throughput when AMX is enabled. Using quantized models such as Llama 3.2 3B (Q4_K_M or Q8_0 formats), we observed inference speeds of up to 57 tokens per second on AMX-enabled CPUs, compared to 28 tokens per second without AMX. These results were obtained using llama.cpp compiled with the appropriate flags.

We confirmed the results across different CPUs using llama.cpp with meta-llama/llama3.2:3B and deepseek.r1, both at Q8_0 quantization. With lower-bit quantization, you can push these results further; we achieved 80 t/s on an Intel Xeon Gold 6530 with 4-bit quantization!

Note: AMX natively supports only BF16 and INT8 data types. However, llama.cpp contributors have developed support for various other quantizations. Depending on the tool and the quantization method you choose, you may be limited in which quantization formats you can use. See this list for the AMX-supported quantization types: https://github.com/ggml-org/llama.cpp/blob/b4747/ggml/src/ggml-cpu/amx/common.h#L84

[Chart: Intel 5th Gen Xeon, AMX-enabled tokens per second]

CPU vs GPU Inference Context

While GPUs still dominate training and high-throughput inference, the landscape is shifting:

  • CPUs with AMX offer viable throughput for smaller, quantized models.
  • CPU inference is well-suited for intermittent workloads, low-latency requirements, and secure environments where GPUs are unavailable or cost-prohibitive.
  • Workload viability depends on target throughput. For instance, a CPU delivering 30 t/s can handle ~2,500 inference requests/day at 1,024 tokens per request, while a GPU like the A100 may achieve 130 t/s, supporting ~11,000 requests/day (see the quick calculation below).
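
The requests-per-day figures above come from simple arithmetic: tokens generated per day divided by tokens per request. A quick sanity check:

```python
# Rough capacity estimate: tokens generated per day divided by tokens per request.
SECONDS_PER_DAY = 86_400
TOKENS_PER_REQUEST = 1_024

for label, tps in [("CPU with AMX", 30), ("NVIDIA A100", 130)]:
    requests_per_day = tps * SECONDS_PER_DAY / TOKENS_PER_REQUEST
    print(f"{label}: ~{requests_per_day:,.0f} requests/day")
# CPU with AMX: ~2,531 requests/day
# NVIDIA A100: ~10,969 requests/day
```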

Comparison to GPUs and Virtualization Options

In OpenStack-based environments, AMX-enabled CPUs can offer an alternative to virtualized GPUs. When GPUs are used, time-slicing and Multi-Instance GPU (MIG) modes offer resource sharing options:

  • Time-Slicing: Allows multiple users to take turns using the GPU. Suitable for testing and development.
  • MIG: Creates isolated GPU instances, offering predictable performance with lower latency.

AMX-powered CPU inference complements these approaches, especially when workloads do not require full GPU performance or when resource constraints necessitate CPU-based execution.

Final Thoughts

It’s clear that AI is here to stay, and we’re excited about the progress that both software and hardware are making. AMX is another move in the right direction. The performance we saw, paired with modestly sized models, gives your team usable inference from the CPU; 40-50 tokens per second is more than enough for many workloads.

AI workloads are becoming more CPU-friendly. In the long term, it’s not practical for GPUs to be the sole source of computing power for running AI workloads. We need AI development to be more accessible to engineers. We applaud Intel for its success with this latest improvement in lower-cost inference!
