AI model performance is often measured using the metric tokens per second (TPS). This figure provides a standardized way to understand the speed of model inference across different hardware, models, and configurations.

Understanding Tokens

A token is a unit of text processed by the model. It may represent a word, part of a word, or even punctuation depending on the tokenizer used. For example:

  • “Hello” might be a single token.
  • “unbelievable” might be split into multiple tokens: “un”, “believ”, “able”.

Tokenization varies by model family, impacting token counts for the same input. This directly affects performance measurements since each token requires computation.

Tokens are fundamental to how language models operate: a model generates output one token at a time, predicting the most likely next token from the preceding context. What counts as a token depends on the tokenizer design, which may be based on byte-pair encoding (BPE), WordPiece, or similar algorithms that balance vocabulary size against sequence length.

For context:

  • A typical English word may average 1.3 to 2 tokens depending on complexity.
  • Punctuation and formatting, such as spaces or line breaks, are often separate tokens.
  • In code generation tasks, tokens may represent parts of keywords, operators, or symbols.

Because token counts directly influence compute time, understanding how your dataset tokenizes is critical for estimating cost and performance.
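
To see how your own data tokenizes, you can load a model's tokenizer and count tokens directly. The sketch below uses the Hugging Face transformers library; the "gpt2" tokenizer is only an illustrative choice, so substitute the tokenizer of the model you plan to serve.

```python
# Count tokens for sample strings with a model's own tokenizer.
# "gpt2" is an illustrative tokenizer; swap in the model you plan to serve.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Hello", "unbelievable", "def add(a, b): return a + b"]:
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```

Running this over a representative sample of your prompts gives a realistic tokens-per-request figure to plug into cost and TPS estimates.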

Measuring Tokens per Second

Tokens per second is the number of tokens a model can process (input and/or output) per second during inference. It is influenced by several factors:

  • Model size (number of parameters)
  • Hardware (GPU vs. CPU)
  • Software optimizations (quantization, inference engines)

When evaluating TPS, it is important to distinguish between:

  • Prompt (input) tokens per second — how fast the model reads and processes the input.
  • Eval (output) tokens per second — how fast the model generates responses.

Output token generation is compute-intensive and typically much slower. Measuring only input TPS can therefore create a misleading impression of performance, since input handling is far faster than generation. In our measurements, prompt tokens per second can be as much as 10x higher than eval tokens per second, though the size of the gap varies by model and hardware.

An easy way for a benchmark to skew results is to report a single total tokens-per-second figure measured with a large prompt and a small output.
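
One way to keep the two figures separate is to read the per-request timing fields that some inference servers expose. The sketch below assumes a local Ollama instance with a Llama 3.2 model pulled; Ollama reports prompt_eval_count/prompt_eval_duration and eval_count/eval_duration (durations in nanoseconds) for each request.

```python
# Compute prompt TPS, eval TPS, and the blended "total TPS" from Ollama's stats.
# Assumes Ollama is running locally and the "llama3.2" model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Summarize what tokens per second measures.",
        "stream": False,
    },
    timeout=300,
).json()

prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
eval_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
total_tokens = resp["prompt_eval_count"] + resp["eval_count"]
total_tps = total_tokens / (resp["total_duration"] / 1e9)

print(f"prompt TPS:  {prompt_tps:.1f}")
print(f"eval TPS:    {eval_tps:.1f}")
print(f"blended TPS: {total_tps:.1f}  (inflated when the prompt is large and the output is small)")
```

Reporting the three numbers side by side makes it obvious when a blended figure is hiding a slow generation rate.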

Why the Difference Matters

In many workloads, the input and output sizes are far from balanced. For instance:

  • A chatbot might process a 100-token prompt but generate a 500-token response.
  • Code completion might process a large input context but output only a few tokens.

In these cases, measuring eval TPS provides a more realistic view of latency and throughput than input TPS. Benchmarking systems should clearly state which metric they use.
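
A small worked example makes the gap concrete. The rates below (500 prompt TPS, 50 eval TPS) are illustrative assumptions, not measurements:

```python
# Illustrative rates only: assume prompt processing at 500 TPS and generation at 50 TPS.
PROMPT_TPS, EVAL_TPS = 500.0, 50.0

def blended_tps(prompt_tokens, output_tokens):
    seconds = prompt_tokens / PROMPT_TPS + output_tokens / EVAL_TPS
    return (prompt_tokens + output_tokens) / seconds

print(blended_tps(100, 500))   # chatbot-style request: ~59 TPS, close to the eval rate
print(blended_tps(2000, 20))   # completion-style request: ~459 TPS, ~9x the eval rate
```

The second scenario reports a blended figure roughly nine times the true generation rate, which is why the eval number should always be stated on its own.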

Model Sizes and Impact on Performance

Model size, measured in parameters, significantly affects performance. 

Model Example    Parameters    Typical Eval TPS (Intel Xeon Gold 6530)    Typical Eval TPS (A100)
Llama 3.2 3B     3 Billion     ~52 TPS                                    ~171 TPS
Llama 3.1 8B     8 Billion     ~33 TPS                                    ~122 TPS

Larger models generate higher-quality outputs but demand more compute power. Smaller models or quantized versions offer better TPS but may reduce output accuracy. If your use case can tolerate lower-quality responses, running your workloads on a CPU can be a cost-effective alternative.

Quantization and Performance

Quantization reduces the precision of model weights, typically from 16-bit floating point (FP16) to lower-precision formats such as 8-bit or 4-bit. This is a common strategy for improving AI model inference performance, especially on CPUs or edge devices with limited resources.

Benefits of Quantization:

  • Faster Inference: Lower precision reduces the computational load, enabling faster token generation.
  • Reduced Memory Usage: Quantized models consume less memory, allowing larger models to fit on limited hardware.
  • Lower Power Consumption: The reduced compute demand translates to lower energy usage, which is critical for large-scale deployments.

Quantization Levels and Types:

  • INT8 Quantization: One of the most common formats for production environments. Balances speed and acceptable accuracy loss.
  • 4-bit Quantization (FP4, INT4): Emerging methods that further compress models. Useful for specialized workloads but may introduce accuracy challenges.
  • Dynamic vs. Static Quantization: Dynamic quantization quantizes activations on the fly during inference, while static quantization calibrates and converts them ahead of time, which typically offers better performance.

Trade-Offs:

Quantization may lead to slight accuracy degradation, especially on tasks requiring nuanced understanding or precise outputs. The extent of impact depends on the model architecture, the quantization method, and the dataset.

Quantized models are widely used when deploying large language models on CPUs or when running AI workloads in private clouds where resource efficiency is a priority.
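
As a sketch of what quantized deployment can look like, the example below loads 4-bit (NF4) weights through the Hugging Face transformers BitsAndBytesConfig path. The model ID is illustrative, and this particular route requires a CUDA-capable GPU; CPU deployments more commonly use pre-quantized GGUF files served by llama.cpp or Ollama.

```python
# Minimal sketch: load a model with 4-bit (NF4) weights via bitsandbytes.
# Requires a CUDA GPU; the model ID below is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in FP16
)

model_id = "meta-llama/Llama-3.2-3B"  # assumed; substitute your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```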

Inferencing Tools and Optimization

Several software tools help optimize inference performance. They maximize hardware utilization, reduce latency, and improve throughput, letting models run more efficiently on different hardware platforms so organizations can lower costs, increase scalability, and meet performance targets for demanding workloads (a short example follows the list):

  • OpenVINO: Intel’s toolkit that boosts AI workloads on CPUs, offering up to 2x speed improvement by using Intel’s AMX, an instruction set designed to improve inference performance.
  • Intel Extension for PyTorch (IPEX): A newer optimization layer for PyTorch models running on Intel CPUs, achieving up to 25% gains in tests.
  • NVIDIA TensorRT: For GPU inference optimization, improving throughput for large models.
  • Hugging Face Optimum: Integrates various backends for model acceleration.
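
For example, the sketch below exports a Hugging Face model to OpenVINO for CPU inference through Optimum's optimum-intel package; the model ID is illustrative.

```python
# Sketch: CPU inference through OpenVINO via Hugging Face Optimum (optimum-intel).
# export=True converts the PyTorch checkpoint to OpenVINO IR; the model ID is an example.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Tokens per second measures", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```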

Measuring Performance in Real-World Scenarios

Concurrency Considerations

Concurrent inference requests affect model performance, particularly on shared hardware:

  • On CPUs, high concurrency leads to contention for cores and memory bandwidth.
  • On GPUs, concurrency is limited by available CUDA cores and VRAM.

Planning for concurrency is essential in multi-user environments or APIs that require rapid response times.
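
A simple way to gauge this is to sweep concurrency against the endpoint you actually serve and watch aggregate eval TPS. The sketch below assumes an OpenAI-compatible server such as vLLM listening locally; the URL, model name, and prompt are placeholders.

```python
# Concurrency sweep against an OpenAI-compatible /v1/completions endpoint (e.g. vLLM).
# URL, model name, and prompt are placeholders; adjust for your deployment.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed local inference server
PAYLOAD = {"model": "llama-3.2-3b", "prompt": "Explain tokens per second.", "max_tokens": 128}

def one_request(_):
    r = requests.post(URL, json=PAYLOAD, timeout=300)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

for concurrency in (1, 4, 8, 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        completion_tokens = sum(pool.map(one_request, range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency:>2}  aggregate eval TPS ~ {completion_tokens / elapsed:.1f}")
```

Per-request latency typically rises as concurrency grows, while aggregate throughput climbs until cores, memory bandwidth, or VRAM become the bottleneck.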

Model Size and Token Ratio Impact

The ratio of input tokens to output tokens varies by use case:

  • Conversational AI often has a low input-to-output ratio (short prompts, long responses).
  • Search or summarization tasks may have large input contexts but produce small outputs.

Measuring eval TPS ensures the most compute-heavy part of the process is tested. Benchmarking should also document the token ratio used in tests.

Use-Case Specific Targets

Real-world requirements vary:

  • Code completion needs low latency and may target 50–100 TPS per user.
  • Batch document summarization tolerates slower TPS but requires throughput efficiency.
  • RAG pipelines combining search and generation may mix large inputs and long outputs, stressing both prompt and eval performance.

Conclusion

Accurately measuring AI model performance requires a focus on tokens per second, specifically output generation rates. Understanding tokenization, model size, quantization, and inference tool selection is essential for comparing hardware and software environments. Clear benchmarks enable better capacity planning, especially when deploying AI workloads on private clouds or bare metal.

