AI model performance is often measured using the metric tokens per second (TPS). This figure provides a standardized way to understand the speed of model inference across different hardware, models, and configurations.

Understanding Tokens

A token is a unit of text processed by the model. It may represent a word, part of a word, or even punctuation depending on the tokenizer used. For example:

  • “Hello” might be a single token.
  • “unbelievable” might be split into multiple tokens: “un”, “believ”, “able”.

Tokenization varies by model family, impacting token counts for the same input. This directly affects performance measurements since each token requires computation.

In language models, tokens are essential to how models operate. Models generate output one token at a time, predicting the most likely next token based on the previous context. The definition of a token depends on the tokenizer design, which may be based on byte-pair encoding (BPE), WordPiece, or similar algorithms that balance vocabulary size and sequence length.

For context:

  • A typical English word may average 1.3 to 2 tokens depending on complexity.
  • Punctuation and formatting, such as spaces or line breaks, are often separate tokens.
  • In code generation tasks, tokens may represent parts of keywords, operators, or symbols.

Because token counts directly influence compute time, understanding how your dataset tokenizes is critical for estimating cost and performance.
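To see this in practice, you can count tokens directly. The sketch below assumes the Hugging Face transformers library and uses the GPT-2 tokenizer purely as an example; other model families will split the same text differently.

```python
# Sketch: counting tokens for a few inputs with one specific tokenizer.
# Assumes the Hugging Face `transformers` package; GPT-2 is used only as
# an example -- other tokenizers will produce different splits.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Hello", "unbelievable", "for i in range(10):"]:
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```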

Measuring Tokens per Second

Tokens per second is the number of tokens a model can process (input and/or output) per second during inference. It is influenced by several factors:

  • Model size (number of parameters)
  • Hardware (GPU vs. CPU)
  • Software optimizations (quantization, inference engines)

When evaluating TPS, it is important to distinguish between:

  • Prompt (input) tokens per second — how fast the model reads and processes the input.
  • Eval (output) tokens per second — how fast the model generates responses.

Output token generation is compute-intensive and typically slower. Measuring only input TPS can give a misleading impression of performance, since input handling is much faster than generation. In our measurements, prompt tokens per second can be as much as 10x higher than eval tokens per second, though the gap varies by model and hardware.

An easy way for a benchmark to skew results is to report a single combined tokens-per-second figure measured with a large prompt and a small output, letting fast prompt processing inflate the number.
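To avoid that skew, report prompt and eval throughput separately. The sketch below uses made-up token counts and timings for a single request to show how a blended figure hides the slower generation phase.

```python
# Sketch: separating prompt TPS from eval TPS for one request.
# The token counts and timings below are illustrative, not measurements.
def tokens_per_second(tokens: int, seconds: float) -> float:
    return tokens / seconds

prompt_tokens, prompt_seconds = 2048, 1.8   # reading a large prompt
eval_tokens, eval_seconds = 64, 1.6         # generating a small output

prompt_tps = tokens_per_second(prompt_tokens, prompt_seconds)   # ~1138
eval_tps = tokens_per_second(eval_tokens, eval_seconds)         # ~40
blended_tps = tokens_per_second(prompt_tokens + eval_tokens,
                                prompt_seconds + eval_seconds)  # ~621

print(f"prompt TPS:  {prompt_tps:.0f}")
print(f"eval TPS:    {eval_tps:.0f}")
print(f"blended TPS: {blended_tps:.0f}  # looks ~15x better than eval alone")
```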

Why the Difference Matters

In many workloads, especially text generation, the output size far exceeds the input. For instance:

  • A chatbot might process a 100-token prompt but generate a 500-token response.
  • Code completion might process a large input context but output only a few tokens.

In these cases, measuring eval TPS provides a more realistic view of latency and throughput than input TPS. Benchmarking systems should clearly state which metric they use.
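As a worked example for the chatbot case above, assume prompt processing runs roughly 10x faster than generation, in line with the gap described earlier; the rates here are illustrative, not measured.

```python
# Worked example: 100-token prompt, 500-token response.
# Assumed rates: prompt processing ~10x faster than generation.
prompt_tokens, output_tokens = 100, 500
prompt_tps, eval_tps = 500.0, 50.0   # illustrative values only

prompt_time = prompt_tokens / prompt_tps   # 0.2 s spent reading the prompt
eval_time = output_tokens / eval_tps       # 10.0 s spent generating the reply
share = eval_time / (prompt_time + eval_time)
print(f"generation accounts for {share:.0%} of the request time")  # ~98%
```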

Model Sizes and Impact on Performance

Model size, measured in parameters, significantly affects performance. 

| Model Example | Parameters | Typical Eval TPS (Intel Xeon Gold 6530) | Typical Eval TPS (A100) |
|---------------|------------|-----------------------------------------|-------------------------|
| Llama 3.2 3B  | 3 billion  | ~52 TPS                                 | ~171 TPS                |
| Llama 3.1 8B  | 8 billion  | ~33 TPS                                 | ~122 TPS                |

Larger models generate higher-quality outputs but demand more compute power. Smaller models or quantized versions offer better TPS but may reduce output accuracy. If your use case can tolerate lower-quality responses, a cost-effective alternative is to run your workloads on a CPU.
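Using the eval TPS figures from the table above, a rough latency estimate for a 500-token response looks like this (treat the numbers as approximate; real results depend on the full configuration).

```python
# Back-of-the-envelope latency from eval TPS, using the table's figures.
def response_seconds(output_tokens: int, eval_tps: float) -> float:
    return output_tokens / eval_tps

output_tokens = 500  # e.g. a long chatbot reply
for label, tps in [("Llama 3.1 8B, Xeon Gold 6530", 33),
                   ("Llama 3.1 8B, A100", 122)]:
    print(f"{label}: ~{response_seconds(output_tokens, tps):.0f} s")
```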

Quantization and Performance

Quantization reduces the precision of model weights, typically from 16-bit floating point (FP16) to smaller formats such as 8-bit (INT8) or 4-bit. This is a common strategy for improving AI model inference performance, especially on CPUs or edge devices with limited resources.

Benefits of Quantization:

  • Faster Inference: Lower precision reduces the computational load, enabling faster token generation.
  • Reduced Memory Usage: Quantized models consume less memory, allowing larger models to fit on limited hardware.
  • Lower Power Consumption: The reduced compute demand translates to lower energy usage, which is critical for large-scale deployments.

Quantization Levels and Types:

  • INT8 Quantization: One of the most common formats for production environments. Balances speed and acceptable accuracy loss.
  • 4-bit Quantization (FP4, INT4): Emerging methods that further compress models. Useful for specialized workloads but may introduce accuracy challenges.
  • Dynamic vs. Static Quantization: Dynamic quantization quantizes activations on the fly at inference time, while static quantization calibrates and quantizes them ahead of time, typically offering better performance.

Trade-Offs:

Quantization may lead to slight accuracy degradation, especially on tasks requiring nuanced understanding or precise outputs. The extent of impact depends on the model architecture, the quantization method, and the dataset.

Quantized models are widely used when deploying large language models on CPUs or when running AI workloads in private clouds where resource efficiency is a priority.
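A quick way to see the memory benefit is to estimate weight storage at different precisions. The sketch below counts weights only; the KV cache and activations add overhead on top, so treat the figures as lower bounds.

```python
# Rough weight-memory estimate at different precisions (weights only).
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"8B parameters @ {fmt}: ~{weight_memory_gb(8, bits):.0f} GB")
```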

Inferencing Tools and Optimization

Several software tools help optimize performance. These tools maximize hardware utilization, reduce latency, and improve throughput during AI model inference. They allow models to run more efficiently on different hardware platforms, helping organizations lower costs, increase scalability, and meet performance targets for demanding workloads:

  • OpenVINO: Intel’s toolkit that boosts AI workloads on CPUs, offering up to 2x speed improvement by using Intel’s AMX (Advanced Matrix Extensions), an instruction set designed to improve inference performance.
  • Intel Extension for PyTorch (IPEX): A newer optimization layer for PyTorch models running on Intel CPUs, achieving up to 25% gains in tests.
  • NVIDIA TensorRT: For GPU inference optimization, improving throughput for large models.
  • Hugging Face Optimum: Integrates various backends for model acceleration.
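As one example from the list above, IPEX is typically applied as a thin wrapper around an existing PyTorch model. The sketch below assumes the torch, intel_extension_for_pytorch, and transformers packages are installed; the model and dtype are illustrative, not a benchmark recipe.

```python
# Sketch: applying Intel Extension for PyTorch (IPEX) to a Hugging Face model.
# Illustrative only -- exact arguments depend on the IPEX and PyTorch versions.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies CPU-specific graph and operator optimizations
# (including AMX code paths on supported Xeon CPUs) without changing the API.
model = ipex.optimize(model, dtype=torch.bfloat16)
```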

Measuring Performance in Real-World Scenarios

Concurrency Considerations

Concurrent inference requests affect model performance, particularly on shared hardware:

  • On CPUs, high concurrency leads to contention for cores and memory bandwidth.
  • On GPUs, concurrency is limited by available CUDA cores and VRAM.

Planning for concurrency is essential in multi-user environments or APIs that require rapid response times.
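A simple way to quantify the effect is to compare per-request eval TPS against aggregate throughput under load. In the sketch below, generate is a hypothetical placeholder for a call to your inference endpoint.

```python
# Sketch: per-request vs. aggregate eval TPS under concurrent load.
# `generate` is a hypothetical stand-in for a real inference call.
import concurrent.futures
import random
import time

def generate(prompt: str) -> tuple[int, float]:
    start = time.perf_counter()
    time.sleep(random.uniform(1.0, 2.0))     # stand-in for real generation work
    return 256, time.perf_counter() - start  # (output tokens, seconds)

prompts = [f"request {i}" for i in range(8)]
wall_start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(generate, prompts))
wall_seconds = time.perf_counter() - wall_start

per_request = [tokens / seconds for tokens, seconds in results]
aggregate = sum(tokens for tokens, _ in results) / wall_seconds
print(f"mean per-request eval TPS: {sum(per_request) / len(per_request):.0f}")
print(f"aggregate eval TPS under load: {aggregate:.0f}")
```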

Model Size and Token Ratio Impact

The ratio of input tokens to output tokens varies by use case:

  • Conversational AI often has a low input-to-output ratio (short prompts, long responses).
  • Search or summarization tasks may have large input contexts but produce small outputs.

Measuring eval TPS ensures the most compute-heavy part of the process is tested. Benchmarking should also document the token ratio used in tests.
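One lightweight way to document that ratio is to record it alongside every result, so runs are only compared when their token mix matches. The field names and most values below are illustrative.

```python
# Sketch: a benchmark record that documents the token ratio it was run with.
benchmark_run = {
    "model": "llama-3.1-8b",
    "hardware": "A100",
    "prompt_tokens": 1024,
    "output_tokens": 256,
    "input_output_ratio": 1024 / 256,  # 4:1, prompt-heavy
    "prompt_tps": 1100,                # illustrative
    "eval_tps": 122,                   # matches the table above for the A100
}
print(benchmark_run)
```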

Use-Case Specific Targets

Real-world requirements vary:

  • Code completion needs low latency and may target 50–100 TPS per user.
  • Batch document summarization tolerates slower TPS but requires throughput efficiency.
  • RAG pipelines combining search and generation may mix large inputs and long outputs, stressing both prompt and eval performance.
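These targets can feed directly into capacity planning. The sketch below assumes a deployment sustaining an aggregate eval throughput of 2,000 TPS (an assumed figure, not a measurement) and the 50–100 TPS per-user target from the list above.

```python
# Rough capacity estimate for code completion at 50-100 TPS per user.
aggregate_eval_tps = 2000          # assumed sustained throughput, not measured
per_user_low, per_user_high = 50, 100

max_users = aggregate_eval_tps // per_user_low    # users at the relaxed target
min_users = aggregate_eval_tps // per_user_high   # users at the strict target
print(f"supports roughly {min_users}-{max_users} concurrent users")
```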

Conclusion

Accurately measuring AI model performance requires a focus on tokens per second, specifically output generation rates. Understanding tokenization, model size, quantization, and inference tool selection is essential for comparing hardware and software environments. Clear benchmarks enable better capacity planning, especially when deploying AI workloads on private clouds or bare metal.
