AI model performance is often measured using the metric tokens per second (TPS). This figure provides a standardized way to understand the speed of model inference across different hardware, models, and configurations.

Understanding Tokens

A token is a unit of text processed by the model. It may represent a word, part of a word, or even punctuation depending on the tokenizer used. For example:

  • “Hello” might be a single token.
  • “unbelievable” might be split into multiple tokens: “un”, “believ”, “able”.
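
These splits are easy to inspect directly. The short sketch below uses the Hugging Face transformers library to show how a tokenizer breaks words apart. The model name is only an example (and access to it may be gated), so substitute any tokenizer you can load; the exact splits will vary by tokenizer.

```python
# Sketch: inspect how a tokenizer splits words into tokens.
# Requires the transformers library; the model name is an example only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

for word in ["Hello", "unbelievable"]:
    tokens = tokenizer.tokenize(word)
    print(f"{word!r} -> {tokens} ({len(tokens)} tokens)")
```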

Tokenization varies by model family, impacting token counts for the same input. This directly affects performance measurements since each token requires computation.

In language models, tokens are essential to how models operate. Models generate output one token at a time, predicting the most likely next token based on the previous context. The definition of a token depends on the tokenizer design, which may be based on byte-pair encoding (BPE), WordPiece, or similar algorithms that balance vocabulary size and sequence length.

For context:

  • A typical English word may average 1.3 to 2 tokens depending on complexity.
  • Punctuation and formatting, such as spaces or line breaks, are often separate tokens.
  • In code generation tasks, tokens may represent parts of keywords, operators, or symbols.

Because token counts directly influence compute time, understanding how your dataset tokenizes is critical for estimating cost and performance.
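
One practical way to do that is to run a sample of your data through the same tokenizer your model uses and look at the counts. A minimal sketch, assuming the transformers library, with a placeholder model name and placeholder texts:

```python
# Sketch: count how many tokens a sample of your dataset produces.
# The model name and sample texts are placeholders -- substitute your
# own tokenizer and data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

sample_texts = [
    "Summarize the following support ticket: the customer reports that ...",
    "Write a SQL query that returns the top ten customers by revenue.",
]

counts = [len(tokenizer.encode(text)) for text in sample_texts]
print(f"average prompt length: {sum(counts) / len(counts):.0f} tokens")
```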

Measuring Tokens per Second

Tokens per second is the number of tokens a model can process (input and/or output) per second during inference. It is influenced by several factors:

  • Model size (number of parameters)
  • Hardware (GPU vs. CPU)
  • Software optimizations (quantization, inference engines)

When evaluating TPS, it is important to distinguish between:

  • Prompt (input) tokens per second — how fast the model reads and processes the input.
  • Eval (output) tokens per second — how fast the model generates responses.

Output token generation is compute-intensive and typically much slower than prompt processing. Measuring only input TPS can therefore create a misleading impression of performance. From our measurements, prompt tokens per second can be as much as 10x higher than eval tokens per second, and the size of that gap varies by model and hardware.

An easy way for a benchmark to skew results is to report only a combined (total) tokens-per-second figure measured with a large prompt and a small output.
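
When instrumenting a benchmark yourself, one common approach is to record the time to the first output token (which roughly bounds prompt processing) and the total time for the streamed response, then compute the two rates separately. The sketch below uses placeholder measurements to illustrate how a combined figure can hide the difference:

```python
# Sketch: separate prompt TPS from eval TPS using time-to-first-token.
# The four measurements below would come from instrumenting a streaming
# inference call; the values here are placeholders.
prompt_tokens = 1024          # tokens in the input prompt
completion_tokens = 256       # tokens generated in the response
time_to_first_token = 0.8     # seconds from request start to first output token
total_time = 7.2              # seconds from request start to last output token

# Prompt processing roughly ends when the first output token appears.
prompt_tps = prompt_tokens / time_to_first_token
eval_tps = completion_tokens / (total_time - time_to_first_token)
combined_tps = (prompt_tokens + completion_tokens) / total_time

print(f"prompt TPS:   {prompt_tps:.0f}")
print(f"eval TPS:     {eval_tps:.0f}")
print(f"combined TPS: {combined_tps:.0f}")  # the figure that can mislead
```

With these placeholder numbers the combined figure lands near 180 TPS even though generation itself runs at only 40 TPS, which is exactly the kind of skew described above.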

Why the Difference Matters

In many workloads, especially text generation, the output size far exceeds the input. For instance:

  • A chatbot might process a 100-token prompt but generate a 500-token response.
  • Code completion might process a large input context but output only a few tokens.

In these cases, measuring eval TPS provides a more realistic view of latency and throughput than input TPS. Benchmarking systems should clearly state which metric they use.

Model Sizes and Impact on Performance

Model size, measured in parameters, significantly affects performance. 

Model Example | Parameters | Typical Eval TPS (Intel Xeon Gold 6530) | Typical Eval TPS (A100)
Llama 3.2 3B  | 3 Billion  | ~52 TPS                                 | ~171 TPS
Llama 3.1 8B  | 8 Billion  | ~33 TPS                                 | ~122 TPS

Larger models generate higher quality outputs but demand more compute power. Smaller models or quantized versions offer better TPS but may reduce output accuracy. If your use case can tolerate lower quality responses, a cost-effective alternative is to run your workloads on CPUs.

Quantization and Performance

Quantization reduces the precision of model weights, typically from 16-bit floating point (FP16) to smaller formats like 4-bit. This is a common strategy for improving AI model inference performance, especially on CPUs or edge devices with limited resources.

Benefits of Quantization:

  • Faster Inference: Lower precision reduces the computational load, enabling faster token generation.
  • Reduced Memory Usage: Quantized models consume less memory, allowing larger models to fit on limited hardware.
  • Lower Power Consumption: The reduced compute demand translates to lower energy usage, which is critical for large-scale deployments.

Quantization Levels and Types:

  • INT8 Quantization: One of the most common formats for production environments. Balances speed and acceptable accuracy loss.
  • 4-bit Quantization (FP4, INT4): Emerging methods that further compress models. Useful for specialized workloads but may introduce accuracy challenges.
  • Dynamic vs. Static Quantization: Dynamic quantization converts weights ahead of time but quantizes activations on the fly during inference, while static quantization converts both ahead of time using a calibration pass, typically offering better performance (see the sketch after this list).
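
As one concrete illustration, the sketch below applies PyTorch's built-in dynamic INT8 quantization to the linear layers of a toy model. The toy network is a stand-in; large language models are usually quantized with dedicated tooling (GPTQ, AWQ, bitsandbytes, llama.cpp formats), but the basic idea is the same:

```python
# Sketch: dynamic INT8 quantization of a model's linear layers with PyTorch.
# The toy model is a placeholder for a real network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 128),
)

# Weights are stored as INT8; activations are quantized on the fly at runtime.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same output shape, smaller and faster linear layers
```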

Trade-Offs:

Quantization may lead to slight accuracy degradation, especially on tasks requiring nuanced understanding or precise outputs. The extent of impact depends on the model architecture, the quantization method, and the dataset.

Quantized models are widely used when deploying large language models on CPUs or when running AI workloads in private clouds where resource efficiency is a priority.

Inferencing Tools and Optimization

Several software tools help optimize performance. These tools maximize hardware utilization, reduce latency, and improve throughput during AI model inference. They allow models to run more efficiently on different hardware platforms, helping organizations lower costs, increase scalability, and meet performance targets for demanding workloads:

  • OpenVINO: Intel’s toolkit that boosts AI workloads on CPUs, offering up to 2x speed improvement by using Intel AMX (Advanced Matrix Extensions), an instruction set designed to accelerate inference.
  • Intel Extension for PyTorch (IPEX): A newer optimization layer for PyTorch models running on Intel CPUs, achieving up to 25% gains in tests (a brief example follows this list).
  • NVIDIA TensorRT: For GPU inference optimization, improving throughput for large models.
  • Hugging Face Optimum: Integrates various backends for model acceleration.
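
As a small illustration of what adopting one of these tools looks like in practice, the sketch below optimizes a PyTorch model for Intel CPUs with IPEX. The model name and dtype are assumptions, and real deployments should follow the IPEX documentation for model-specific options:

```python
# Sketch: optimizing a Hugging Face model for CPU inference with the
# Intel Extension for PyTorch (IPEX). Requires torch, transformers, and
# intel_extension_for_pytorch; the model name is a placeholder.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"  # example; substitute your own model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX operator and graph optimizations for inference on Intel CPUs.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Tokens per second is", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```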

Measuring Performance in Real-World Scenarios

Concurrency Considerations

Concurrent inference requests affect model performance, particularly on shared hardware:

  • On CPUs, high concurrency leads to contention for cores and memory bandwidth.
  • On GPUs, concurrency is limited by available CUDA cores and VRAM.

Planning for concurrency is essential in multi-user environments or APIs that require rapid response times.
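
A simple way to see the effect is to fire requests against your inference endpoint at increasing concurrency levels and compare aggregate eval TPS. The sketch below uses a simulated request function as a placeholder; in a real test, contention for cores, memory bandwidth, or VRAM would keep throughput from scaling linearly.

```python
# Sketch: measuring aggregate throughput under increasing concurrency.
# run_inference() is a placeholder -- replace it with a real call to your
# inference server that returns the number of output tokens generated.
import time
from concurrent.futures import ThreadPoolExecutor

def run_inference(prompt: str) -> int:
    # Simulated request: pretend each call takes 2 seconds and produces
    # 256 output tokens. A real backend would show contention instead of
    # the perfect scaling this simulation implies.
    time.sleep(2.0)
    return 256

prompts = ["Explain tokens per second."] * 16

for concurrency in (1, 4, 8, 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        output_tokens = sum(pool.map(run_inference, prompts))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency:2d}: {output_tokens / elapsed:.1f} aggregate eval TPS")
```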

Model Size and Token Ratio Impact

The ratio of input tokens to output tokens varies by use case:

  • Conversational AI often has a low input-to-output ratio (short prompts, long responses).
  • Search or summarization tasks may have large input contexts but produce small outputs.

Measuring eval TPS ensures the most compute-heavy part of the process is tested. Benchmarking should also document the token ratio used in tests.
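
To see why the ratio matters, the short calculation below compares two workloads using the same assumed prompt and eval rates; the TPS figures are illustrative, not measurements.

```python
# Sketch: how the input/output token ratio shifts where time is spent.
# The TPS figures are illustrative assumptions, not benchmark results.
prompt_tps, eval_tps = 400.0, 40.0

workloads = {
    "chatbot (100 in / 500 out)":        (100, 500),
    "summarization (4000 in / 200 out)": (4000, 200),
}

for name, (prompt_tokens, output_tokens) in workloads.items():
    total = prompt_tokens / prompt_tps + output_tokens / eval_tps
    gen_share = (output_tokens / eval_tps) / total
    print(f"{name}: {total:.1f} s total, {gen_share:.0%} spent generating")
```

With these assumed rates, the chat workload spends nearly all of its time generating, while the summarization workload splits its time more evenly between reading and writing.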

Use-Case Specific Targets

Real-world requirements vary:

  • Code completion needs low latency and may target 50–100 TPS per user.
  • Batch document summarization tolerates slower TPS but requires throughput efficiency.
  • RAG pipelines combining search and generation may mix large inputs and long outputs, stressing both prompt and eval performance.
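
A quick way to translate such targets into hardware requirements is to work backwards from a latency budget. The numbers below are illustrative assumptions:

```python
# Sketch: back out the per-user eval TPS needed to hit a latency budget.
# Response length and latency budget are illustrative assumptions.
expected_output_tokens = 150    # e.g. a typical code completion suggestion
latency_budget_seconds = 2.0    # acceptable wait for the user

required_eval_tps = expected_output_tokens / latency_budget_seconds
print(f"need ~{required_eval_tps:.0f} eval TPS per user")  # ~75 TPS here
```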

Conclusion

Accurately measuring AI model performance requires a focus on tokens per second, specifically output generation rates. Understanding tokenization, model size, quantization, and inference tool selection is essential for comparing hardware and software environments. Clear benchmarks enable better capacity planning, especially when deploying AI workloads on private clouds or bare metal.
