AI model performance is often measured using the metric tokens per second (TPS). This figure provides a standardized way to understand the speed of model inference across different hardware, models, and configurations.

Understanding Tokens

A token is a unit of text processed by the model. It may represent a word, part of a word, or even punctuation depending on the tokenizer used. For example:

  • “Hello” might be a single token.
  • “unbelievable” might be split into multiple tokens: “un”, “believ”, “able”.

Tokenization varies by model family, impacting token counts for the same input. This directly affects performance measurements since each token requires computation.

Tokens are fundamental to how language models operate: a model generates output one token at a time, predicting the most likely next token from the preceding context. What counts as a token depends on the tokenizer design, which may be based on byte-pair encoding (BPE), WordPiece, or similar algorithms that balance vocabulary size against sequence length.

For context:

  • A typical English word may average 1.3 to 2 tokens depending on complexity.
  • Punctuation and formatting, such as spaces or line breaks, are often separate tokens.
  • In code generation tasks, tokens may represent parts of keywords, operators, or symbols.

Because token counts directly influence compute time, understanding how your dataset tokenizes is critical for estimating cost and performance.
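
To see how your own data tokenizes, you can load a model's tokenizer and count tokens directly. The sketch below uses the Hugging Face transformers library; the "gpt2" tokenizer is only an illustrative choice, so substitute the tokenizer of the model you plan to serve.

```python
# Count tokens for sample strings with a model's own tokenizer.
# "gpt2" is an illustrative tokenizer; swap in the model you plan to serve.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["Hello", "unbelievable", "def add(a, b): return a + b"]:
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {len(tokens)} tokens: {tokens}")
```

Running this over a representative sample of your prompts gives a realistic tokens-per-request figure to plug into cost and TPS estimates.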

Measuring Tokens per Second

Tokens per second is the number of tokens a model can process (input and/or output) per second during inference. It is influenced by several factors:

  • Model size (number of parameters)
  • Hardware (GPU vs. CPU)
  • Software optimizations (quantization, inference engines)

When evaluating TPS, it is important to distinguish between:

  • Prompt (input) tokens per second — how fast the model reads and processes the input.
  • Eval (output) tokens per second — how fast the model generates responses.

Output token generation is compute-intensive and typically much slower. Measuring only input TPS can therefore create a misleading impression of performance, since input handling is far faster than generation. In our measurements, prompt tokens per second can be as much as 10x higher than eval tokens per second, though the size of the gap varies by model and hardware.

An easy way for a benchmark to skew results is to report a single total tokens-per-second figure measured with a large prompt and a small output.
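
One way to keep the two figures separate is to read the per-request timing fields that some inference servers expose. The sketch below assumes a local Ollama instance with a Llama 3.2 model pulled; Ollama reports prompt_eval_count/prompt_eval_duration and eval_count/eval_duration (durations in nanoseconds) for each request.

```python
# Compute prompt TPS, eval TPS, and the blended "total TPS" from Ollama's stats.
# Assumes Ollama is running locally and the "llama3.2" model has been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Summarize what tokens per second measures.",
        "stream": False,
    },
    timeout=300,
).json()

prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
eval_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
total_tokens = resp["prompt_eval_count"] + resp["eval_count"]
total_tps = total_tokens / (resp["total_duration"] / 1e9)

print(f"prompt TPS:  {prompt_tps:.1f}")
print(f"eval TPS:    {eval_tps:.1f}")
print(f"blended TPS: {total_tps:.1f}  (inflated when the prompt is large and the output is small)")
```

Reporting the three numbers side by side makes it obvious when a blended figure is hiding a slow generation rate.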

Why the Difference Matters

In many workloads, the input and output sizes are far from balanced. For instance:

  • A chatbot might process a 100-token prompt but generate a 500-token response.
  • Code completion might process a large input context but output only a few tokens.

In these cases, measuring eval TPS provides a more realistic view of latency and throughput than input TPS. Benchmarking systems should clearly state which metric they use.
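
A small worked example makes the gap concrete. The rates below (500 prompt TPS, 50 eval TPS) are illustrative assumptions, not measurements:

```python
# Illustrative rates only: assume prompt processing at 500 TPS and generation at 50 TPS.
PROMPT_TPS, EVAL_TPS = 500.0, 50.0

def blended_tps(prompt_tokens, output_tokens):
    seconds = prompt_tokens / PROMPT_TPS + output_tokens / EVAL_TPS
    return (prompt_tokens + output_tokens) / seconds

print(blended_tps(100, 500))   # chatbot-style request: ~59 TPS, close to the eval rate
print(blended_tps(2000, 20))   # completion-style request: ~459 TPS, ~9x the eval rate
```

The second scenario reports a blended figure roughly nine times the true generation rate, which is why the eval number should always be stated on its own.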

Model Sizes and Impact on Performance

Model size, measured in parameters, significantly affects performance. 

Model Example    Parameters    Typical Eval TPS (Intel Xeon Gold 6530)    Typical Eval TPS (A100)
Llama 3.2 3B     3 Billion     ~52 TPS                                    ~171 TPS
Llama 3.1 8B     8 Billion     ~33 TPS                                    ~122 TPS

Larger models generate higher-quality outputs but demand more compute power. Smaller models or quantized versions offer better TPS but may reduce output accuracy. If your use case can tolerate lower-quality responses, running your workloads on a CPU can be a cost-effective alternative.

Quantization and Performance

Quantization reduces the precision of model weights, typically from 16-bit floating point (FP16) to lower-precision formats such as 8-bit or 4-bit. This is a common strategy for improving AI model inference performance, especially on CPUs or edge devices with limited resources.

Benefits of Quantization:

  • Faster Inference: Lower precision reduces the computational load, enabling faster token generation.
  • Reduced Memory Usage: Quantized models consume less memory, allowing larger models to fit on limited hardware.
  • Lower Power Consumption: The reduced compute demand translates to lower energy usage, which is critical for large-scale deployments.

Quantization Levels and Types:

  • INT8 Quantization: One of the most common formats for production environments. Balances speed and acceptable accuracy loss.
  • 4-bit Quantization (FP4, INT4): Emerging methods that further compress models. Useful for specialized workloads but may introduce accuracy challenges.
  • Dynamic vs. Static Quantization: Dynamic quantization quantizes activations on the fly during inference, while static quantization calibrates and converts them ahead of time, which typically offers better performance.

Trade-Offs:

Quantization may lead to slight accuracy degradation, especially on tasks requiring nuanced understanding or precise outputs. The extent of impact depends on the model architecture, the quantization method, and the dataset.

Quantized models are widely used when deploying large language models on CPUs or when running AI workloads in private clouds where resource efficiency is a priority.
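
As a sketch of what quantized deployment can look like, the example below loads 4-bit (NF4) weights through the Hugging Face transformers BitsAndBytesConfig path. The model ID is illustrative, and this particular route requires a CUDA-capable GPU; CPU deployments more commonly use pre-quantized GGUF files served by llama.cpp or Ollama.

```python
# Minimal sketch: load a model with 4-bit (NF4) weights via bitsandbytes.
# Requires a CUDA GPU; the model ID below is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat weights
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in FP16
)

model_id = "meta-llama/Llama-3.2-3B"  # assumed; substitute your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```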

Inferencing Tools and Optimization

Several software tools help optimize inference performance. They maximize hardware utilization, reduce latency, and improve throughput, letting models run more efficiently on different hardware platforms so organizations can lower costs, increase scalability, and meet performance targets for demanding workloads (a short example follows the list):

  • OpenVINO: Intel’s toolkit that boosts AI workloads on CPUs, offering up to 2x speed improvement by using Intel’s AMX, an instruction set designed to improve inference performance.
  • Intel Extension for PyTorch (IPEX): A newer optimization layer for PyTorch models running on Intel CPUs, achieving up to 25% gains in tests.
  • NVIDIA TensorRT: For GPU inference optimization, improving throughput for large models.
  • Hugging Face Optimum: Integrates various backends for model acceleration.
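
For example, the sketch below exports a Hugging Face model to OpenVINO for CPU inference through Optimum's optimum-intel package; the model ID is illustrative.

```python
# Sketch: CPU inference through OpenVINO via Hugging Face Optimum (optimum-intel).
# export=True converts the PyTorch checkpoint to OpenVINO IR; the model ID is an example.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"  # assumed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("Tokens per second measures", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```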

Measuring Performance in Real-World Scenarios

Concurrency Considerations

Concurrent inference requests affect model performance, particularly on shared hardware:

  • On CPUs, high concurrency leads to contention for cores and memory bandwidth.
  • On GPUs, concurrency is limited by available CUDA cores and VRAM.

Planning for concurrency is essential in multi-user environments or APIs that require rapid response times.
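
A simple way to gauge this is to sweep concurrency against the endpoint you actually serve and watch aggregate eval TPS. The sketch below assumes an OpenAI-compatible server such as vLLM listening locally; the URL, model name, and prompt are placeholders.

```python
# Concurrency sweep against an OpenAI-compatible /v1/completions endpoint (e.g. vLLM).
# URL, model name, and prompt are placeholders; adjust for your deployment.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/completions"  # assumed local inference server
PAYLOAD = {"model": "llama-3.2-3b", "prompt": "Explain tokens per second.", "max_tokens": 128}

def one_request(_):
    r = requests.post(URL, json=PAYLOAD, timeout=300)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

for concurrency in (1, 4, 8, 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        completion_tokens = sum(pool.map(one_request, range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency:>2}  aggregate eval TPS ~ {completion_tokens / elapsed:.1f}")
```

Per-request latency typically rises as concurrency grows, while aggregate throughput climbs until cores, memory bandwidth, or VRAM become the bottleneck.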

Model Size and Token Ratio Impact

The ratio of input tokens to output tokens varies by use case:

  • Conversational AI often has a low input-to-output ratio (short prompts, long responses).
  • Search or summarization tasks may have large input contexts but produce small outputs.

Measuring eval TPS ensures the most compute-heavy part of the process is tested. Benchmarking should also document the token ratio used in tests.

Use-Case Specific Targets

Real-world requirements vary:

  • Code completion needs low latency and may target 50–100 TPS per user.
  • Batch document summarization tolerates slower TPS but requires throughput efficiency.
  • RAG pipelines combining search and generation may mix large inputs and long outputs, stressing both prompt and eval performance.

Conclusion

Accurately measuring AI model performance requires a focus on tokens per second, specifically output generation rates. Understanding tokenization, model size, quantization, and inference tool selection is essential for comparing hardware and software environments. Clear benchmarks enable better capacity planning, especially when deploying AI workloads on private clouds or bare metal.

