AI model performance is often measured using the metric tokens per second (TPS). This figure provides a standardized way to understand the speed of model inference across different hardware, models, and configurations.
Understanding Tokens
A token is a unit of text processed by the model. It may represent a word, part of a word, or even punctuation depending on the tokenizer used. For example:
- “Hello” might be a single token.
- “unbelievable” might be split into multiple tokens: “un”, “believ”, “able”.
Tokenization varies by model family, impacting token counts for the same input. This directly affects performance measurements since each token requires computation.
In language models, tokens are essential to how models operate. Models generate output one token at a time, predicting the most likely next token based on the previous context. The definition of a token depends on the tokenizer design, which may be based on byte-pair encoding (BPE), WordPiece, or similar algorithms that balance vocabulary size and sequence length.
For context:
- A typical English word may average 1.3 to 2 tokens depending on complexity.
- Punctuation and formatting, such as spaces or line breaks, are often separate tokens.
- In code generation tasks, tokens may represent parts of keywords, operators, or symbols.
Because token counts directly influence compute time, understanding how your dataset tokenizes is critical for estimating cost and performance.
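As an illustration, the short sketch below runs a few strings through a Hugging Face tokenizer to show how token counts differ by input. The `gpt2` tokenizer is used only because it downloads without authentication; substitute the tokenizer of the model you actually plan to benchmark.

```python
# Sketch: inspect how sample text tokenizes before estimating cost or performance.
# Requires: pip install transformers
from transformers import AutoTokenizer

# "gpt2" is used here because it is freely downloadable; swap in your target model's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

samples = [
    "Hello",
    "unbelievable",
    "for i in range(10):\n    print(i)",
]

for text in samples:
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    print(f"{len(token_ids)} tokens -> {tokens}   (input: {text!r})")
```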
Measuring Tokens per Second
Tokens per second is the number of tokens a model can process (input and/or output) per second during inference. It is influenced by several factors:
- Model size (number of parameters)
- Hardware (GPU vs. CPU)
- Software optimizations (quantization, inference engines)
When evaluating TPS, it is important to distinguish between:
- Prompt (input) tokens per second — how fast the model reads and processes the input.
- Eval (output) tokens per second — how fast the model generates responses.
Output token generation is compute-intensive and typically slower. Measuring only input TPS can create a misleading impression of performance, since input handling is much faster than generation. From our measurements, prompt tokens per second can be as much as 10x higher than eval tokens per second, and the size of that gap varies by model and hardware.
An easy way for a benchmark to skew results is to report a single total tokens-per-second figure from a run with a large prompt and a small output, letting fast prompt processing inflate the number.
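To make this concrete, the sketch below derives prompt and eval TPS separately from a single request's timing data. It assumes a local Ollama server, whose /api/generate response reports token counts and durations (in nanoseconds) for the prompt and generation phases; the model name is just an example.

```python
# Sketch: derive prompt vs. eval TPS from an inference server's timing data.
# Assumes a local Ollama server (http://localhost:11434) with the model already pulled;
# field names follow Ollama's /api/generate response, with durations in nanoseconds.
# Requires: pip install requests
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:3b",  # example model tag
        "prompt": "Explain tokens per second in two sentences.",
        "stream": False,
    },
    timeout=300,
)
data = resp.json()

prompt_tps = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
eval_tps = data["eval_count"] / (data["eval_duration"] / 1e9)

print(f"Prompt (input) TPS: {prompt_tps:.1f}")
print(f"Eval (output) TPS:  {eval_tps:.1f}")
# Report these separately; a single combined figure hides the slower generation rate.
```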
Why the Difference Matters
In many workloads, especially text generation, the output size far exceeds the input. For instance:
- A chatbot might process a 100-token prompt but generate a 500-token response.
- Code completion might process a large input context but output only a few tokens.
In these cases, measuring eval TPS provides a more realistic view of latency and throughput than input TPS. Benchmarking systems should clearly state which metric they use.
Model Sizes and Impact on Performance
Model size, measured in parameters, significantly affects performance.
| Model Example | Parameters | Typical Eval TPS (Intel Xeon Gold 6530) | Typical Eval TPS (NVIDIA A100) |
|---|---|---|---|
| Llama 3.2 3B | 3 billion | ~52 TPS | ~171 TPS |
| Llama 3.1 8B | 8 billion | ~33 TPS | ~122 TPS |
Larger models generate higher-quality outputs but demand more compute power, while smaller or quantized models offer better TPS at some cost in output accuracy. If your use case can tolerate lower-quality responses, a cost-effective alternative is to run your workloads on a CPU.
Quantization and Performance
Quantization reduces the precision of model weights, typically from 16-bit floating point (FP16) to smaller formats like 4-bit. This is a common strategy for improving AI model inference performance, especially on CPUs or edge devices with limited resources.
Benefits of Quantization:
- Faster Inference: Lower precision reduces the computational load, enabling faster token generation.
- Reduced Memory Usage: Quantized models consume less memory, allowing larger models to fit on limited hardware.
- Lower Power Consumption: The reduced compute demand translates to lower energy usage, which is critical for large-scale deployments.
Quantization Levels and Types:
- INT8 Quantization: One of the most common formats for production environments. Balances speed and acceptable accuracy loss.
- 4-bit Quantization (FP4, INT4): Emerging methods that further compress models. Useful for specialized workloads but may introduce accuracy challenges.
- Dynamic vs. Static Quantization: Dynamic quantization quantizes weights ahead of time but computes activation scales on the fly during inference, while static quantization calibrates both weights and activations ahead of time, which typically offers better performance.
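As a small, self-contained illustration of dynamic INT8 quantization, the PyTorch sketch below quantizes the linear layers of a toy model; a real deployment would apply the same call to a loaded language model or encoder.

```python
# Sketch: dynamic INT8 quantization of a model's linear layers with PyTorch.
# Weights are converted to INT8 ahead of time; activation scales are computed at runtime.
import torch
import torch.nn as nn

# Toy stand-in for a real model; in practice this would be a loaded LLM or encoder.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # INT8 weights
)

x = torch.randn(1, 4096)
with torch.inference_mode():
    out = quantized(x)

print(out.shape)  # torch.Size([1, 4096])
```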
Trade-Offs:
Quantization may lead to slight accuracy degradation, especially on tasks requiring nuanced understanding or precise outputs. The extent of impact depends on the model architecture, the quantization method, and the dataset.
Quantized models are widely used when deploying large language models on CPUs or when running AI workloads in private clouds where resource efficiency is a priority.
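For CPU deployments specifically, a common pattern is running a 4-bit GGUF-quantized model through llama-cpp-python. The sketch below assumes you have already downloaded such a file; the path and filename are hypothetical.

```python
# Sketch: run a 4-bit GGUF-quantized model on CPU with llama-cpp-python.
# The model path below is hypothetical; point it at a GGUF file you have downloaded.
# Requires: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.2-3b-instruct-Q4_K_M.gguf",  # hypothetical path
    n_ctx=4096,   # context window
    n_threads=8,  # CPU threads to use
)

result = llm(
    "Summarize why quantization helps CPU inference in one sentence.",
    max_tokens=64,
)
print(result["choices"][0]["text"])
```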
Inferencing Tools and Optimization
Several software tools help optimize inference performance by maximizing hardware utilization, reducing latency, and improving throughput. They allow models to run more efficiently across hardware platforms, helping organizations lower costs, increase scalability, and meet performance targets for demanding workloads:
- OpenVINO: Intel’s toolkit for accelerating AI workloads on CPUs, offering up to a 2x speed improvement by using Intel AMX (Advanced Matrix Extensions), an instruction set designed to improve inference performance.
- Intel Extension for PyTorch (IPEX): A newer optimization layer for PyTorch models running on Intel CPUs, achieving up to 25% gains in tests.
- NVIDIA TensorRT: For GPU inference optimization, improving throughput for large models.
- Hugging Face Optimum: Integrates various backends for model acceleration.
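To show what these tools look like in practice, the sketch below applies Intel Extension for PyTorch to an already-loaded model. It assumes an Intel CPU with IPEX installed, and `gpt2` stands in for whatever model you are benchmarking.

```python
# Sketch: apply Intel Extension for PyTorch (IPEX) optimizations to a loaded model.
# Assumes an Intel CPU; "gpt2" is only an example model.
# Requires: pip install torch intel-extension-for-pytorch transformers
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # substitute your target model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# ipex.optimize applies operator fusion, memory-layout, and dtype optimizations for Intel CPUs.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Tokens per second is", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```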
Measuring Performance in Real-World Scenarios
Concurrency Considerations
Concurrent inference requests affect model performance, particularly on shared hardware:
- On CPUs, high concurrency leads to contention for cores and memory bandwidth.
- On GPUs, concurrency is limited by available CUDA cores and VRAM.
Planning for concurrency is essential in multi-user environments or APIs that require rapid response times.
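A simple way to observe these effects is to send batches of parallel requests and watch per-request eval TPS fall as concurrency rises. The sketch below assumes an OpenAI-compatible chat completions endpoint (for example, one served by vLLM); the URL, model name, and prompt are placeholders.

```python
# Sketch: measure per-request output TPS under increasing concurrency.
# Assumes an OpenAI-compatible endpoint; the URL and model name are placeholders.
# Requires: pip install requests
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "llama-3.2-3b"                             # placeholder model name


def one_request() -> float:
    """Send one request and return its approximate output tokens per second."""
    start = time.perf_counter()
    resp = requests.post(
        URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": "Write a 200-word summary of CPU inference."}],
            "max_tokens": 300,
        },
        timeout=300,
    )
    elapsed = time.perf_counter() - start
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    # Wall-clock time includes prompt processing and network overhead, so this is approximate.
    return completion_tokens / elapsed


for concurrency in (1, 4, 16):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        rates = list(pool.map(lambda _: one_request(), range(concurrency)))
    print(f"concurrency={concurrency:2d}  mean eval TPS per request: {sum(rates) / len(rates):.1f}")
```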
Model Size and Token Ratio Impact
The ratio of input tokens to output tokens varies by use case:
- Conversational AI often has a low input-to-output ratio (short prompts, long responses).
- Search or summarization tasks may have large input contexts but produce small outputs.
Measuring eval TPS ensures the most compute-heavy part of the process is tested. Benchmarking should also document the token ratio used in tests.
Use-Case Specific Targets
Real-world requirements vary:
- Code completion needs low latency and may target 50–100 TPS per user.
- Batch document summarization tolerates slower TPS but requires throughput efficiency.
- RAG pipelines combining search and generation may mix large inputs and long outputs, stressing both prompt and eval performance.
Conclusion
Accurately measuring AI model performance requires a focus on tokens per second, specifically output generation rates. Understanding tokenization, model size, quantization, and inference tool selection is essential for comparing hardware and software environments. Clear benchmarks enable better capacity planning, especially when deploying AI workloads on private clouds or bare metal.