How to start and use the Leaderboard to filter and compare benchmark results
The Inference Engine Arena Leaderboard provides a powerful way to compare benchmark results across different inference engines, models, and hardware configurations. This guide will walk you through using the leaderboard effectively.
Local Leaderboard
Start the local leaderboard server from the command line:
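The exact invocation depends on how you installed the project; the following is a minimal sketch, assuming the package exposes an `arena` CLI with a `leaderboard` subcommand (check the project's README for the actual entry point):

```bash
# Launch the local leaderboard web server (command name assumed; see the README)
arena leaderboard
```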
The command prints a local URL and a public URL. Once the server is running, open your browser and navigate to the local URL, or use the public URL; you can share the public URL to allow others to access the leaderboard remotely.
The leaderboard provides powerful filtering capabilities to help you focus on relevant benchmark results.
Filter Panel
The filter panel on the left side of the leaderboard allows you to narrow down results by inference engine, model, hardware, and other attributes. There are three checkboxes here, including Show all precision and Show details, along with Advanced filters.
As you select filters, the leaderboard updates in real-time to show only matching results.
The main leaderboard view displays results in a comprehensive table format. You can also click the Show details button to see detailed information about the selected sub-run.
Table View
The results table shows key metrics for each benchmark run in its columns. With Show all precision selected, results for every precision are listed, and if you select Show details, you can see more information about these sub-runs, such as random-input-len, random-output-len, etc.

The scatter plot view provides a powerful way to visualize relationships between different arguments, and it is synchronized in real time with the filter results above.
Scatter Plot
The scatter plot view provides an interactive visualization of benchmark results, accessible below the table view:
Each scatter plot displays:
In the example above, we’re comparing vLLM and SGLang performance on NVIDIA H100 80GB HBM3 (4x) GPUs running the summarization benchmark with the NousResearch/Meta-Llama-3.1-8B model. The visualization clearly shows four distinct performance curves:
This visualization makes it easy to compare engines and configurations at a glance and to spot throughput and latency trade-offs.
Our visualization methodology is inspired by NVIDIA CEO Jensen Huang’s GTC March 2025 Keynote presentation, which demonstrated the effectiveness of this approach for performance analysis:
Global Leaderboard
The global leaderboard connects you with the broader inference engine community. Visit https://iearena.org/ to view and share your benchmark results with others.
Community Results
The global leaderboard aggregates benchmark results from users worldwide, and it also provides more filtering options than the local leaderboard:
In addition to the three checkboxes mentioned in the local leaderboard, this section includes an additional one: Show only results from verified sources.
On the global leaderboard, you can:
Inference Engine Arena collects a comprehensive set of metrics to help you evaluate and compare the performance of different inference engines. This guide explains the key metrics and how to interpret them. The metrics answer questions such as:
How fast can the engine process tokens?
How responsive is the engine to requests?
How efficiently does the engine use GPU memory?
How well does the engine handle multiple simultaneous requests?
Input Throughput
Input Throughput measures how quickly the engine can process input tokens (the prompt text).
This metric is particularly important for workloads with long prompts or context windows.
Output Throughput
Output Throughput measures how quickly the engine can generate new tokens (the response text).
This metric is crucial for generation-heavy workloads where response speed matters.
Total Throughput
Total Throughput measures the combined rate of processing input and generating output tokens.
This metric provides an overall view of the engine’s processing capacity.
Request Throughput
Request Throughput measures how many complete requests the engine can handle per second.
This is particularly important for high-traffic applications.
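To make the four definitions concrete, here is a minimal sketch of how these throughput numbers can be derived from raw benchmark records. The record fields (prompt_tokens, completion_tokens) and the single wall-clock duration are illustrative assumptions, not the tool's actual data model.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    prompt_tokens: int       # tokens in the input prompt
    completion_tokens: int   # tokens generated in the response

def throughput_summary(records: list[RequestRecord], duration_s: float) -> dict:
    """Compute the four throughput metrics over one benchmark run."""
    input_tokens = sum(r.prompt_tokens for r in records)
    output_tokens = sum(r.completion_tokens for r in records)
    return {
        "input_throughput_tok_s": input_tokens / duration_s,
        "output_throughput_tok_s": output_tokens / duration_s,
        "total_throughput_tok_s": (input_tokens + output_tokens) / duration_s,
        "request_throughput_req_s": len(records) / duration_s,
    }

# Example: 100 identical requests completed in 20 seconds
records = [RequestRecord(prompt_tokens=512, completion_tokens=128) for _ in range(100)]
print(throughput_summary(records, duration_s=20.0))
```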
Time to First Token (TTFT)
Time to First Token (TTFT) measures how long it takes from sending a request until receiving the first token of the response.
TTFT is critical for interactive applications where user experience depends on perceived responsiveness.
Time Per Output Token (TPOT)
Time Per Output Token (TPOT) measures the average time it takes to generate each output token.
TPOT determines how smoothly text is generated after the initial response begins.
End-to-End Latency
End-to-End Latency measures the total time from sending a request until receiving the complete response.
This metric is important for understanding the overall user experience.
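The following sketch shows how these three latency metrics relate to one another for a single streamed request. The timestamp names are hypothetical, but the arithmetic (TPOT averaged over the tokens after the first one) follows the usual convention.

```python
def latency_metrics(request_sent: float, first_token: float,
                    last_token: float, output_tokens: int) -> dict:
    """Derive TTFT, TPOT, and end-to-end latency from request timestamps (seconds)."""
    ttft = first_token - request_sent
    e2e = last_token - request_sent
    # TPOT averages the time spent per token *after* the first one arrives.
    tpot = (e2e - ttft) / max(output_tokens - 1, 1)
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_latency_s": e2e}

# Example: first token after 0.25 s, full 200-token response after 4.25 s
print(latency_metrics(request_sent=0.0, first_token=0.25,
                      last_token=4.25, output_tokens=200))
```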
Peak GPU Memory Usage
Peak GPU Memory Usage measures the maximum amount of GPU memory used during inference.
This metric helps determine hardware requirements and how many models can fit on a single GPU.
Memory Efficiency
Memory Efficiency is calculated as throughput per GB of memory used.
This metric helps compare how efficiently different engines use the available GPU memory.
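A sketch of how memory efficiency can be computed once the peak is known: polling NVML is one common way to record peak usage during a run, and the metric itself is just throughput divided by peak memory in GB. The helper below assumes the `pynvml` package; it is an illustration, not the tool's measurement code.

```python
import pynvml

def gpu_memory_used_gb(device_index: int = 0) -> float:
    """Return currently used GPU memory in GB; poll this during a run to record the peak."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    used_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()
    return used_bytes / 1024**3

def memory_efficiency(total_throughput_tok_s: float, peak_memory_gb: float) -> float:
    """Memory efficiency: tokens per second delivered per GB of peak GPU memory."""
    return total_throughput_tok_s / peak_memory_gb

# Example: 3,200 tok/s at a 64 GB peak -> 50 tok/s per GB
print(memory_efficiency(3200.0, 64.0))
```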
Scaling Efficiency
Scaling Efficiency measures how throughput increases as concurrency increases.
This metric helps understand how well the engine utilizes parallelism.
Max Effective Concurrency
Max Effective Concurrency is the concurrency level beyond which additional concurrent requests no longer improve throughput.
This metric helps determine the optimal concurrency setting for your workload.
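Both concurrency metrics can be derived from a throughput-versus-concurrency sweep, as in the sketch below. The exact formulas the tool uses are not specified here, so the linear-scaling baseline and the 5% improvement threshold are illustrative assumptions.

```python
def scaling_efficiency(throughput_by_concurrency: dict[int, float]) -> dict[int, float]:
    """Throughput at concurrency c relative to perfect linear scaling from c = 1."""
    base = throughput_by_concurrency[1]
    return {c: tput / (base * c) for c, tput in throughput_by_concurrency.items()}

def max_effective_concurrency(throughput_by_concurrency: dict[int, float],
                              min_gain: float = 0.05) -> int:
    """Largest concurrency whose throughput still improves on the previous level by > min_gain."""
    levels = sorted(throughput_by_concurrency)
    best = levels[0]
    for prev, cur in zip(levels, levels[1:]):
        gain = throughput_by_concurrency[cur] / throughput_by_concurrency[prev] - 1
        if gain <= min_gain:
            break
        best = cur
    return best

sweep = {1: 900.0, 2: 1700.0, 4: 3000.0, 8: 3100.0}   # tok/s at each concurrency level
print(scaling_efficiency(sweep))          # e.g. {1: 1.0, 2: 0.94, 4: 0.83, 8: 0.43}
print(max_effective_concurrency(sweep))   # 4: going to 8 adds little throughput
```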
When analyzing benchmark results, consider:
Workload Characteristics: Different engines excel at different types of workloads. Match the metrics that matter most to your specific use case.
Hardware Utilization: Check how efficiently each engine utilizes your hardware. Some engines may perform better on specific GPU architectures.
Trade-offs: There’s often a trade-off between throughput and latency. Decide which is more important for your application.
Scaling: Look at how performance scales with concurrency to understand how the engine will behave under load.
Inference Engine Arena provides various ways to visualize metrics, including the leaderboard's table view and scatter plots described above.
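For instance, the kind of throughput-versus-latency scatter the leaderboard renders can be reproduced locally. This is a minimal matplotlib sketch with made-up sample points and engine names, not the leaderboard's actual plotting code.

```python
import matplotlib.pyplot as plt

# Hypothetical sample points: (TTFT in seconds, output throughput in tok/s) per engine
results = {
    "engine-a": [(0.12, 2400), (0.25, 3100), (0.60, 3500)],
    "engine-b": [(0.10, 2100), (0.22, 2900), (0.55, 3300)],
}

for engine, points in results.items():
    ttft, throughput = zip(*points)
    plt.scatter(ttft, throughput, label=engine)

plt.xlabel("Time to First Token (s)")
plt.ylabel("Output throughput (tok/s)")
plt.title("Throughput vs. latency across concurrency levels")
plt.legend()
plt.show()
```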