
Tencent Hunyuan Releases HPC-Ops: A High-Performance LLM Library for Inference Operators

Tencent Hunyuan has open-sourced HPC-Ops, a production-grade library of high-performance operators for large language model (LLM) inference. HPC-Ops focuses on low-level CUDA kernels for key operators such as Attention, Grouped GEMM, and Fused MoE, and exposes them via clean C++ and Python APIs for integration into existing inference stacks.

HPC-Ops already runs at scale across Tencent's internal services. In that deployment it delivers about a 30 percent queries-per-minute (QPM) improvement for the Tencent-HY models and a 17 percent improvement for the DeepSeek models on standard inference cards. These gains are reported at the service level, so they reflect the impact of stacking faster kernels inside a real inference pipeline.

Scope and design of HPC-Ops

HPC-Ops is a production-grade, high-performance, and easy-to-use LLM inference operator library developed by the Tencent Hunyuan AI Infra team. The project does not attempt to replace serving frameworks. Instead it provides kernels and clean APIs that can be called from systems that already handle scheduling, KV cache management, batching, and transport.

The API is designed for seamless use inside popular inference frameworks such as vLLM and SGLang. That means a framework team can swap in HPC-Ops kernels behind their existing operator interfaces without changing the external behavior of their servers.
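
As a rough illustration of what that drop-in pattern looks like from the framework side, the sketch below wraps a decode-attention operator in a dispatcher. The module and function names (`hpc_ops`, `paged_attention_decode`) and the helper `gather_paged_cache` are illustrative assumptions, not the library's documented API.

```python
# Hypothetical sketch: dispatching to a drop-in kernel behind a stable operator
# interface, so the serving framework's external behavior does not change.
# `hpc_ops` and `paged_attention_decode` are assumed names for illustration.
import torch

try:
    import hpc_ops  # assumed binding name; consult the repo for the real one
    _HAS_HPC_OPS = True
except ImportError:
    _HAS_HPC_OPS = False


def decode_attention(q, k_cache, v_cache, block_table, seq_lens):
    """Single-token decode attention with the same signature on both paths."""
    if _HAS_HPC_OPS:
        # Optimized path: the paged KV cache stays in place and the kernel
        # reads it through the block table.
        return hpc_ops.paged_attention_decode(q, k_cache, v_cache, block_table, seq_lens)
    # Reference path: gather the paged cache into contiguous tensors first,
    # then fall back to PyTorch's built-in scaled dot-product attention.
    k, v = gather_paged_cache(k_cache, v_cache, block_table, seq_lens)  # framework helper (assumed)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)
```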

HPC-Ops is written in C++ and CUDA, with CuTe and CUTLASS as building blocks. The kernels are structured as small, readable examples that also double as a modern CUDA tutorial.

Kernel performance characteristics

The project publishes observed speedup numbers for each operator against established baselines. These are microbenchmarks, and the team emphasizes that performance varies across environments and workloads, but they indicate the upper bound of what the kernels can deliver.

For bf16 Attention, compared with FlashInfer, FlashAttention-2, FlashAttention-3, and TensorRT-LLM, HPC-Ops reports speedups of up to 1.33 times in prefilling and up to 2.22 times in decoding. For fp8 Attention, compared with FlashInfer, FlashAttention-3, and TensorRT-LLM, it reports up to 1.12 times in prefilling and up to 2.0 times in decoding.

For fp8 Fused MoE, compared with TensorRT-LLM and vLLM, the maximum observed speedup is up to 1.49 times in prefilling and 1.14 times in decoding. For fp8 GroupGEMM, compared with DeepGEMM, the reported gains reach up to 1.1 times in prefilling and 1.88 times in decoding.

These numbers matter because decoding is often the latency bottleneck in production serving, where batch sizes are small and memory bandwidth dominates. The fact that Attention and GroupGEMM show their largest relative gains in decoding suggests that HPC-Ops targets the part of the pipeline that users notice most.
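
A quick back-of-the-envelope calculation makes the memory-bound nature of decode concrete. The figures below are illustrative assumptions rather than measurements; they simply compare the arithmetic intensity of a weight-stationary decode GEMM at different batch sizes.

```python
# Rough arithmetic-intensity estimate for a weight-stationary decode GEMM.
# All figures are illustrative assumptions, not benchmark results.

def decode_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weight traffic during one decode step.

    Each of the `batch_size` tokens performs 2 FLOPs (multiply + add) per
    weight element, while each weight is read from HBM once per step.
    """
    return 2.0 * batch_size / bytes_per_param


# A GPU whose peak-compute / memory-bandwidth ratio sits in the hundreds of
# FLOPs per byte is starved by small decode batches: the kernel spends most
# of its time waiting on weight loads, which is why fp8 weights and better
# decode kernels pay off disproportionately there.
for batch in (1, 8, 64, 512):
    bf16 = decode_intensity(batch, bytes_per_param=2.0)
    fp8 = decode_intensity(batch, bytes_per_param=1.0)
    print(f"batch={batch:4d}  bf16: {bf16:6.1f} FLOPs/B   fp8: {fp8:6.1f} FLOPs/B")
```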

Operator families and precision support

The current release organizes its functionality into three operator families:

  • Attention kernels cover both prefilling and decoding and include support for paged attention. Paged attention, as used in frameworks such as vLLM, stores key and value cache blocks in a paged layout, which improves memory reuse for long sequences (a minimal sketch of the paged lookup follows this list).
  • Grouped GEMM is implemented as a quantized GroupGEMM with fp8 weights. HPC-Ops supports both block-wise and per-tensor scaling, so teams can trade off quantization granularity against scale storage and dequantization cost.
  • Fused MoE fuses expert routing and expert computation into a single operator. It likewise uses fp8 expert weights and supports both block-wise and per-tensor scaling.
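
To make the paged-attention bullet concrete, here is a minimal sketch of the block-table lookup that a paged KV cache implies. The block size and tensor layout are assumptions in the spirit of vLLM-style paging, not HPC-Ops' actual data structures.

```python
# Minimal sketch of a paged KV-cache lookup. Block size and layout are
# illustrative assumptions, not HPC-Ops internals.
import torch

BLOCK_SIZE = 16  # tokens per KV cache block (assumed)


def gather_kv(k_cache, v_cache, block_table, seq_len):
    """Gather the keys/values for one sequence from a paged cache.

    k_cache, v_cache: [num_blocks_total, BLOCK_SIZE, num_heads, head_dim]
    block_table:      [max_blocks] physical block ids for this sequence
    """
    num_blocks = (seq_len + BLOCK_SIZE - 1) // BLOCK_SIZE
    blocks = block_table[:num_blocks]                    # logical -> physical block ids
    k = k_cache[blocks].reshape(-1, *k_cache.shape[2:])  # [num_blocks * BLOCK_SIZE, H, D]
    v = v_cache[blocks].reshape(-1, *v_cache.shape[2:])
    return k[:seq_len], v[:seq_len]                      # trim padding in the last block
```

Because only small fixed-size blocks ever need to be contiguous, sequences of very different lengths can share one physical cache pool, which is what improves memory reuse for long contexts.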

For all of these operators, HPC-Ops provides native support for bf16 and fp8 data types. That is in line with the broader production trend of moving inference to lower-precision formats that preserve accuracy while reducing memory bandwidth pressure and improving tensor core utilization.
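
The difference between per-tensor and block-wise fp8 scaling mentioned above can be illustrated with a short quantization sketch. The e4m3 format and the 128x128 block shape are assumptions chosen to mirror common fp8 recipes; they are not taken from HPC-Ops documentation.

```python
# Hedged sketch of two fp8 weight-scaling granularities. The block shape and
# format choice are assumptions for illustration only.
import torch

FP8_MAX = 448.0  # largest representable magnitude in float8 e4m3


def quantize_per_tensor(w: torch.Tensor):
    """One scale for the whole tensor: minimal metadata, coarsest granularity."""
    scale = w.abs().max() / FP8_MAX
    return (w / scale).to(torch.float8_e4m3fn), scale


def quantize_block_wise(w: torch.Tensor, block: int = 128):
    """One scale per (block x block) tile: more scales to store and apply,
    but an outlier in one tile no longer compresses the rest of the tensor."""
    rows, cols = w.shape
    tiles = w.reshape(rows // block, block, cols // block, block)
    scales = tiles.abs().amax(dim=(1, 3), keepdim=True) / FP8_MAX
    q = (tiles / scales).to(torch.float8_e4m3fn).reshape(rows, cols)
    return q, scales.squeeze()
```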

Key Takeaways

  • Tencent Hunyuan open-sourced HPC-Ops as a production-grade library for LLM inference on NVIDIA SM90 GPUs, including the H20, with C++ and CUDA kernels built on CuTe and CUTLASS.
  • In production deployments HPC-Ops reports about a 30 percent QPM gain for the Tencent-HY models and about a 17 percent QPM gain for the DeepSeek models on standard inference cards.
  • Operator microbenchmarks show speedups of up to 2.22 times for bf16 Attention decoding, up to 2.0 times for fp8 Attention decoding, up to 1.49 times for fp8 Fused MoE prefilling, and up to 1.88 times for fp8 GroupGEMM decoding, compared with strong baselines such as FlashInfer, FlashAttention, TensorRT-LLM, vLLM, and DeepGEMM.
  • The library focuses on three operator families: paged Attention, GroupGEMM with fp8 weights, and Fused MoE with fp8 expert weights, with both block-wise and per-tensor scaling and native support for bf16 and fp8.
  • HPC-Ops is designed as a kernel layer that integrates into existing inference frameworks such as vLLM and SGLang, and the roadmap points to sparse attention for long-context LLMs, extended quantization including 4-bit and 8-bit techniques, and better kernels for multi-GPU compute and communication.



Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
