
QuickReduce: Up to 3x Faster All-reduce for vLLM and SGLang
Aug 26, 2025 · Therefore, in vLLM and SGLang, we implemented an adaptive algorithm that automatically selects the fastest among custom all-reduce, RCCL, and QuickReduce based on …
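
As a rough illustration of what such backend selection can look like, here is a minimal sketch that dispatches on message size and group size. The function name, backend labels, and byte thresholds are hypothetical placeholders; the actual heuristics in vLLM and SGLang are tuned per platform and differ from this.

```python
# Minimal sketch of adaptive all-reduce backend selection; NOT the actual
# vLLM/SGLang heuristic. Thresholds and labels are illustrative placeholders.

def select_allreduce_backend(tensor_bytes: int, world_size: int) -> str:
    """Pick an all-reduce backend name from message size and group size."""
    if world_size <= 8 and tensor_bytes <= 512 * 1024:
        return "custom_allreduce"   # small messages: low-latency one-shot kernel
    if tensor_bytes <= 16 * 1024 * 1024:
        return "quickreduce"        # mid-size messages: quantized QuickReduce path
    return "rccl"                   # large messages: library collective (RCCL)

if __name__ == "__main__":
    # Example: a 4 MiB tensor reduced across 8 GPUs.
    print(select_allreduce_backend(tensor_bytes=4 * 1024 * 1024, world_size=8))
```
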
3D Scene Reconstruction from the Inside: Explore the Mathematics …
Dec 16, 2025 · We will start by explaining how 3D points in a scene become 3D Gaussians, how they project to 2D splats to produce images, and how rendering, loss functions, and adaptive …
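
For background on the math that post walks through, the standard 3D Gaussian Splatting formulation (stated here as general background, not quoted from the post) represents each point as an anisotropic Gaussian and projects its covariance into screen space:

```latex
% A 3D Gaussian with mean \mu and covariance \Sigma, factored into a rotation R
% and per-axis scales S:
G(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),
\qquad \Sigma = R\,S\,S^{\top} R^{\top}

% Covariance of the projected 2D splat, with W the world-to-camera transform and
% J the Jacobian of the affine-approximated perspective projection:
\Sigma' = J\,W\,\Sigma\,W^{\top} J^{\top}
```
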
Speculative Decoding - Deep Dive — ROCm Blogs
Mar 24, 2025 · This blog shows the performance improvement achieved by applying speculative decoding with Llama models on AMD MI300X GPUs, tested across models, input sizes, and …
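
Independently of any particular framework, speculative decoding follows a draft-then-verify loop. The sketch below shows the greedy-acceptance variant with hypothetical `draft_next` and `target_next` callables; it is a schematic of the algorithm, not the vLLM implementation the blog benchmarks.

```python
# Schematic draft-and-verify loop for speculative decoding (greedy variant).
# `draft_next` and `target_next` are hypothetical stand-ins for a small draft
# model and the large target model; no real framework API is used here.

def speculative_step(prompt, draft_next, target_next, k=4):
    """Propose k draft tokens, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(prompt)
    for _ in range(k):                   # draft model proposes k tokens cheaply
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prompt)
    for tok in draft:                    # target model verifies the proposals
        expected = target_next(ctx)      # (in practice: one batched forward pass)
        if expected != tok:
            accepted.append(expected)    # first disagreement: take the target's token
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted                      # >1 token per target step when drafts match

if __name__ == "__main__":
    seq = ["a", "a", "b", "a"]           # tokens the toy "target" wants to emit

    def target(ctx):                     # toy target: reads off `seq`
        i = len(ctx) - 1
        return seq[i] if i < len(seq) else "a"

    def draft(ctx):                      # toy draft: always guesses "a"
        return "a"

    print(speculative_step(["<s>"], draft, target, k=4))   # -> ['a', 'a', 'b']
```
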
Step-Video-T2V Inference with xDiT on AMD Instinct MI300X GPUs
May 15, 2025 · Text Encoder: Utilizes two bilingual encoders — Hunyuan-CLIP, a bidirectional encoder aligning text and visual features, and Step-LLM, a unidirectional encoder without input …
GEMM Kernel Optimization For AMD GPUs — ROCm Blogs
Feb 6, 2025 · Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural …
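
For reference, the GEMM operation these kernels implement is the standard BLAS level-3 update:

```latex
% General Matrix Multiplication: A is M x K, B is K x N, C is M x N,
% alpha and beta are scalars.
C \leftarrow \alpha\, A B + \beta\, C,
\qquad
C_{ij} \leftarrow \alpha \sum_{k=1}^{K} A_{ik} B_{kj} + \beta\, C_{ij}
```
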
Practical, Fault‑Robust Distributed Inference for DeepSeek on …
Nov 12, 2025 · Future Work: Our future efforts will focus on enhancing parallel efficiency and reducing communication–computation latency in large-scale MoE inference. In particular, we …
Applications & models — ROCm Blogs
Jan 8, 2026 · Applications & models: Explore the latest blogs about applications and models in the ROCm ecosystem, including machine learning frameworks, AI models, and application …
ROCm 7.0: An AI-Ready Powerhouse for Performance, Efficiency, …
Sep 16, 2025 · User & Project Quota Management: Ensure fair and efficient compute distribution across teams with smart quota enforcement and adaptive scaling. Telemetry & Analytics: Gain …
High-Throughput BERT-L Pre-Training on AMD Instinct™ GPUs: A …
Jun 3, 2025 · The training uses the distributed fused LAMB (Layer-wise Adaptive Moments) optimizer, combined with a linear warmup and polynomial decay learning rate scheduler. The key training …
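
The warmup-plus-polynomial-decay schedule mentioned in that snippet can be written compactly. The peak learning rate, warmup length, total steps, and decay power below are illustrative placeholders, not the blog's actual hyperparameters.

```python
# Linear warmup followed by polynomial decay, as commonly paired with LAMB.
# peak_lr, warmup_steps, total_steps, and power are illustrative values only.

def lr_at(step: int, peak_lr: float = 4e-3, warmup_steps: int = 700,
          total_steps: int = 7000, power: float = 1.0, end_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)                 # linear ramp-up
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    return (peak_lr - end_lr) * (1.0 - progress) ** power + end_lr   # polynomial decay

if __name__ == "__main__":
    for s in (0, 350, 700, 3500, 7000):
        print(s, round(lr_at(s), 6))
```
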
AITER: AI Tensor Engine For ROCm — ROCm Blogs
Mar 21, 2025 · We introduce AMD's AI Tensor Engine for ROCm (AITER), our centralized, high-performance AI operators repository, designed to significantly accelerate AI workloads on …