  1. QuickReduce: Up to 3x Faster All-reduce for vLLM and SGLang

    Aug 26, 2025 · Therefore, in vLLM and SGLang, we implemented an adaptive algorithm that automatically selects the fastest among custom all-reduce, RCCL, and QuickReduce based on …
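    As a rough illustration of what such a selection layer can look like (the snippet's actual criterion is truncated), the sketch below times each backend once per message-size bucket and caches the winner. The backend stubs, the bucketing rule, and all constants are assumptions, not the vLLM/SGLang implementation.

      import time

      # Rough sketch of an adaptive all-reduce dispatcher: time each available
      # backend once per message-size bucket and reuse the fastest afterwards.
      # Backend stubs and the bucketing rule are illustrative assumptions, not
      # the actual vLLM/SGLang selection logic.
      def _stub(cost_per_byte):
          # Stand-in for a real all-reduce call (custom kernel, RCCL, QuickReduce).
          return lambda nbytes: time.sleep(cost_per_byte * nbytes)

      BACKENDS = {
          "custom_allreduce": _stub(1e-9),
          "rccl": _stub(2e-9),
          "quickreduce": _stub(1.5e-9),
      }
      _best = {}  # size bucket -> name of the fastest backend measured so far

      def adaptive_allreduce(nbytes):
          bucket = nbytes.bit_length()  # coarse power-of-two message-size bucket
          if bucket not in _best:
              timings = {}
              for name, fn in BACKENDS.items():  # one-off timing pass per bucket
                  t0 = time.perf_counter()
                  fn(nbytes)
                  timings[name] = time.perf_counter() - t0
              _best[bucket] = min(timings, key=timings.get)
          BACKENDS[_best[bucket]](nbytes)

      adaptive_allreduce(1 << 20)  # e.g. a 1 MiB message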

  2. 3D Scene Reconstruction from the Inside: Explore the Mathematics …

    Dec 16, 2025 · We will start by explaining how 3D points in a scene become 3D Gaussians, how they project to 2D splats to produce images, and how rendering, loss functions, and adaptive …
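    Since the snippet is cut off, the standard formulation from the 3D Gaussian splatting literature is sketched below; the notation is an assumption and may differ from the blog's. A 3D Gaussian with center \mu and covariance \Sigma (factored into a rotation R and scale S) is

      G(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),
      \qquad \Sigma = R\,S\,S^{\top}R^{\top},

    and its covariance projects to a 2D splat via

      \Sigma' = J\,W\,\Sigma\,W^{\top}J^{\top},

    where W is the world-to-camera transform and J is the Jacobian of the local affine approximation of the perspective projection.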

  3. Speculative Decoding - Deep Dive — ROCm Blogs

    Mar 24, 2025 · This blog shows the performance improvement achieved by applying speculative decoding with Llama models on AMD MI300X GPUs, tested across models, input sizes, and …

  4. Step-Video-T2V Inference with xDiT on AMD Instinct MI300X GPUs

    May 15, 2025 · Text Encoder: Utilizes two bilingual encoders — Hunyuan-CLIP, a bidirectional encoder aligning text and visual features, and Step-LLM, a unidirectional encoder without input …

  5. GEMM Kernel Optimization For AMD GPUs — ROCm Blogs

    Feb 6, 2025 · Matrix multiplication underlies critical computational pathways in AI, with General Matrix Multiplication (GEMM) operations serving as performance-critical kernels in neural …
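    For context, the GEMM kernel referenced here computes, in the standard BLAS convention (not specific to this blog),

      C \leftarrow \alpha\,A B + \beta\,C,
      \qquad A \in \mathbb{R}^{M \times K},\;
      B \in \mathbb{R}^{K \times N},\;
      C \in \mathbb{R}^{M \times N},

    with scalars \alpha and \beta, so kernel optimization largely comes down to how the M, N, and K loops are tiled and scheduled on the GPU.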

  6. Practical, Fault‑Robust Distributed Inference for DeepSeek on …

    Nov 12, 2025 · Future Work: Our future efforts will focus on enhancing parallel efficiency and reducing communication–computation latency in large-scale MoE inference. In particular, we …

  7. Applications & models — ROCm Blogs

    Jan 8, 2026 · Applications & models: Explore the latest blogs about applications and models in the ROCm ecosystem, including machine learning frameworks, AI models, and application …

  8. ROCm 7.0: An AI-Ready Powerhouse for Performance, Efficiency, …

    Sep 16, 2025 · User & Project Quota Management: Ensure fair and efficient compute distribution across teams with smart quota enforcement and adaptive scaling. Telemetry & Analytics: Gain …

  9. High-Throughput BERT-L Pre-Training on AMD Instinct™ GPUs: A …

    Jun 3, 2025 · The training uses the distributed fused LAMB (Layer-wise Adaptive Moments) optimizer, combined with a linear warmup and polynomial decay learning rate scheduler. The key training …
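    As a minimal sketch of the named schedule (linear warmup followed by polynomial decay), assuming illustrative hyperparameter values rather than the blog's actual training configuration:

      # Linear-warmup + polynomial-decay learning-rate schedule.
      # All hyperparameter values are illustrative assumptions, not the
      # configuration used in the blog.
      def lr_at_step(step, base_lr=4e-3, warmup_steps=100,
                     total_steps=7000, power=1.0, end_lr=0.0):
          if step < warmup_steps:
              return base_lr * step / max(1, warmup_steps)  # linear warmup
          progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
          return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr  # polynomial decay

      print([round(lr_at_step(s), 5) for s in (0, 50, 100, 3500, 7000)])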

  10. AITER: AI Tensor Engine For ROCm — ROCm Blogs

    Mar 21, 2025 · We introduce AMD's AI Tensor Engine for ROCm (AITER), our centralized high-performance AI operators repository, designed to significantly accelerate AI workloads on …