Paper page - Accelerating LLM Inference with Staged Speculative Decoding #495
Labels
- Algorithms: Sorting, Learning or Classifying. All algorithms go here.
- llm: Large Language Models
- llm-experiments: Experiments with large language models
- llm-serving-optimisations: Tips, tricks and tools to speed up inference of large language models
- MachineLearning: ML Models, Training and Inference
- Papers: Research papers
- Research: Personal research notes for a topic
- TIL: Short notes or tips on coding, linux, llms, ml, etc.
Paper Page - Accelerating LLM Inference with Staged Speculative Decoding
Published on Aug 9, 2023 | Featured in Daily Papers on Aug 10, 2023
Authors: Benjamin Spector, Chris Ré
Abstract
Recent advances with large language models (LLMs) have highlighted their diverse capabilities. This paper proposes a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, the algorithm restructures the speculative batch as a tree, reducing generation costs and increasing the expected tokens per batch. Second, it adds a second stage of speculative decoding. Together, these reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model, all while perfectly preserving output quality.
Read the Paper »
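To make the two ideas concrete, here is a minimal, self-contained Python sketch of the flow; this is not the authors' implementation. Toy deterministic functions (`target_next`, `draft_next`, `ngram_next`) stand in for the large, draft, and lowest-cost stages, and all names (`build_tree`, `draft_chain`, `verify`) are hypothetical. Verification here is greedy: a speculated token is kept only when it matches the larger model's argmax, which preserves greedy-decoding output exactly (the paper's sampling setting instead uses rejection sampling to preserve the full output distribution).

```python
# Hedged sketch of staged speculative decoding over a toy vocabulary.
# Deterministic toy "models" stand in for the real GPT-2 stages.

VOCAB = ["the", "cat", "sat", "on", "mat", "."]
CALLS = {"target": 0, "draft": 0}  # batched forward passes per model

def target_next(prefix):
    # Stand-in for the large target model's greedy next token.
    return VOCAB[len(prefix) % len(VOCAB)]

def draft_next(prefix):
    # Stand-in for the small draft model: agrees with the target except
    # at every fourth position, where it guesses wrong.
    if len(prefix) % 4 == 3:
        return VOCAB[(len(prefix) + 1) % len(VOCAB)]
    return target_next(prefix)

def draft_top2(prefix):
    # The draft's two best candidates; used to branch the token tree.
    best = draft_next(prefix)
    return [best, VOCAB[(VOCAB.index(best) - 1) % len(VOCAB)]]

def ngram_next(prefix):
    # Stage-2 speculator: an even cheaper model (think N-gram table)
    # that speculates tokens *for the draft model*.
    return VOCAB[0] if len(prefix) % 5 == 0 else draft_next(prefix)

def draft_chain(prefix, k):
    # Second speculative stage: the cheapest model proposes a few tokens,
    # then one "batched" draft pass verifies them, mirroring the outer loop.
    chain = []
    while len(chain) < k:
        cur = list(prefix) + chain
        guesses = []
        for _ in range(min(3, k - len(chain))):
            guesses.append(ngram_next(cur + guesses))
        CALLS["draft"] += 1  # one batched draft verification pass
        for i, guess in enumerate(guesses):
            truth = draft_next(cur + guesses[:i])
            chain.append(truth)            # the draft's token is always kept
            if truth != guess or len(chain) == k:
                break                      # discard the remaining guesses
    return chain

def build_tree(prefix, depth):
    # Restructure the speculative batch as a tree: branch on the draft's
    # top-2 at the root, then extend each branch with a draft chain.
    tree = {}
    for tok in draft_top2(prefix):
        node = tree.setdefault(tok, {})
        for nxt in draft_chain(list(prefix) + [tok], depth - 1):
            node = node.setdefault(nxt, {})
    return tree

def verify(prefix, tree):
    # Walk the tree with the target model. In a real system every node's
    # prefix is scored in ONE batched target pass; we count it as one call.
    CALLS["target"] += 1
    accepted, node = [], tree
    while True:
        t = target_next(list(prefix) + accepted)
        accepted.append(t)       # the target's token is always safe to emit
        if t not in node:
            return accepted      # mismatch: discard the rest of the tree
        node = node[t]

def generate(prompt, n_tokens, depth=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out += verify(out, build_tree(out, depth))
    return out[: len(prompt) + n_tokens]

if __name__ == "__main__":
    print(" ".join(generate(["the"], 12)))
    print(f"{CALLS['target']} batched target passes for 12 tokens "
          f"(vs. 12 without speculation); {CALLS['draft']} draft passes")
```

Running the sketch prints exactly the tokens plain greedy decoding would produce, while charging far fewer batched target passes: the tree lets several candidate branches share one verification pass, and the second stage applies the same trick recursively to make drafting itself cheap. The specific speedup printed is an artifact of the toy models, not the paper's 3.16x measurement.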
Suggested labels
{ "label-name": "Algorithm", "description": "Staged speculative decoding algorithm for LLM inference acceleration", "confidence": 91.15 }