Paper page - Accelerating LLM Inference with Staged Speculative Decoding #495
Labels
- Algorithms: Sorting, Learning or Classifying. All algorithms go here.
- llm: Large Language Models
- llm-experiments: Experiments with large language models
- llm-serving-optimisations: Tips, tricks and tools to speed up inference of large language models
- MachineLearning: ML Models, Training and Inference
- Papers: Research papers
- Research: Personal research notes for a topic
- TIL: Short notes or tips on coding, linux, llms, ml, etc.
Paper Page - Accelerating LLM Inference with Staged Speculative Decoding
Published on Aug 9, 2023 | Featured in Daily Papers on Aug 10, 2023
Authors: Benjamin Spector, Chris Ré
Abstract
Recent advances with large language models (LLMs) have highlighted their diverse capabilities. This paper proposes a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, the algorithm restructures the speculative batch as a tree, reducing generation costs and increasing the expected tokens per batch. Second, it adds a second stage of speculative decoding. Together, these reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model, all while perfectly preserving output quality.
Read the Paper »
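To make the two ideas concrete, here is a minimal, self-contained Python sketch of the flow; this is not the authors' implementation. Toy deterministic functions (`target_next`, `draft_next`, `ngram_next`) stand in for the large, draft, and lowest-cost stages, and all names (`build_tree`, `draft_chain`, `verify`) are hypothetical. Verification here is greedy: a speculated token is kept only when it matches the larger model's argmax, which preserves greedy-decoding output exactly (the paper's sampling setting instead uses rejection sampling to preserve the full output distribution).

```python
# Hedged sketch of staged speculative decoding over a toy vocabulary.
# Deterministic toy "models" stand in for the real GPT-2 stages.

VOCAB = ["the", "cat", "sat", "on", "mat", "."]
CALLS = {"target": 0, "draft": 0}  # batched forward passes per model

def target_next(prefix):
    # Stand-in for the large target model's greedy next token.
    return VOCAB[len(prefix) % len(VOCAB)]

def draft_next(prefix):
    # Stand-in for the small draft model: agrees with the target except
    # at every fourth position, where it guesses wrong.
    if len(prefix) % 4 == 3:
        return VOCAB[(len(prefix) + 1) % len(VOCAB)]
    return target_next(prefix)

def draft_top2(prefix):
    # The draft's two best candidates; used to branch the token tree.
    best = draft_next(prefix)
    return [best, VOCAB[(VOCAB.index(best) - 1) % len(VOCAB)]]

def ngram_next(prefix):
    # Stage-2 speculator: an even cheaper model (think N-gram table)
    # that speculates tokens *for the draft model*.
    return VOCAB[0] if len(prefix) % 5 == 0 else draft_next(prefix)

def draft_chain(prefix, k):
    # Second speculative stage: the cheapest model proposes a few tokens,
    # then one "batched" draft pass verifies them, mirroring the outer loop.
    chain = []
    while len(chain) < k:
        cur = list(prefix) + chain
        guesses = []
        for _ in range(min(3, k - len(chain))):
            guesses.append(ngram_next(cur + guesses))
        CALLS["draft"] += 1  # one batched draft verification pass
        for i, guess in enumerate(guesses):
            truth = draft_next(cur + guesses[:i])
            chain.append(truth)            # the draft's token is always kept
            if truth != guess or len(chain) == k:
                break                      # discard the remaining guesses
    return chain

def build_tree(prefix, depth):
    # Restructure the speculative batch as a tree: branch on the draft's
    # top-2 at the root, then extend each branch with a draft chain.
    tree = {}
    for tok in draft_top2(prefix):
        node = tree.setdefault(tok, {})
        for nxt in draft_chain(list(prefix) + [tok], depth - 1):
            node = node.setdefault(nxt, {})
    return tree

def verify(prefix, tree):
    # Walk the tree with the target model. In a real system every node's
    # prefix is scored in ONE batched target pass; we count it as one call.
    CALLS["target"] += 1
    accepted, node = [], tree
    while True:
        t = target_next(list(prefix) + accepted)
        accepted.append(t)       # the target's token is always safe to emit
        if t not in node:
            return accepted      # mismatch: discard the rest of the tree
        node = node[t]

def generate(prompt, n_tokens, depth=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out += verify(out, build_tree(out, depth))
    return out[: len(prompt) + n_tokens]

if __name__ == "__main__":
    print(" ".join(generate(["the"], 12)))
    print(f"{CALLS['target']} batched target passes for 12 tokens "
          f"(vs. 12 without speculation); {CALLS['draft']} draft passes")
```

Running the sketch prints exactly the tokens plain greedy decoding would produce, while charging far fewer batched target passes: the tree lets several candidate branches share one verification pass, and the second stage applies the same trick recursively to make drafting itself cheap. The specific speedup printed is an artifact of the toy models, not the paper's 3.16x measurement.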
Suggested labels
{ "label-name": "Algorithm", "description": "Staged speculative decoding algorithm for LLM inference acceleration", "confidence": 91.15 }