
Paper page - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling #961

Open
ShellLM opened this issue Dec 21, 2024 · 1 comment
Labels
ai-leaderboards: leaderboards for LLMs and other ML models
code-generation: code generation models and tools like Copilot and Aider
llm: Large Language Models
llm-experiments: experiments with large language models
MachineLearning: ML models, training, and inference
ml-inference: running and serving ML models
New-Label: choose this option if the existing labels are insufficient to describe the content accurately
Papers: research papers

Comments


ShellLM commented Dec 21, 2024

Paper page - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Snippet

"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Published on Jul 31, 2024
Authors:

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini

Abstract

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget."

URL

https://huggingface.co/papers/2407.21787

Suggested labels

{'label-name': 'scaling-inference', 'label-description': 'Explores inference compute scaling with repeated sampling for language models.', 'gh-repo': 'papers', 'confidence': 62.25}

ShellLM added the ai-leaderboards, code-generation, llm, llm-experiments, MachineLearning, ml-inference, New-Label, and Papers labels on Dec 21, 2024

ShellLM commented Dec 21, 2024

Related content

#897 similarity score: 0.87
#456 similarity score: 0.86
#758 similarity score: 0.86
#686 similarity score: 0.86
#507 similarity score: 0.85
#769 similarity score: 0.85
