Standardizing Critique Prompts for CRITIC Benchmarks #187
Important
Review skipped
Review was skipped as selected files did not have any reviewable changes. Files selected but had no reviewable changes (1)
You can disable this status message by setting the
Walkthrough
The modifications introduce enhancements across multiple benchmarking scripts related to problem-solving, Q&A examples, and Python code snippets. These changes focus on adding new examples, refining existing responses, fixing bugs, and introducing new functionality to improve the variety and accuracy of the examples. This enriches the datasets and ensures more robust and comprehensive training and testing capabilities.
Changes
Codecov Report
All modified and coverable lines are covered by tests ✅
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (1 hunks)
Additional comments not posted (1)
agential/cog/prompts/agent/critic.py (1)
113-137: Verify the accuracy and plausibility of the new constant's content.
The new constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING contains a detailed critique of a proposed answer about men's figure skating in the Olympics. It's crucial to ensure that the historical facts presented are accurate and that the logic of the critique aligns with the intended educational or informational purpose of the content.
Verification successful
Historical Accuracy of AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING Verified
The historical claims in the constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING have been verified:
- Claim 1: Men's figure skating has been part of the Summer Olympics in 1908 and 1920. It is no longer a Summer Olympic sport.
- Claim 2: Men's figure skating was part of the 1908 Summer Olympics and has been part of the Winter Olympics since 1924.
The provided information in the constant is accurate and aligns with historical records.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the historical accuracy of the content in the new constant.
# Test: Search for historical references online. Expect: Accurate matches with the content.
echo "Please manually verify the historical accuracy of the content in AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING by consulting reliable historical sources or databases."
Length of output: 330
Script:
#!/bin/bash
# Description: Verify historical accuracy of figure skating in the Olympics.
# Verify claim 1: Men's figure skating has never been a summer Olympic sport.
echo "Verifying claim 1: Men's figure skating has never been a summer Olympic sport."
curl -s "https://en.wikipedia.org/wiki/Figure_skating_at_the_Olympic_Games" | grep -i "summer"
# Verify claim 2: Men's figure skating was part of the 1908 Summer Olympics and has been part of the Winter Olympics since 1924.
echo "Verifying claim 2: Men's figure skating was part of the 1908 Summer Olympics and has been part of the Winter Olympics since 1924."
curl -s "https://en.wikipedia.org/wiki/Figure_skating_at_the_Olympic_Games" | grep -i "1908"
curl -s "https://en.wikipedia.org/wiki/Figure_skating_at_the_Olympic_Games" | grep -i "1924"
Length of output: 39825
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- agential/cog/prompts/agent/critic.py (2 hunks)
- test.ipynb (1 hunks)
Files skipped from review due to trivial changes (1)
- test.ipynb
Files skipped from review as they are similar to previous changes (1)
- agential/cog/prompts/agent/critic.py
Actionable comments posted: 0
Outside diff range and nitpick comments (2)
agential/cog/prompts/benchmark/ambignq.py (2)
Line range hint 8-41: Ensure Consistency in Capitalization and Punctuation
The questions and answers in the AMBIGNQ_FEWSHOT_EXAMPLES_COT constant should maintain consistent capitalization and punctuation. For instance, "washington" should be capitalized as "Washington", and each entry should end with proper punctuation for clarity and professionalism.
- Q: When is the date the last time washington won stanley cup?
+ Q: When is the date the last time Washington won Stanley Cup?
Line range hint 86-158: Improve Clarity and Completeness in Reactive Examples
The AMBIGNQ_FEWSHOT_EXAMPLES_REACT constant contains a detailed step-by-step search process. However, some entries are incomplete, such as missing observations or thoughts. It's crucial to complete these sections to provide a full context and enhance the educational value of the examples.
- Thought 2:
+ Thought 2: The evidence clearly supports the 2018 victory date.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- agential/cog/prompts/agent/critic.py (2 hunks)
- agential/cog/prompts/benchmark/ambignq.py (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- agential/cog/prompts/agent/critic.py
Additional comments not posted (1)
agential/cog/prompts/benchmark/ambignq.py (1)
Line range hint 48-81: Check for Consistency and Clarity in Direct Examples
The AMBIGNQ_FEWSHOT_EXAMPLES_DIRECT constant contains concise examples. Ensure that all examples are clear and direct, maintaining a uniform format across all entries. This will enhance readability and ease of understanding.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/benchmark/hotpotqa.py (1 hunks)
Files not reviewed due to errors (1)
- agential/cog/prompts/benchmark/hotpotqa.py (no review received)
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/benchmark/ambignq.py (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- agential/cog/prompts/benchmark/ambignq.py
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- agential/cog/prompts/agent/critic.py (6 hunks)
- agential/cog/prompts/benchmark/gsm8k.py (1 hunks)
- agential/cog/prompts/benchmark/mbpp.py (2 hunks)
- agential/cog/prompts/benchmark/svamp.py (1 hunks)
- agential/cog/prompts/benchmark/tabmwp.py (1 hunks)
- agential/cog/prompts/benchmark/triviaqa.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- agential/cog/prompts/agent/critic.py
Additional comments not posted (10)
agential/cog/prompts/benchmark/triviaqa.py (3)
3-15: Well-structured detailed few-shot examples for TriviaQA.
The examples provided are comprehensive and align well with the intent to enhance the dataset with more contextual information. Great job on ensuring that each answer is detailed and informative.
17-29: Concise few-shot examples for TriviaQA.
The concise format of these examples is well-executed, providing straightforward answers to the questions. This format is useful for quick reference and aligns with the PR's goal of expanding the dataset.
31-55: Interactive and engaging few-shot examples for TriviaQA.
The REACT format effectively demonstrates a thought process and actions, which can be highly beneficial for training AI models to understand and generate human-like reasoning. This addition is innovative and aligns well with the PR's objectives.
agential/cog/prompts/benchmark/tabmwp.py (1)
3-102: Comprehensive and educational few-shot examples involving tables and Python code.
These examples are well-crafted to demonstrate how to interpret table data and write Python code to solve specific problems. This approach not only aligns with the PR's objectives but also adds significant educational value to the dataset.
agential/cog/prompts/benchmark/svamp.py (1)
3-44: Clear and correctly implemented few-shot examples for solving word problems.
The Python code snippets are correctly implemented and solve the word problems as stated. This addition enhances the dataset by providing practical examples of how to apply Python programming to solve real-world problems.
agential/cog/prompts/benchmark/mbpp.py (1)
Line range hint 1-91: Well-implemented Python programming tasks with accurate solutions.
The tasks are clearly defined, and the solutions are correctly implemented, including the corrected is_not_prime function. These additions and corrections significantly enhance the dataset's educational value for Python programmers.
agential/cog/prompts/benchmark/gsm8k.py (4)
3-18: Well-structured and correct calculation for wages and taxes.
The code is clear, variables are well-named, and the logic accurately reflects the problem statement. Good job on this example!
21-36: Correct and well-documented algebraic solution.
The step-by-step comments enhance readability and understanding. The logic and calculations are accurate.
39-50: Accurate calculation of uneaten pizza pieces.
This example clearly demonstrates the use of arithmetic operations to solve the problem. Variables are descriptive and the logic is sound.
53-71: Effective use of mathematical functions for cost calculation.
This example is well-implemented with a practical application of math.ceil and decision-making to find the cheapest option. The calculations and logic are correct.
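The reviewed gsm8k.py example is not reproduced on this page, so the following is only a sketch of the pattern the last comment describes: rounding a pack count up with math.ceil and then choosing the cheapest total. The `cheapest_option` helper, the pack sizes, and the prices are hypothetical and are not taken from the dataset.

```python
import math


def cheapest_option(
    units_needed: int, options: dict[str, tuple[int, float]]
) -> tuple[str, float]:
    """Pick the option with the lowest total cost.

    `options` maps a name to (units per pack, price per pack). Packs are
    sold whole, so the pack count is rounded up with math.ceil.
    """
    costs = {
        name: math.ceil(units_needed / pack_size) * pack_price
        for name, (pack_size, pack_price) in options.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


# Hypothetical numbers for illustration only.
options = {"small pack": (4, 3.0), "large pack": (10, 6.5)}
print(cheapest_option(25, options))  # ('large pack', 19.5)
```

With 25 units needed, 7 small packs cost 21.0 and 3 large packs cost 19.5, so the large pack wins even though it leaves 5 units unused.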
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (7 hunks)
Additional comments not posted (2)
agential/cog/prompts/agent/critic.py (2)
113-113: Ensure correct naming convention and usage of the constant.
The constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC is well-named, following Python uppercase convention for constants. It appears to be used correctly in the context of providing examples for critique. Ensure it's referenced correctly wherever used in the codebase.
113-113: Review the content for accuracy and clarity.
The examples provided under AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC are clear and seem logically consistent with the expected output. However, ensure that the factual information, especially historical or sports-related data, is accurate to avoid spreading misinformation.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (7 hunks)
Files skipped from review as they are similar to previous changes (1)
- agential/cog/prompts/agent/critic.py
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (7 hunks)
Additional comments not posted (1)
agential/cog/prompts/agent/critic.py (1)
Line range hint 112-119: New constant and examples added for AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC.
The newly added constant and examples for critiquing the plausibility and truthfulness of answers are well-integrated. The examples are detailed and cover different scenarios, which should enhance the model's ability to critique responses effectively.
- Correctness: The examples provided follow a consistent format and logically address the critique points.
- Maintainability: The structured format makes it easy to understand and maintain. Future additions or modifications can be made with minimal effort.
- Performance: Since this is primarily static content for instructional purposes, performance concerns are minimal.
Overall, this addition should enrich the critique capabilities of the model with specific focus areas like plausibility and truthfulness.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (7 hunks)
Additional comments not posted (2)
agential/cog/prompts/agent/critic.py (2)
113-113: Ensure proper usage and documentation for the new constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC.
The new constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC has been added to hold examples specifically for critiquing the topic of men's figure skating as a summer Olympic sport. It's crucial to ensure that this constant is properly used throughout the application and is well-documented to inform other developers of its purpose and usage context.
113-113: Verify the accuracy and clarity of the new examples added.
The newly added examples under AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC provide detailed critique scenarios. It is essential to verify that these examples are accurate, clear, and provide meaningful insights into the critique process. Additionally, check that these examples align with the intended educational or functional purpose of the module.
Verification successful
The newly added examples under AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC are accurate and clear. They provide meaningful insights into the critique process by evaluating the plausibility and truthfulness of the proposed answers.
- The example involving the question about men's figure skating as a summer Olympic sport is well-constructed and aligns with the module's educational purpose.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for consistency and clarity in the examples provided for critiquing men's figure skating as a summer Olympic sport.
# Test: Search for additional references or documentation related to the examples.
rg --type py 'men\'s figure skating as a summer Olympic sport' --context 5
Length of output: 93
Script:
#!/bin/bash
# Corrected script to search for references or documentation related to the examples.
rg --type py 'men\'s figure skating as a summer Olympic sport' --context 5
# Search for the definition of AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC and display its contents.
rg 'AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC' --context 10
Length of output: 93
Script:
#!/bin/bash
# Corrected script to search for references or documentation related to the examples.
rg --type py "men's figure skating as a summer Olympic sport" --context 5
# Search for the definition of AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC and display its contents.
rg 'AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC' --context 10
Length of output: 4086
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- agential/cog/prompts/benchmark/mbpp.py (1 hunks)
- agential/cog/prompts/benchmark/tabmwp.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
- agential/cog/prompts/benchmark/mbpp.py
- agential/cog/prompts/benchmark/tabmwp.py
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- agential/cog/prompts/benchmark/tabmwp.py (1 hunks)
- agential/cog/prompts/benchmark/triviaqa.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
- agential/cog/prompts/benchmark/tabmwp.py
- agential/cog/prompts/benchmark/triviaqa.py
Actionable comments posted: 3
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (9 hunks)
Files not summarized due to errors (1)
- agential/cog/prompts/agent/critic.py: Error: Message exceeds token limit
Additional comments not posted (9)
agential/cog/prompts/agent/critic.py (9)
277-278: Typographical Error: Inconsistent quotation marks.
Ensure consistent use of quotation marks for the question and answer strings.
- Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
- Proposed Answer: Let's think step by step. The eastern sector of the Colorado orogeny extends into an area with an elevation range of 1,800 to 7,000 ft. So the answer is: 1,800 to 7,000 ft.
+ Question: "What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?"
+ Proposed Answer: "Let's think step by step. The eastern sector of the Colorado orogeny extends into an area with an elevation range of 1,800 to 7,000 ft. So the answer is: 1,800 to 7,000 ft."
307-308: Typographical Error: Inconsistent quotation marks.
Ensure consistent use of quotation marks for the question and answer strings.
- Question: Musician and satirist Allie Goertz wrote a song about the "The Simpsons" character Milhouse, who Matt Groening named after who?
- Proposed Answer: Let's think step by step. Allie Goertz wrote a song about Milhouse. Milhouse is a character from "The Simpsons" created by Matt Groening. Matt Groening named Milhouse after Richard Nixon. So the answer is: Richard Nixon.
+ Question: "Musician and satirist Allie Goertz wrote a song about the 'The Simpsons' character Milhouse, who Matt Groening named after who?"
+ Proposed Answer: "Let's think step by step. Allie Goertz wrote a song about Milhouse. Milhouse is a character from 'The Simpsons' created by Matt Groening. Matt Groening named Milhouse after Richard Nixon. So the answer is: Richard Nixon."
337-338: Typographical Error: Inconsistent quotation marks.
Ensure consistent use of quotation marks for the question and answer strings.
- Question: Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?
- Proposed Answer: Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture.
+ Question: "Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?"
+ Proposed Answer: "Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture."
367-368: Typographical Error: Inconsistent quotation marks.
Ensure consistent use of quotation marks for the question and answer strings.
- Question: Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?
- Proposed Answer: Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture.
+ Question: "Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?"
+ Proposed Answer: "Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture."
397-398: Typographical Error: Inconsistent quotation marks.
Ensure consistent use of quotation marks for the question and answer strings.
- Question: What profession does Nicholas Ray and Elia Kazan have in common?
- Proposed Answer: Let's think step by step. Nicholas Ray and Elia Kazan both worked as directors, screenwriters, and actors. So the answer is: director, screenwriter, actor.
+ Question: "What profession does Nicholas Ray and Elia Kazan have in common?"
+ Proposed Answer: "Let's think step by step. Nicholas Ray and Elia Kazan both worked as directors, screenwriters, and actors. So the answer is: director, screenwriter, actor."
425-426: Typographical Error: Inconsistent quotation marks.
Ensure consistent use of quotation marks for the question and answer strings.
- Question: Which magazine was started first, Arthur's Magazine or First for Women?
- Proposed Answer: Let's think step by step. Arthur's Magazine was first published in 1890. First for Women was first published in 1989. 1890 (Arthur's Magazine) < 1989 (First for Women), so First for Women was started first. So the answer is:First for Women.
+ Question: "Which magazine was started first, Arthur's Magazine or First for Women?"
+ Proposed Answer: "Let's think step by step. Arthur's Magazine was first published in 1890. First for Women was first published in 1989. 1890 (Arthur's Magazine) < 1989 (First for Women), so First for Women was started first. So the answer is: First for Women."
251-251: Typographical Error: Missing closing quotation mark.
The string should be enclosed in triple quotes for consistency.
- Claim: Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.
+ Claim: Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."""
Likely invalid or redundant comment.
889-889: Typographical Error: Missing closing quotation mark.
The string should be enclosed in triple quotes for consistency.
- Question: Marco and his dad went strawberry picking. Marco's dad's strawberries weighed 11 pounds. If together their strawberries weighed 30 pounds. How much did Marco's strawberries weigh?
+ Question: "Marco and his dad went strawberry picking. Marco's dad's strawberries weighed 11 pounds. If together their strawberries weighed 30 pounds. How much did Marco's strawberries weigh?"
Likely invalid or redundant comment.
950-950: Typographical Error: Missing closing quotation mark.
The string should be enclosed in triple quotes for consistency.
- Question: Each cow pasture has 10 sprinklers. How many sprinklers are in 6 cow pastures?
+ Question: "Each cow pasture has 10 sprinklers. How many sprinklers are in 6 cow pastures?"
Likely invalid or redundant comment.
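As background on why these quotation-mark suggestions are flagged "likely invalid or redundant": text inside a Python triple-quoted string may contain apostrophes and double quotes without any extra wrapping. The sketch below illustrates this with a hypothetical constant, `EXAMPLE_CRITIC_PROMPT`, whose question line mirrors one quoted above and whose answer line is abridged for illustration; it is not the content of critic.py.

```python
# Hypothetical constant for illustration; not the actual value in critic.py.
# The apostrophes and the inner double quotes need no escaping here because
# only a run of three double quotes terminates the literal.
EXAMPLE_CRITIC_PROMPT = """Question: Which magazine was started first, Arthur's Magazine or First for Women?
Proposed Answer: Let's think step by step. So the answer is: "First for Women"."""

lines = EXAMPLE_CRITIC_PROMPT.splitlines()
print(len(lines))  # 2
print(lines[0])
```

Wrapping each Question and Proposed Answer in additional double quotes would therefore change the prompt text the model sees rather than fix a syntax problem.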
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- agential/cog/prompts/agent/critic.py (10 hunks)
Files not summarized due to errors (1)
- agential/cog/prompts/agent/critic.py: Error: Message exceeds token limit
Additional comments not posted (1)
agential/cog/prompts/agent/critic.py (1)
277-277: Typographical Error: Inconsistent quotation marks.
The string should be enclosed in triple quotes for consistency with other examples.
- HOTPOTQA_FEWSHOT_EXAMPLES_CRITIC = """Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
+ HOTPOTQA_FEWSHOT_EXAMPLES_CRITIC = """Question: "What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?"
Likely invalid or redundant comment.
LET'S FUCKING GO!!!!!!!!!!
Issue
Topic: Standardizing Critique Prompts for CRITIC Benchmarks
For CRITIC, we need to craft critique prompts distinct from few-shot examples. This distinction is essential for each benchmark according to the criteria outlined in Figure 1:
The presence of a green checkmark in the "Matches Figure 1" column indicates both criteria are satisfied. The "Issues" column highlights discrepancies between the two sets of benchmark few-shot examples.
Approach to Generating Examples
I generate examples with the assistance of ChatGPT:
Proposed Changes
To address these discrepancies effectively:
For benchmarks where num few-shots > num critique few-shots:
For benchmarks where num few-shots == num critique few-shots AND no checkmark above (Table 1):
For benchmarks where num few-shots < num critique few-shots:
This PR aims to standardize critique prompts to align with the requirements outlined in Figure 1, ensuring coherence and accuracy in the CRITIC benchmarks.
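The three cases above depend on comparing how many examples each prompt constant holds. The following is a rough sketch of that comparison, assuming every example begins with a line starting with "Question:" (some benchmark files use a different prefix such as "Q:", hence the `marker` parameter). The helper names and the two sample prompts are hypothetical.

```python
def count_examples(prompt: str, marker: str = "Question:") -> int:
    """Count examples by counting the lines that start with `marker`."""
    return sum(1 for line in prompt.splitlines() if line.startswith(marker))


def compare_counts(fewshot: str, critique: str, marker: str = "Question:") -> str:
    """Report which of the three cases a benchmark falls into."""
    n_few = count_examples(fewshot, marker)
    n_crit = count_examples(critique, marker)
    if n_few > n_crit:
        return "num few-shots > num critique few-shots"
    if n_few == n_crit:
        return "num few-shots == num critique few-shots"
    return "num few-shots < num critique few-shots"


# Hypothetical prompts for illustration only.
FEWSHOT = "Question: A?\nAnswer: a\n\nQuestion: B?\nAnswer: b"
CRITIQUE = "Question: A?\nProposed Answer: a\nCritique: plausible"

print(compare_counts(FEWSHOT, CRITIQUE))  # num few-shots > num critique few-shots
```

The count alone does not tell whether the critique examples reuse the same questions as the few-shot examples, so the "Matches Figure 1" check would still need a content comparison.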
Summary by CodeRabbit
New Features
Bug Fixes
is_not_prime function.
Enhancements