Standardizing Critique Prompts for CRITIC Benchmarks #187

tedasdf · 2024-06-23T13:35:49Z

Issue

Topic: Standardizing Critique Prompts for CRITIC Benchmarks

For CRITIC, we need to craft critique prompts distinct from few-shot examples. This distinction is essential for each benchmark according to the criteria outlined in Figure 1:

The number of critique few-shot examples should equal the number of few-shot examples for each benchmark.
Each critique few-shot example within a benchmark must use the same question as every example in the few-shot examples.

The presence of a green checkmark in the "Matches Figure 1" column indicates both criteria are satisfied. The "Issues" column highlights discrepancies between the two sets of benchmark few-shot examples.

Approach to Generating Examples

I generate examples with the assistance of ChatGPT:

I initially consult ChatGPT to brainstorm initial thoughts on the question.
Using LangChain or similar tools, I perform Wikipedia searches based on the benchmark.
Based on the results from these tools, I formulate the observation statement.
For mathematical benchmarks, I verify the calculations using a calculator.
For query benchmarks, I validate the queries using the wikiquery function.

Proposed Changes

To address these discrepancies effectively:

For benchmarks where `num few-shots > num critique few-shots`:

Increase the number of critique examples.
Ensure all critique examples match the few-shot examples in terms of ordering and question usage.

For benchmarks where `num few-shots == num critique few-shots` AND no checkmark above (Table 1):

Replace existing critique examples with new ones.
Maintain consistent ordering with few-shot examples.

For benchmarks where `num few-shots < num critique few-shots`:

Reduce the number of critique examples.
Ensure all examples maintain consistent question usage and ordering.

This PR aims to standardize critique prompts to align with the requirements outlined in Figure 1, ensuring coherence and accuracy in the CRITIC benchmarks.

Summary by CodeRabbit

New Features
- Added new question-answer pairs and examples across various benchmarks, including AmbignQ, HotpotQA, SVAMP, TABMWP, and TriviaQA.
- Introduced new functions for specific tasks, such as checking bit positions and identifying long words in strings.
Bug Fixes
- Corrected logical errors in the is_not_prime function.
Enhancements
- Updated existing few-shot examples with more detailed information and refined responses.
- Enhanced problem-solving examples with structured approaches and Python code snippets for better clarity.

coderabbitai · 2024-06-23T13:35:56Z

Important

Review skipped

Review was skipped as selected files did not have any reviewable changes.

Files selected but had no reviewable changes (1)

agential/cog/prompts/agent/critic.py

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The modifications introduce enhancements across multiple benchmarking scripts related to problem-solving, Q&A examples, and Python code snippets. These changes focus on adding new examples, refining existing responses, fixing bugs, and introducing new functionality to improve the variety and accuracy of the examples. This enriches the datasets and ensures more robust and comprehensive training and testing capabilities.

Changes

Files/Groups	Change Summaries
`.../benchmark/ambignq.py`	Added new QA pairs, updated existing answers with detailed info, and refined overall responses.
`.../benchmark/hotpotqa.py`	Added new few-shot examples in formats: CoT, Direct, React and modified existing CoT examples.
`.../benchmark/mbpp.py`	Corrected `is_not_prime` function, added functions `differ_at_one_bit_pos(a, b)` and `find_char_long(s)`.
`.../benchmark/svamp.py`	Added new few-shot examples with Python code snippets for math problems, expanded existing examples.
`.../benchmark/tabmwp.py`	Introduced new few-shot examples with tables, included questions requiring Python code solutions, replaced previous examples.
`.../benchmark/triviaqa.py`	Added new few-shot examples in styles: CoT, Direct, React with questions on historical events and informative answers.

Poem

In bytes and whispers coding grows,
New questions bloom where logic flows,
From Hotpot's thoughts to Trivia's lore,
Each line refined to learn much more.
Bugs squashed, new functions rise,
Solutions gleam before our eyes.
🐰✨🌟

Tip

Early access features: enabled

We are currently testing the following features in early access:

OpenAI gpt-4o model for code reviews and chat: OpenAI claims that this model is better at understanding and generating code than the previous models. We seek your feedback over the next few weeks before making it generally available.

Note:

You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.
Please join our Discord Community to provide feedback and report issues.
OSS projects are currently opted into early access features by default.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2024-06-23T13:38:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Files	Coverage Δ
agential/cog/agent/reflexion.py	`90.24% <ø> (ø)`
agential/cog/prompts/agent/critic.py	`100.00% <ø> (ø)`

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a6b6d41 and 3af88b3.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (1 hunks)

Additional comments not posted (1)

agential/cog/prompts/agent/critic.py (1)
113-137: Verify the accuracy and plausibility of the new constant's content.

The new constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING contains a detailed critique of a proposed answer about men's figure skating in the Olympics. It's crucial to ensure that the historical facts presented are accurate and that the logic of the critique aligns with the intended educational or informational purpose of the content.

Verification successful

Historical Accuracy of AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING Verified

The historical claims in the constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING have been verified:

Claim 1: Men's figure skating has been part of the Summer Olympics in 1908 and 1920. It is no longer a Summer Olympic sport.

Claim 2: Men's figure skating was part of the 1908 Summer Olympics and has been part of the Winter Olympics since 1924.

The provided information in the constant is accurate and aligns with historical records.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the historical accuracy of the content in the new constant.

# Test: Search for historical references online. Expect: Accurate matches with the content.
echo "Please manually verify the historical accuracy of the content in AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC_TESTING by consulting reliable historical sources or databases."
Length of output: 330

Script:
#!/bin/bash
# Description: Verify historical accuracy of figure skating in the Olympics.

# Verify claim 1: Men's figure skating has never been a summer Olympic sport.
echo "Verifying claim 1: Men's figure skating has never been a summer Olympic sport."
curl -s "https://en.wikipedia.org/wiki/Figure_skating_at_the_Olympic_Games" | grep -i "summer"

# Verify claim 2: Men's figure skating was part of the 1908 Summer Olympics and has been part of the Winter Olympics since 1924.
echo "Verifying claim 2: Men's figure skating was part of the 1908 Summer Olympics and has been part of the Winter Olympics since 1924."
curl -s "https://en.wikipedia.org/wiki/Figure_skating_at_the_Olympic_Games" | grep -i "1908"
curl -s "https://en.wikipedia.org/wiki/Figure_skating_at_the_Olympic_Games" | grep -i "1924"
Length of output: 39825

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3af88b3 and 3915514.

Files selected for processing (2)

agential/cog/prompts/agent/critic.py (2 hunks)
test.ipynb (1 hunks)

Files skipped from review due to trivial changes (1)

test.ipynb

Files skipped from review as they are similar to previous changes (1)

agential/cog/prompts/agent/critic.py

coderabbitai

Actionable comments posted: 0

Outside diff range and nitpick comments (2)

agential/cog/prompts/benchmark/ambignq.py (2)
Line range hint 8-41: Ensure Consistency in Capitalization and Punctuation

The questions and answers in the AMBIGNQ_FEWSHOT_EXAMPLES_COT constant should maintain consistent capitalization and punctuation. For instance, "washington" should be capitalized as "Washington", and each entry should end with proper punctuation for clarity and professionalism.
- Q: When is the date the last time washington won stanley cup?
+ Q: When is the date the last time Washington won Stanley Cup?
Line range hint 86-158: Improve Clarity and Completeness in Reactive Examples

The AMBIGNQ_FEWSHOT_EXAMPLES_REACT constant contains a detailed step-by-step search process. However, some entries are incomplete, such as missing observations or thoughts. It's crucial to complete these sections to provide a full context and enhance the educational value of the examples.
- Thought 2:
+ Thought 2: The evidence clearly supports the 2018 victory date.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3915514 and 8013d7c.

Files selected for processing (2)

agential/cog/prompts/agent/critic.py (2 hunks)
agential/cog/prompts/benchmark/ambignq.py (6 hunks)

Files skipped from review as they are similar to previous changes (1)

agential/cog/prompts/agent/critic.py

Additional comments not posted (1)

agential/cog/prompts/benchmark/ambignq.py (1)

Line range hint 48-81: Check for Consistency and Clarity in Direct Examples

The AMBIGNQ_FEWSHOT_EXAMPLES_DIRECT constant contains concise examples. Ensure that all examples are clear and direct, maintaining a uniform format across all entries. This will enhance readability and ease of understanding.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8013d7c and 2279887.

Files selected for processing (1)

agential/cog/prompts/benchmark/hotpotqa.py (1 hunks)

Files not reviewed due to errors (1)

agential/cog/prompts/benchmark/hotpotqa.py (no review received)

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2279887 and d44c589.

Files selected for processing (1)

agential/cog/prompts/benchmark/ambignq.py (6 hunks)

Files skipped from review as they are similar to previous changes (1)

agential/cog/prompts/benchmark/ambignq.py

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d44c589 and ff91ece.

Files selected for processing (6)

agential/cog/prompts/agent/critic.py (6 hunks)
agential/cog/prompts/benchmark/gsm8k.py (1 hunks)
agential/cog/prompts/benchmark/mbpp.py (2 hunks)
agential/cog/prompts/benchmark/svamp.py (1 hunks)
agential/cog/prompts/benchmark/tabmwp.py (1 hunks)
agential/cog/prompts/benchmark/triviaqa.py (1 hunks)

Files skipped from review as they are similar to previous changes (1)

agential/cog/prompts/agent/critic.py

Additional comments not posted (10)

agential/cog/prompts/benchmark/triviaqa.py (3)

3-15: Well-structured detailed few-shot examples for TriviaQA.

The examples provided are comprehensive and align well with the intent to enhance the dataset with more contextual information. Great job on ensuring that each answer is detailed and informative.

17-29: Concise few-shot examples for TriviaQA.

The concise format of these examples is well-executed, providing straightforward answers to the questions. This format is useful for quick reference and aligns with the PR's goal of expanding the dataset.

31-55: Interactive and engaging few-shot examples for TriviaQA.

The REACT format effectively demonstrates a thought process and actions, which can be highly beneficial for training AI models to understand and generate human-like reasoning. This addition is innovative and aligns well with the PR's objectives.

agential/cog/prompts/benchmark/tabmwp.py (1)

3-102: Comprehensive and educational few-shot examples involving tables and Python code.

These examples are well-crafted to demonstrate how to interpret table data and write Python code to solve specific problems. This approach not only aligns with the PR's objectives but also adds significant educational value to the dataset.

agential/cog/prompts/benchmark/svamp.py (1)

3-44: Clear and correctly implemented few-shot examples for solving word problems.

The Python code snippets are correctly implemented and solve the word problems as stated. This addition enhances the dataset by providing practical examples of how to apply Python programming to solve real-world problems.

agential/cog/prompts/benchmark/mbpp.py (1)

Line range hint 1-91: Well-implemented Python programming tasks with accurate solutions.

The tasks are clearly defined, and the solutions are correctly implemented, including the corrected is_not_prime function. These additions and corrections significantly enhance the dataset's educational value for Python programmers.

agential/cog/prompts/benchmark/gsm8k.py (4)

3-18: Well-structured and correct calculation for wages and taxes.

The code is clear, variables are well-named, and the logic accurately reflects the problem statement. Good job on this example!

21-36: Correct and well-documented algebraic solution.

The step-by-step comments enhance readability and understanding. The logic and calculations are accurate.

39-50: Accurate calculation of uneaten pizza pieces.

This example clearly demonstrates the use of arithmetic operations to solve the problem. Variables are descriptive and the logic is sound.

53-71: Effective use of mathematical functions for cost calculation.

This example is well-implemented with a practical application of math.ceil and decision-making to find the cheapest option. The calculations and logic are correct.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ff91ece and 1b468c8.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (7 hunks)

Additional comments not posted (2)

agential/cog/prompts/agent/critic.py (2)

113-113: Ensure correct naming convention and usage of the constant.

The constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC is well-named, following Python uppercase convention for constants. It appears to be used correctly in the context of providing examples for critique. Ensure it's referenced correctly wherever used in the codebase.

113-113: Review the content for accuracy and clarity.

The examples provided under AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC are clear and seem logically consistent with the expected output. However, ensure that the factual information, especially historical or sports-related data, is accurate to avoid spreading misinformation.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b468c8 and 9ebddbc.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (7 hunks)

Files skipped from review as they are similar to previous changes (1)

agential/cog/prompts/agent/critic.py

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9ebddbc and 9b57be6.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (7 hunks)

Additional comments not posted (1)

agential/cog/prompts/agent/critic.py (1)

Line range hint 112-119: New constant and examples added for AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC.

The newly added constant and examples for critiquing the plausibility and truthfulness of answers are well-integrated. The examples are detailed and cover different scenarios, which should enhance the model's ability to critique responses effectively.

Correctness: The examples provided follow a consistent format and logically address the critique points.

Maintainability: The structured format makes it easy to understand and maintain. Future additions or modifications can be made with minimal effort.

Performance: Since this is primarily static content for instructional purposes, performance concerns are minimal.

Overall, this addition should enrich the critique capabilities of the model with specific focus areas like plausibility and truthfulness.

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9b57be6 and 2a0b449.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (7 hunks)

Additional comments not posted (2)

agential/cog/prompts/agent/critic.py (2)
113-113: Ensure proper usage and documentation for the new constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC.

The new constant AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC has been added to hold examples specifically for critiquing the topic of men's figure skating as a summer Olympic sport. It's crucial to ensure that this constant is properly used throughout the application and is well-documented to inform other developers of its purpose and usage context.

113-113: Verify the accuracy and clarity of the new examples added.

The newly added examples under AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC provide detailed critique scenarios. It is essential to verify that these examples are accurate, clear, and provide meaningful insights into the critique process. Additionally, check that these examples align with the intended educational or functional purpose of the module.

Verification successful

The newly added examples under AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC are accurate and clear. They provide meaningful insights into the critique process by evaluating the plausibility and truthfulness of the proposed answers.

The example involving the question about men's figure skating as a summer Olympic sport is well-constructed and aligns with the module's educational purpose.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check for consistency and clarity in the examples provided for critiquing men's figure skating as a summer Olympic sport.

# Test: Search for additional references or documentation related to the examples.
rg --type py 'men\'s figure skating as a summer Olympic sport' --context 5
Length of output: 93

Script:
#!/bin/bash
# Corrected script to search for references or documentation related to the examples.
rg --type py 'men\'s figure skating as a summer Olympic sport' --context 5

# Search for the definition of AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC and display its contents.
rg 'AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC' --context 10
Length of output: 93

Script:
#!/bin/bash
# Corrected script to search for references or documentation related to the examples.
rg --type py "men's figure skating as a summer Olympic sport" --context 5

# Search for the definition of AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC and display its contents.
rg 'AMBIGNQ_FEWSHOT_EXAMPLES_CRITIC' --context 10
Length of output: 4086

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4d6708e and 7a60ddc.

Files selected for processing (2)

agential/cog/prompts/benchmark/mbpp.py (1 hunks)
agential/cog/prompts/benchmark/tabmwp.py (1 hunks)

Files skipped from review as they are similar to previous changes (2)

agential/cog/prompts/benchmark/mbpp.py
agential/cog/prompts/benchmark/tabmwp.py

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7a60ddc and 0b24502.

Files selected for processing (2)

agential/cog/prompts/benchmark/tabmwp.py (1 hunks)
agential/cog/prompts/benchmark/triviaqa.py (1 hunks)

Files skipped from review as they are similar to previous changes (2)

agential/cog/prompts/benchmark/tabmwp.py
agential/cog/prompts/benchmark/triviaqa.py

coderabbitai

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0b24502 and 4825e75.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (9 hunks)

Files not summarized due to errors (1)

agential/cog/prompts/agent/critic.py: Error: Message exceeds token limit

Additional comments not posted (9)

agential/cog/prompts/agent/critic.py (9)

277-278: Typographical Error: Inconsistent quotation marks.

Ensure consistent use of quotation marks for the question and answer strings.

- Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
- Proposed Answer: Let's think step by step. The eastern sector of the Colorado orogeny extends into an area with an elevation range of 1,800 to 7,000 ft. So the answer is: 1,800 to 7,000 ft.
+ Question: "What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?"
+ Proposed Answer: "Let's think step by step. The eastern sector of the Colorado orogeny extends into an area with an elevation range of 1,800 to 7,000 ft. So the answer is: 1,800 to 7,000 ft."

307-308: Typographical Error: Inconsistent quotation marks.

Ensure consistent use of quotation marks for the question and answer strings.

- Question: Musician and satirist Allie Goertz wrote a song about the "The Simpsons" character Milhouse, who Matt Groening named after who?
- Proposed Answer: Let's think step by step. Allie Goertz wrote a song about Milhouse. Milhouse is a character from "The Simpsons" created by Matt Groening. Matt Groening named Milhouse after Richard Nixon. So the answer is: Richard Nixon.
+ Question: "Musician and satirist Allie Goertz wrote a song about the 'The Simpsons' character Milhouse, who Matt Groening named after who?"
+ Proposed Answer: "Let's think step by step. Allie Goertz wrote a song about Milhouse. Milhouse is a character from 'The Simpsons' created by Matt Groening. Matt Groening named Milhouse after Richard Nixon. So the answer is: Richard Nixon."

337-338: Typographical Error: Inconsistent quotation marks.

Ensure consistent use of quotation marks for the question and answer strings.

- Question: Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?
- Proposed Answer: Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture.
+ Question: "Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?"
+ Proposed Answer: "Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture."

367-368: Typographical Error: Inconsistent quotation marks.

Ensure consistent use of quotation marks for the question and answer strings.

- Question: Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?
- Proposed Answer: Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture.
+ Question: "Which documentary is about Finnish rock groups, Adam Clayton Powell or The Saimaa Gesture?"
+ Proposed Answer: "Let's think step by step. Adam Clayton Powell is a name associated with an American politician and civil rights leader. The Saimaa Gesture sounds more like a title related to Finnish culture. Therefore, The Saimaa Gesture is about Finnish rock groups. So the answer is: The Saimaa Gesture."

397-398: Typographical Error: Inconsistent quotation marks.

Ensure consistent use of quotation marks for the question and answer strings.

- Question: What profession does Nicholas Ray and Elia Kazan have in common?
- Proposed Answer: Let's think step by step. Nicholas Ray and Elia Kazan both worked as directors, screenwriters, and actors. So the answer is: director, screenwriter, actor.
+ Question: "What profession does Nicholas Ray and Elia Kazan have in common?"
+ Proposed Answer: "Let's think step by step. Nicholas Ray and Elia Kazan both worked as directors, screenwriters, and actors. So the answer is: director, screenwriter, actor."

425-426: Typographical Error: Inconsistent quotation marks.

Ensure consistent use of quotation marks for the question and answer strings.

- Question: Which magazine was started first, Arthur's Magazine or First for Women?
- Proposed Answer: Let's think step by step. Arthur's Magazine was first published in 1890. First for Women was first published in 1989. 1890 (Arthur's Magazine) < 1989 (First for Women), so First for Women was started first. So the answer is:First for Women.
+ Question: "Which magazine was started first, Arthur's Magazine or First for Women?"
+ Proposed Answer: "Let's think step by step. Arthur's Magazine was first published in 1890. First for Women was first published in 1989. 1890 (Arthur's Magazine) < 1989 (First for Women), so First for Women was started first. So the answer is: First for Women."

251-251: Typographical Error: Missing closing quotation mark.

The string should be enclosed in triple quotes for consistency.

- Claim: Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.
+ Claim: Nikolaj Coster-Waldau worked with the Fox Broadcasting Company."""

Likely invalid or redundant comment.

889-889: Typographical Error: Missing closing quotation mark.

The string should be enclosed in triple quotes for consistency.

- Question: Marco and his dad went strawberry picking. Marco's dad's strawberries weighed 11 pounds. If together their strawberries weighed 30 pounds. How much did Marco's strawberries weigh?
+ Question: "Marco and his dad went strawberry picking. Marco's dad's strawberries weighed 11 pounds. If together their strawberries weighed 30 pounds. How much did Marco's strawberries weigh?"

Likely invalid or redundant comment.

950-950: Typographical Error: Missing closing quotation mark.

The string should be enclosed in triple quotes for consistency.

- Question: Each cow pasture has 10 sprinklers. How many sprinklers are in 6 cow pastures?
+ Question: "Each cow pasture has 10 sprinklers. How many sprinklers are in 6 cow pastures?"

Likely invalid or redundant comment.

agential/cog/prompts/agent/critic.py

coderabbitai

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4825e75 and 5abbbc9.

Files selected for processing (1)

agential/cog/prompts/agent/critic.py (10 hunks)

Files not summarized due to errors (1)

agential/cog/prompts/agent/critic.py: Error: Message exceeds token limit

Additional comments not posted (1)

agential/cog/prompts/agent/critic.py (1)
277-277: Typographical Error: Inconsistent quotation marks.

The string should be enclosed in triple quotes for consistency with other examples.
- HOTPOTQA_FEWSHOT_EXAMPLES_CRITIC = """Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
+ HOTPOTQA_FEWSHOT_EXAMPLES_CRITIC = """Question: "What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?"
Likely invalid or redundant comment.

agential/cog/prompts/agent/critic.py

alckasoc

LET'S FUCKING GO!!!!!!!!!!

'add testing examples'

3af88b3

coderabbitai bot reviewed Jun 23, 2024

View reviewed changes

'adding another example testing'

3915514

coderabbitai bot reviewed Jun 23, 2024

View reviewed changes

tedasdf added 2 commits June 23, 2024 22:52

AmbigQa Few examples

289f365

Update ambignq.py

8013d7c

coderabbitai bot reviewed Jun 23, 2024

View reviewed changes

Update hotpotqa.py

2279887

coderabbitai bot reviewed Jun 26, 2024

View reviewed changes

Update ambignq.py

d44c589

coderabbitai bot reviewed Jun 26, 2024

View reviewed changes

tedasdf added 4 commits June 26, 2024 17:24

Update trivia and gsm

664d18f

update svamp.py

fd87e44

update tabmwp

72f5f1b

minor change

ff91ece

coderabbitai bot reviewed Jun 26, 2024

View reviewed changes

GaoDalie added 2 commits June 27, 2024 15:27

first example on git add .

07530ad

first example on GSK.

1b468c8

coderabbitai bot reviewed Jun 27, 2024

View reviewed changes

Second example on GSK

9ebddbc

coderabbitai bot reviewed Jun 27, 2024

View reviewed changes

GaoDalie added 2 commits June 27, 2024 16:14

3 and 4 example on GSK.

5c6e10f

change 5 and add 6 example on GSK

9b57be6

coderabbitai bot reviewed Jun 27, 2024

View reviewed changes

7 example on GSK.

2a0b449

coderabbitai bot reviewed Jun 27, 2024

View reviewed changes

GaoDalie added 2 commits June 27, 2024 18:30

8 example on GSK

48e31d3

delete the old example

3943b93

coderabbitai bot reviewed Jul 1, 2024

View reviewed changes

fix all the issues

0b24502

coderabbitai bot reviewed Jul 1, 2024

View reviewed changes

GaoDalie and others added 13 commits July 1, 2024 16:28

fix prime function in mbpp

0cc26e2

format the file

ba1f3d3

Update ambignq.py

19dcee3

Update fever.py

3839a8c

Update triviaqa.py

15db2b5

Update hotpotqa.py

a60db43

Update tabmwp.py

a11be5d

Update svamp.py

1b25f55

Update mbpp.py

52818fc

Update gsm8k.py

48698b4

Update critic.py

4295ab6

Update critic.py

cdd5d3b

Update critic.py

4825e75

coderabbitai bot reviewed Jul 2, 2024

View reviewed changes

agential/cog/prompts/agent/critic.py Outdated Show resolved Hide resolved

agential/cog/prompts/agent/critic.py Show resolved Hide resolved

Update critic.py

5abbbc9

coderabbitai bot reviewed Jul 2, 2024

View reviewed changes

agential/cog/prompts/agent/critic.py Show resolved Hide resolved

agential/cog/prompts/agent/critic.py Show resolved Hide resolved

alckasoc added 8 commits July 1, 2024 20:25

Update critic.py

8f83801

Update critic.py

525f61b

Update critic.py

d0f0648

Update critic.py

3ee6a92

Update test.ipynb

179768c

Delete test.ipynb

1c459bd

Delete test.py

3380db7

Update critic.py

ff6b96a

alckasoc approved these changes Jul 2, 2024

View reviewed changes

alckasoc merged commit 601d318 into main Jul 2, 2024
4 checks passed

alckasoc deleted the tedasdf/adding_examples branch July 2, 2024 04:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardizing Critique Prompts for CRITIC Benchmarks #187

Standardizing Critique Prompts for CRITIC Benchmarks #187

tedasdf commented Jun 23, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jun 23, 2024 •

edited

Loading

Review skipped

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (`.coderabbit.yaml`)

Documentation and Community

codecov bot commented Jun 23, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

alckasoc left a comment

Standardizing Critique Prompts for CRITIC Benchmarks #187

Standardizing Critique Prompts for CRITIC Benchmarks #187

Conversation

tedasdf commented Jun 23, 2024 • edited by coderabbitai bot Loading

Issue

Topic: Standardizing Critique Prompts for CRITIC Benchmarks

Approach to Generating Examples

Proposed Changes

For benchmarks where num few-shots > num critique few-shots:

For benchmarks where num few-shots == num critique few-shots AND no checkmark above (Table 1):

For benchmarks where num few-shots < num critique few-shots:

Summary by CodeRabbit

coderabbitai bot commented Jun 23, 2024 • edited Loading

Review skipped

Walkthrough

Changes

Poem

Chat

CodeRabbit Commands (invoked as PR comments)

CodeRabbit Configration File (.coderabbit.yaml)

Documentation and Community

codecov bot commented Jun 23, 2024 • edited Loading

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

alckasoc left a comment

Choose a reason for hiding this comment

tedasdf commented Jun 23, 2024 •

edited by coderabbitai bot

Loading

For benchmarks where `num few-shots > num critique few-shots`:

For benchmarks where `num few-shots == num critique few-shots` AND no checkmark above (Table 1):

For benchmarks where `num few-shots < num critique few-shots`:

coderabbitai bot commented Jun 23, 2024 •

edited

Loading

CodeRabbit Configration File (`.coderabbit.yaml`)

codecov bot commented Jun 23, 2024 •

edited

Loading