
QA for Self-Refine (HotpotQA, FEVER, AmbigNQ, TriviaQA) #227

Merged
merged 32 commits into from
Jul 13, 2024

Conversation

alckasoc
Member

@alckasoc alckasoc commented Jul 12, 2024

🤔 Reasoning

Explain the purpose of this PR...

🚧 Changes

Describe the changes made...

✅ PR Checklist

  • Using this PR template?
  • Linked issue?
  • Added feature?
    • Added/updated docs?
    • Added/updated tests?

Summary by CodeRabbit

  • New Features

    • Introduced self-refine agent strategies for QA tasks with specific benchmarks: HotpotQA, FEVER, TriviaQA, and AmbigNQ.
    • Added new instructions, critique prompts, and refine prompts for various benchmarks.
  • Bug Fixes

    • Updated methods in strategies to use improved comparison logic for critiques, enhancing accuracy.
  • Tests

    • Added extensive unit tests for new self-refine QA strategies and various scenarios in the SelfRefineAgent class.
  • Documentation

    • Updated the self_refine.ipynb notebook with new instructions and examples for improved user guidance.

@alckasoc alckasoc added enhancement New feature or request add-benchmark Adding support for a benchmark labels Jul 12, 2024
@alckasoc alckasoc added this to the Self-Refine milestone Jul 12, 2024
@alckasoc alckasoc self-assigned this Jul 12, 2024
Contributor

coderabbitai bot commented Jul 12, 2024

Walkthrough

The updates enhance the self-refinement capabilities of QA benchmarks by adding new prompts, instructions, and examples for various datasets. Debugging print statements are included, new critique generation strategies are introduced, and the functionality of the SelfRefineAgent is expanded with default values and new parameters. Comprehensive testing ensures robust handling of these enhancements.

Changes

File(s) Change Summary
agential/cog/self_refine/factory.py Added and modified prompts, instructions, strategies, and examples for HOTPOTQA, FEVER, TRIVIAQA, and AMBIGNQ benchmarks. Updated the SELF_REFINE_BENCHMARK_FEWSHOTS dictionary.
agential/cog/self_refine/functional.py Enhanced _prompt_agent, _prompt_critique, and _prompt_refine functions with debugging print statements.
agential/cog/self_refine/strategies/math.py Updated generate_critique method to use EM function from agential.eval.em for answer comparison.
agential/cog/self_refine/strategies/qa.py Introduced SelfRefineQAStrategy and its subclasses for HOTPOTQA, FEVER, TRIVIAQA, and AmbigNQ. Added methods for generating critiques, updating answers, and halting conditions.
agential/cog/reflexion/agent.py Clarified return type annotations for the generate method.
agential/cog/self_refine/agent.py Added default values and a fewshot_type parameter to the generate method. Included conditional logic for setting values.
notebooks/self_refine.ipynb Added instructions, critique instructions, examples, and refinement instructions for HOTPOTQA, FEVER, TRIVIAQA, and AMBIGNQ benchmarks.
tests/cog/self_refine/test_agent.py Added tests for different fewshot_type values and unsupported fewshot_type scenarios.
tests/cog/self_refine/strategies/test_math.py Added new strategies SelfRefineSVAMPStrategy and SelfRefineTabMWPStrategy. Updated test cases accordingly.
tests/cog/self_refine/strategies/test_qa.py Added unit tests for SelfRefineQAStrategy and its subclasses, including methods for critique generation, updating answers, and halting conditions.
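The EM-based answer comparison mentioned for strategies/math.py can be sketched roughly as follows. The normalization steps (lowercasing, stripping punctuation and articles) follow the common SQuAD-style exact-match convention and are an assumption here, not necessarily what `agential.eval.em` actually implements:

```python
# Hypothetical sketch of an exact-match (EM) comparison, as a critique
# step might use it to decide whether two answers agree. The
# normalization is the conventional SQuAD-style recipe, assumed for
# illustration only.
import re
import string


def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def EM(answer: str, key: str) -> bool:
    """Return True when both answers match after normalization."""
    return normalize_answer(answer) == normalize_answer(key)
```

Comparing normalized forms rather than raw strings keeps the critique loop from flagging purely cosmetic differences (casing, articles, trailing punctuation) as disagreements.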

Poem

In the world of code so fine,
QA benches now align,
With prompts and critiques in line,
Debugging shines, errors decline.
New strategies take their stance,
Enhancing answers with a glance,
Self-refine, a true dance! πŸ‡βœ¨




codecov bot commented Jul 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Files Coverage Δ
agential/cog/reflexion/agent.py 92.07% <ø> (ø)
agential/cog/self_refine/agent.py 100.00% <100.00%> (ø)
agential/cog/self_refine/factory.py 96.96% <100.00%> (+31.34%) ⬆️
agential/cog/self_refine/prompts.py 100.00% <100.00%> (ø)
agential/cog/self_refine/strategies/math.py 100.00% <100.00%> (ø)
agential/cog/self_refine/strategies/qa.py 100.00% <100.00%> (ø)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between be7526c and e40aa57.

Files selected for processing (1)
  • agential/cog/self_refine/strategies/qa.py (1 hunks)
Additional comments not posted (11)
agential/cog/self_refine/strategies/qa.py (11)

24-30: LGTM! Initialization is straightforward.

The __init__ method correctly initializes the class with the provided language model and patience parameter.


32-61: LGTM! Answer generation logic is clear.

The generate method correctly uses _prompt_agent to generate the answer and processes the result to extract the relevant part.


63-102: LGTM! Critique generation and patience handling logic is clear.

The generate_critique method correctly uses _prompt_critique to generate the critique and includes logic to handle the patience counter and halting condition.


104-114: LGTM! Output dictionary creation is straightforward.

The create_output_dict method correctly creates an output dictionary containing the answer and critique.


116-149: LGTM! Answer update logic is clear.

The update_answer_based_on_critique method correctly uses _prompt_refine to update the answer and processes the result to extract the relevant part.


151-159: LGTM! Halting condition check is straightforward.

The halting_condition method correctly checks if the halting condition has been met.


161-171: LGTM! Reset logic is straightforward.

The reset method correctly resets the strategy to its initial state.


174-177: LGTM! Inheritance is straightforward.

The SelfRefineHotQAStrategy class correctly inherits from SelfRefineQAStrategy and does not introduce any new methods or attributes.


180-183: LGTM! Inheritance is straightforward.

The SelfRefineFEVERStrategy class correctly inherits from SelfRefineQAStrategy and does not introduce any new methods or attributes.


186-189: LGTM! Inheritance is straightforward.

The SelfRefineTriviaQAStrategy class correctly inherits from SelfRefineQAStrategy and does not introduce any new methods or attributes.


192-195: LGTM! Inheritance is straightforward.

The SelfRefineAmbigNQStrategy class correctly inherits from SelfRefineQAStrategy and does not introduce any new methods or attributes.
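Taken together, the reviewed methods describe a patience-based refine loop: critique the answer, stop once it has stopped changing for `patience` rounds. A simplified sketch of that pattern (the class and method names mirror the review, but the bodies are invented stand-ins, not the repository's implementation in agential/cog/self_refine/strategies/qa.py):

```python
# Illustrative sketch of the patience/halting pattern the review
# describes. Bodies are simplified stand-ins for demonstration.
class SelfRefineQAStrategySketch:
    def __init__(self, patience: int = 1) -> None:
        self.patience = patience
        self._patience_counter = 0
        self._halt = False
        self._prev_answer = ""

    def generate_critique(self, answer: str) -> str:
        # If the answer stopped changing, count toward patience and
        # halt once the counter reaches the limit.
        if answer == self._prev_answer:
            self._patience_counter += 1
            if self._patience_counter >= self.patience:
                self._halt = True
        else:
            self._patience_counter = 0
        self._prev_answer = answer
        return f"critique of: {answer}"

    def halting_condition(self) -> bool:
        # True once the answer has been stable for `patience` rounds.
        return self._halt

    def reset(self) -> None:
        # Return the strategy to its initial state.
        self._patience_counter = 0
        self._halt = False
        self._prev_answer = ""
```

The benchmark-specific subclasses (HotpotQA, FEVER, TriviaQA, AmbigNQ) then add nothing beyond the base class, which is exactly what the per-class "inheritance is straightforward" comments above confirm.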

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (23)
agential/cog/self_refine/prompts.py (23)

13-37: Example critique is detailed and accurate.

The critique example for the elevation range question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by specifying the source of the elevation data more explicitly.

40-61: Example critique is detailed and accurate.

The critique example for the Milhouse question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by providing additional context about the character's naming.

65-88: Example critique is detailed and accurate.

The critique example for the Finnish rock groups question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by providing additional context about the documentary.

92-113: Example critique is detailed and accurate.

The critique example for the professions question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- The proposed answer correctly identifies the professions of both individuals but does not specify which profession they have in common.
+ The proposed answer correctly identifies the professions of both individuals but should specify that they both worked as directors.

117-140: Example critique is detailed and accurate.

The critique example for the magazine question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- The proposed answer gives the wrong dates of publication for Arthur's Magazine. The final answer "Arthur's Magazine" is correct since 1844 < 1989.
+ The proposed answer correctly identifies Arthur's Magazine as the earlier publication, but the explanation should include the specific publication dates for clarity.

156-179: Example critique is detailed and accurate.

The critique example for the elevation range question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by specifying the source of the elevation data more explicitly.

186-211: Example critique is detailed and accurate.

The critique example for the Milhouse question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by providing additional context about the character's naming.

214-241: Example critique is detailed and accurate.

The critique example for the Finnish rock groups question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by providing additional context about the documentary.

244-269: Example critique is detailed and accurate.

The critique example for the professions question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- The proposed answer correctly identifies the professions of both individuals but does not specify which profession they have in common.
+ The proposed answer correctly identifies the professions of both individuals but should specify that they both worked as directors.

272-297: Example critique is detailed and accurate.

The critique example for the magazine question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- The proposed answer gives the wrong dates of publication for Arthur's Magazine. The final answer "Arthur's Magazine" is correct since 1844 < 1989.
+ The proposed answer correctly identifies Arthur's Magazine as the earlier publication, but the explanation should include the specific publication dates for clarity.

356-381: Example critique is detailed and accurate.

The critique example for the painting question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by providing additional context about the artist's work.

385-407: Example critique is detailed and accurate.

The critique example for the National Theatre question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

410-426: Example critique is detailed and accurate.

The critique example for the skier question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

428-441: Example critique is detailed and accurate.

The critique example for the author question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

490-515: Example critique is detailed and accurate.

The critique example for the painting question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, the explanation can be improved for clarity.
+ However, the explanation can be improved by providing additional context about the artist's work.

522-543: Example critique is detailed and accurate.

The critique example for the National Theatre question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

550-565: Example critique is detailed and accurate.

The critique example for the skier question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

571-585: Example critique is detailed and accurate.

The critique example for the author question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

778-807: Example critique is detailed and accurate.

The critique example for the World Cup question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

810-843: Example critique is detailed and accurate.

The critique example for the CT scan question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

847-873: Example critique is detailed and accurate.

The critique example for the Civil War question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

877-894: Example critique is detailed and accurate.

The critique example for the Nando's question is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- However, it might not be accurate.
+ However, it might not be accurate without further verification.

925-939: Example critique is detailed and accurate.

The critique example for the Fox Broadcasting claim is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

- The evidence contradicts the proposed answer, confirming he did work with Fox in the television film Virtuality.
+ The evidence contradicts the proposed answer, confirming he did work with Fox in the television film Virtuality. The explanation can be improved by specifying the role he played in the film.
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e40aa57 and 8d68636.

Files selected for processing (1)
  • agential/cog/self_refine/prompts.py (1 hunks)
Files not summarized due to errors (1)
  • agential/cog/self_refine/prompts.py: Error: Message exceeds token limit
Additional comments not posted (10)
agential/cog/self_refine/prompts.py (10)

6-10: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.
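This placeholder check can be made mechanical with the standard library. A small sketch (the template string below is invented for illustration; the real prompt strings live in agential/cog/self_refine/prompts.py):

```python
import string

# Hypothetical template using the placeholder names the review mentions.
TEMPLATE = "{examples}\n\nQ: {question}\nA: {answer}"


def placeholder_names(template: str) -> set:
    """Extract the named fields a str.format-style template expects."""
    return {
        field
        for _, field, _, _ in string.Formatter().parse(template)
        if field is not None
    }


# Verify the template uses exactly the expected placeholders before
# calling .format() at runtime.
assert placeholder_names(TEMPLATE) == {"examples", "question", "answer"}
```

Running such a check in the test suite would catch a misspelled or missing placeholder before it surfaces as a `KeyError` at prompt-formatting time.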


143-153: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


301-313: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


320-324: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


445-455: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


591-603: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


610-614: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


899-913: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


918-922: Ensure placeholders are correctly formatted.

The placeholders {examples}, {question}, and {answer} should be consistent with the rest of the file.

Ensure that the placeholders are correctly replaced during runtime.


943-957: Example critique is detailed and accurate.

The critique example for the Stranger Things claim is well-detailed and correctly identifies the plausibility and truthfulness of the answer. However, consider adding more clarity on how the explanation can be improved.

Comment on lines +458 to +484
TRIVIAQA_REFINE_FEWSHOT_EXAMPLES = """Q: Mendelssohn's 'Wedding March' was. originally written as incidental music for which Shakespeare play in 1842?
A: A Midsummer Night's Dream

What's the problem with the above answer?

1. Plausibility:

The question asks for the name of the Shakespeare play, and the answer is "Hamlet," which is a name of a Shakespeare play. So it's plausible.

2. Truthfulness:

Let's search the question in google:

> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for which Shakespeare play in 1842? site: wikipedia.org
> Evidence: [Wedding March (Mendelssohn) - Wikipedia] The "Wedding March" in C major, written in 1842, is one of the most famous pieces by Mendelssohn. It was written as incidental music for William Shakespeare's play "A Midsummer Night's Dream."

The evidence suggests that Mendelssohn's 'Wedding March' was written as incidental music for "A Midsummer Night's Dream," not "Hamlet."

Let's check the proposed answer:

> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for the play "Hamlet" in 1842.
> Evidence: [Hamlet - Wikipedia] "Hamlet" is a tragedy written by William Shakespeare at an uncertain date between 1599 and 1602. Mendelssohn did not write incidental music for "Hamlet."

The evidence shows that Mendelssohn did not write incidental music for "Hamlet," contradicting the proposed answer.

Above all, the proposed answer is incorrect because Mendelssohn's 'Wedding March' was not written for "Hamlet." It was actually written for "A Midsummer Night's Dream."


Example critique contains an inconsistency.

The critique example for the Shakespeare play question correctly identifies the plausibility and truthfulness of the answer. However, there is an inconsistency in the initial plausibility check where the answer is incorrectly stated as "Hamlet" instead of "A Midsummer Night's Dream."

- The question asks for the name of the Shakespeare play, and the answer is "Hamlet," which is a name of a Shakespeare play. So it's plausible.
+ The question asks for the name of the Shakespeare play, and the answer is "A Midsummer Night's Dream," which is a name of a Shakespeare play. So it's plausible.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
TRIVIAQA_REFINE_FEWSHOT_EXAMPLES = """Q: Mendelssohn's 'Wedding March' was. originally written as incidental music for which Shakespeare play in 1842?
A: A Midsummer Night's Dream
What's the problem with the above answer?
1. Plausibility:
The question asks for the name of the Shakespeare play, and the answer is "Hamlet," which is a name of a Shakespeare play. So it's plausible.
2. Truthfulness:
Let's search the question in google:
> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for which Shakespeare play in 1842? site: wikipedia.org
> Evidence: [Wedding March (Mendelssohn) - Wikipedia] The "Wedding March" in C major, written in 1842, is one of the most famous pieces by Mendelssohn. It was written as incidental music for William Shakespeare's play "A Midsummer Night's Dream."
The evidence suggests that Mendelssohn's 'Wedding March' was written as incidental music for "A Midsummer Night's Dream," not "Hamlet."
Let's check the proposed answer:
> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for the play "Hamlet" in 1842.
> Evidence: [Hamlet - Wikipedia] "Hamlet" is a tragedy written by William Shakespeare at an uncertain date between 1599 and 1602. Mendelssohn did not write incidental music for "Hamlet."
The evidence shows that Mendelssohn did not write incidental music for "Hamlet," contradicting the proposed answer.
Above all, the proposed answer is incorrect because Mendelssohn's 'Wedding March' was not written for "Hamlet." It was actually written for "A Midsummer Night's Dream."
The question asks for the name of the Shakespeare play, and the answer is "A Midsummer Night's Dream," which is a name of a Shakespeare play. So it's plausible.

Comment on lines +327 to +352
TRIVIAQA_CRITIQUE_FEWSHOT_EXAMPLES = """Q: Mendelssohn's 'Wedding March' was. originally written as incidental music for which Shakespeare play in 1842?
A: A Midsummer Night's Dream

What's the problem with the above answer?

1. Plausibility:

The question asks for the name of the Shakespeare play, and the answer is "Hamlet," which is a name of a Shakespeare play. So it's plausible.

2. Truthfulness:

Let's search the question in google:

> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for which Shakespeare play in 1842? site: wikipedia.org
> Evidence: [Wedding March (Mendelssohn) - Wikipedia] The "Wedding March" in C major, written in 1842, is one of the most famous pieces by Mendelssohn. It was written as incidental music for William Shakespeare's play "A Midsummer Night's Dream."

The evidence suggests that Mendelssohn's 'Wedding March' was written as incidental music for "A Midsummer Night's Dream," not "Hamlet."

Let's check the proposed answer:

> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for the play "Hamlet" in 1842.
> Evidence: [Hamlet - Wikipedia] "Hamlet" is a tragedy written by William Shakespeare at an uncertain date between 1599 and 1602. Mendelssohn did not write incidental music for "Hamlet."

The evidence shows that Mendelssohn did not write incidental music for "Hamlet," contradicting the proposed answer.

Above all, the proposed answer is incorrect because Mendelssohn's 'Wedding March' was not written for "Hamlet." It was actually written for "A Midsummer Night's Dream."

Example critique contains an inconsistency.

The critique example for the Shakespeare play question correctly identifies the plausibility and truthfulness of the answer. However, there is an inconsistency in the initial plausibility check where the answer is incorrectly stated as "Hamlet" instead of "A Midsummer Night's Dream."

- The question asks for the name of the Shakespeare play, and the answer is "Hamlet," which is a name of a Shakespeare play. So it's plausible.
+ The question asks for the name of the Shakespeare play, and the answer is "A Midsummer Night's Dream," which is a name of a Shakespeare play. So it's plausible.
Committable suggestion


Suggested change
TRIVIAQA_CRITIQUE_FEWSHOT_EXAMPLES = """Q: Mendelssohn's 'Wedding March' was. originally written as incidental music for which Shakespeare play in 1842?
A: A Midsummer Night's Dream
What's the problem with the above answer?
1. Plausibility:
The question asks for the name of the Shakespeare play, and the answer is "Hamlet," which is a name of a Shakespeare play. So it's plausible.
2. Truthfulness:
Let's search the question in google:
> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for which Shakespeare play in 1842? site: wikipedia.org
> Evidence: [Wedding March (Mendelssohn) - Wikipedia] The "Wedding March" in C major, written in 1842, is one of the most famous pieces by Mendelssohn. It was written as incidental music for William Shakespeare's play "A Midsummer Night's Dream."
The evidence suggests that Mendelssohn's 'Wedding March' was written as incidental music for "A Midsummer Night's Dream," not "Hamlet."
Let's check the proposed answer:
> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for the play "Hamlet" in 1842.
> Evidence: [Hamlet - Wikipedia] "Hamlet" is a tragedy written by William Shakespeare at an uncertain date between 1599 and 1602. Mendelssohn did not write incidental music for "Hamlet."
The evidence shows that Mendelssohn did not write incidental music for "Hamlet," contradicting the proposed answer.
Above all, the proposed answer is incorrect because Mendelssohn's 'Wedding March' was not written for "Hamlet." It was actually written for "A Midsummer Night's Dream."
TRIVIAQA_CRITIQUE_FEWSHOT_EXAMPLES = """Q: Mendelssohn's 'Wedding March' was. originally written as incidental music for which Shakespeare play in 1842?
A: A Midsummer Night's Dream
What's the problem with the above answer?
1. Plausibility:
The question asks for the name of the Shakespeare play, and the answer is "A Midsummer Night's Dream," which is a name of a Shakespeare play. So it's plausible.
2. Truthfulness:
Let's search the question in google:
> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for which Shakespeare play in 1842? site: wikipedia.org
> Evidence: [Wedding March (Mendelssohn) - Wikipedia] The "Wedding March" in C major, written in 1842, is one of the most famous pieces by Mendelssohn. It was written as incidental music for William Shakespeare's play "A Midsummer Night's Dream."
The evidence suggests that Mendelssohn's 'Wedding March' was written as incidental music for "A Midsummer Night's Dream," not "Hamlet."
Let's check the proposed answer:
> Search Query: Mendelssohn's 'Wedding March' was originally written as incidental music for the play "Hamlet" in 1842.
> Evidence: [Hamlet - Wikipedia] "Hamlet" is a tragedy written by William Shakespeare at an uncertain date between 1599 and 1602. Mendelssohn did not write incidental music for "Hamlet."
The evidence shows that Mendelssohn did not write incidental music for "Hamlet," contradicting the proposed answer.
Above all, the proposed answer is incorrect because Mendelssohn's 'Wedding March' was not written for "Hamlet." It was actually written for "A Midsummer Night's Dream."

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8d68636 and 1534a86.

Files selected for processing (2)
  • agential/cog/self_refine/strategies/math.py (2 hunks)
  • agential/cog/self_refine/strategies/qa.py (1 hunks)
Additional context used
GitHub Check: codecov/patch
agential/cog/self_refine/strategies/qa.py

[warning] 3-3: agential/cog/self_refine/strategies/qa.py#L3
Added line #L3 was not covered by tests


[warning] 5-5: agential/cog/self_refine/strategies/qa.py#L5
Added line #L5 was not covered by tests


[warning] 7-7: agential/cog/self_refine/strategies/qa.py#L7
Added line #L7 was not covered by tests


[warning] 12-13: agential/cog/self_refine/strategies/qa.py#L12-L13
Added lines #L12 - L13 were not covered by tests


[warning] 16-16: agential/cog/self_refine/strategies/qa.py#L16
Added line #L16 was not covered by tests


[warning] 25-25: agential/cog/self_refine/strategies/qa.py#L25
Added line #L25 was not covered by tests


[warning] 27-31: agential/cog/self_refine/strategies/qa.py#L27-L31
Added lines #L27 - L31 were not covered by tests


[warning] 33-33: agential/cog/self_refine/strategies/qa.py#L33
Added line #L33 was not covered by tests


[warning] 53-53: agential/cog/self_refine/strategies/qa.py#L53
Added line #L53 was not covered by tests


[warning] 61-61: agential/cog/self_refine/strategies/qa.py#L61
Added line #L61 was not covered by tests


[warning] 63-63: agential/cog/self_refine/strategies/qa.py#L63
Added line #L63 was not covered by tests


[warning] 86-86: agential/cog/self_refine/strategies/qa.py#L86
Added line #L86 was not covered by tests


[warning] 95-98: agential/cog/self_refine/strategies/qa.py#L95-L98
Added lines #L95 - L98 were not covered by tests


[warning] 100-100: agential/cog/self_refine/strategies/qa.py#L100
Added line #L100 was not covered by tests


[warning] 102-102: agential/cog/self_refine/strategies/qa.py#L102
Added line #L102 was not covered by tests


[warning] 104-104: agential/cog/self_refine/strategies/qa.py#L104
Added line #L104 was not covered by tests


[warning] 114-114: agential/cog/self_refine/strategies/qa.py#L114
Added line #L114 was not covered by tests


[warning] 116-116: agential/cog/self_refine/strategies/qa.py#L116
Added line #L116 was not covered by tests


[warning] 138-138: agential/cog/self_refine/strategies/qa.py#L138
Added line #L138 was not covered by tests


[warning] 147-147: agential/cog/self_refine/strategies/qa.py#L147
Added line #L147 was not covered by tests


[warning] 149-149: agential/cog/self_refine/strategies/qa.py#L149
Added line #L149 was not covered by tests


[warning] 151-151: agential/cog/self_refine/strategies/qa.py#L151
Added line #L151 was not covered by tests


[warning] 159-159: agential/cog/self_refine/strategies/qa.py#L159
Added line #L159 was not covered by tests


[warning] 161-161: agential/cog/self_refine/strategies/qa.py#L161
Added line #L161 was not covered by tests


[warning] 169-171: agential/cog/self_refine/strategies/qa.py#L169-L171
Added lines #L169 - L171 were not covered by tests


[warning] 174-174: agential/cog/self_refine/strategies/qa.py#L174
Added line #L174 was not covered by tests


[warning] 177-177: agential/cog/self_refine/strategies/qa.py#L177
Added line #L177 was not covered by tests


[warning] 180-180: agential/cog/self_refine/strategies/qa.py#L180
Added line #L180 was not covered by tests


[warning] 183-183: agential/cog/self_refine/strategies/qa.py#L183
Added line #L183 was not covered by tests


[warning] 186-186: agential/cog/self_refine/strategies/qa.py#L186
Added line #L186 was not covered by tests

Additional comments not posted (13)
agential/cog/self_refine/strategies/math.py (2)

13-13: LGTM! The import statement and its usage are valid.

The EM function from agential.eval.em is correctly imported and used in the generate_critique method.


Line range hint 96-101: LGTM! The logic for using the EM function and managing the patience counter is correct.

The EM function is used to compare answers, and the patience counter is incremented correctly. The halting condition is handled appropriately.
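The EM-and-patience mechanism the review describes can be sketched roughly as follows; the `em` helper, the `PatienceTracker` class, and its attribute names are illustrative assumptions, not the repository's actual implementation.

```python
# Hypothetical sketch of the patience-based halting logic described above.
# Names and thresholds are assumptions for illustration only.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient exact-match comparison."""
    return " ".join(text.lower().split())


def em(answer: str, key: str) -> bool:
    """Exact match after normalization."""
    return normalize(answer) == normalize(key)


class PatienceTracker:
    """Halts refinement after `patience` consecutive unchanged answers."""

    def __init__(self, patience: int = 2) -> None:
        self.patience = patience
        self._patience_counter = 0
        self._prev_answer = ""
        self.halt = False

    def update(self, answer: str) -> None:
        # Increment the counter only when the new answer matches the previous one.
        if em(answer, self._prev_answer):
            self._patience_counter += 1
            if self._patience_counter == self.patience:
                self.halt = True
        else:
            self._patience_counter = 0
        self._prev_answer = answer
```

Under this sketch, repeated identical answers exhaust the patience budget and set the halt flag, which matches the halting behavior the review approves of.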

agential/cog/self_refine/strategies/qa.py (11)

25-31: LGTM! The initialization logic is correct.

The __init__ method correctly initializes the attributes of the SelfRefineQAStrategy class.

Tools
GitHub Check: codecov/patch

[warning] 25-25: agential/cog/self_refine/strategies/qa.py#L25
Added line #L25 was not covered by tests


[warning] 27-31: agential/cog/self_refine/strategies/qa.py#L27-L31
Added lines #L27 - L31 were not covered by tests


33-61: LGTM! The logic for generating an answer is correct.

The generate method uses the _prompt_agent function to generate an answer based on the provided question, examples, prompt, and additional keys.

Tools
GitHub Check: codecov/patch

[warning] 33-33: agential/cog/self_refine/strategies/qa.py#L33
Added line #L33 was not covered by tests


[warning] 53-53: agential/cog/self_refine/strategies/qa.py#L53
Added line #L53 was not covered by tests


[warning] 61-61: agential/cog/self_refine/strategies/qa.py#L61
Added line #L61 was not covered by tests


63-102: LGTM! The logic for generating a critique and managing the patience counter is correct.

The generate_critique method uses the _prompt_critique function to generate a critique and the EM function to manage the patience counter. The halting condition is handled appropriately.

Tools
GitHub Check: codecov/patch

[warning] 63-63: agential/cog/self_refine/strategies/qa.py#L63
Added line #L63 was not covered by tests


[warning] 86-86: agential/cog/self_refine/strategies/qa.py#L86
Added line #L86 was not covered by tests


[warning] 95-98: agential/cog/self_refine/strategies/qa.py#L95-L98
Added lines #L95 - L98 were not covered by tests


[warning] 100-100: agential/cog/self_refine/strategies/qa.py#L100
Added line #L100 was not covered by tests


[warning] 102-102: agential/cog/self_refine/strategies/qa.py#L102
Added line #L102 was not covered by tests


104-114: LGTM! The logic for creating the output dictionary is correct.

The create_output_dict method correctly creates a dictionary containing the answer and critique.

Tools
GitHub Check: codecov/patch

[warning] 104-104: agential/cog/self_refine/strategies/qa.py#L104
Added line #L104 was not covered by tests


[warning] 114-114: agential/cog/self_refine/strategies/qa.py#L114
Added line #L114 was not covered by tests


116-149: LGTM! The logic for updating the answer based on the critique is correct.

The update_answer_based_on_critique method uses the _prompt_refine function to update the answer based on the provided critique.

Tools
GitHub Check: codecov/patch

[warning] 116-116: agential/cog/self_refine/strategies/qa.py#L116
Added line #L116 was not covered by tests


[warning] 138-138: agential/cog/self_refine/strategies/qa.py#L138
Added line #L138 was not covered by tests


[warning] 147-147: agential/cog/self_refine/strategies/qa.py#L147
Added line #L147 was not covered by tests


[warning] 149-149: agential/cog/self_refine/strategies/qa.py#L149
Added line #L149 was not covered by tests


151-159: LGTM! The logic for checking the halting condition is correct.

The halting_condition method correctly checks if the halting condition has been met based on the patience counter.

Tools
GitHub Check: codecov/patch

[warning] 151-151: agential/cog/self_refine/strategies/qa.py#L151
Added line #L151 was not covered by tests


[warning] 159-159: agential/cog/self_refine/strategies/qa.py#L159
Added line #L159 was not covered by tests


161-171: LGTM! The logic for resetting the strategy is correct.

The reset method correctly resets the internal variables keeping track of the halting condition.

Tools
GitHub Check: codecov/patch

[warning] 161-161: agential/cog/self_refine/strategies/qa.py#L161
Added line #L161 was not covered by tests


[warning] 169-171: agential/cog/self_refine/strategies/qa.py#L169-L171
Added lines #L169 - L171 were not covered by tests
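The `halting_condition`/`reset` pair discussed in the two comments above might look like the following minimal sketch; the class name, attributes, and `register_repeat` helper are assumptions for illustration.

```python
# Hypothetical sketch of the halting/reset state described in the review;
# attribute and method names are illustrative assumptions.
class HaltState:
    def __init__(self, patience: int = 1) -> None:
        self.patience = patience
        self.reset()

    def register_repeat(self) -> None:
        """Increment the patience counter when the answer is unchanged."""
        self._patience_counter += 1
        if self._patience_counter >= self.patience:
            self._halt = True

    def halting_condition(self) -> bool:
        """True once the patience budget is exhausted."""
        return self._halt

    def reset(self) -> None:
        """Clear the internal variables tracking the halting condition."""
        self._patience_counter = 0
        self._halt = False
```

Routing `reset` through `__init__` keeps construction and reset behavior identical, which is one way to satisfy the "resets the internal variables" requirement the review checks.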


174-177: LGTM! The class inheritance is correct.

The SelfRefineHotQAStrategy class correctly inherits from SelfRefineQAStrategy.

Tools
GitHub Check: codecov/patch

[warning] 174-174: agential/cog/self_refine/strategies/qa.py#L174
Added line #L174 was not covered by tests


[warning] 177-177: agential/cog/self_refine/strategies/qa.py#L177
Added line #L177 was not covered by tests


180-183: LGTM! The class inheritance is correct.

The SelfRefineFEVERStrategy class correctly inherits from SelfRefineQAStrategy.

Tools
GitHub Check: codecov/patch

[warning] 180-180: agential/cog/self_refine/strategies/qa.py#L180
Added line #L180 was not covered by tests


[warning] 183-183: agential/cog/self_refine/strategies/qa.py#L183
Added line #L183 was not covered by tests


186-189: LGTM! The class inheritance is correct.

The SelfRefineTriviaQAStrategy class correctly inherits from SelfRefineQAStrategy.

Tools
GitHub Check: codecov/patch

[warning] 186-186: agential/cog/self_refine/strategies/qa.py#L186
Added line #L186 was not covered by tests


192-195: LGTM! The class inheritance is correct.

The SelfRefineAmbigNQStrategy class correctly inherits from SelfRefineQAStrategy.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Outside diff range, codebase verification and nitpick comments (3)
agential/cog/self_refine/prompts.py (3)

6-10: Clarify the instruction format.

The instruction for HOTPOTQA ends abruptly with "A:". Consider adding a placeholder or more context to complete the instruction.

- Q: {question}
- A: """
+ Q: {question}
+ A: {answer}"""

13-37: Ensure consistency in critique examples.

The critique examples for HOTPOTQA are generally clear and well-structured. However, ensure that each example consistently follows the same format for clarity.

- Above all, the proposed answer correctly identifies the elevation range for the area that the eastern sector of the Colorado orogeny extends into. However, the explanation can be improved for clarity.
+ Above all, the proposed answer correctly identifies the elevation range for the area that the eastern sector of the Colorado orogeny extends into. However, the explanation can be improved for clarity.
---
Q: Musician and satirist Allie Goertz wrote a song about the \"The Simpsons\" character Milhouse, who Matt Groening named after who?
A: Richard Nixon

156-179: Ensure clarity in refinement examples.

The refinement examples for HOTPOTQA are clear, but ensure that the explanations are concise and directly address the question and proposed answer.

- Above all, the proposed answer correctly identifies the elevation range for the area that the eastern sector of the Colorado orogeny extends into. However, the explanation can be improved for clarity.
+ Above all, the proposed answer correctly identifies the elevation range for the area that the eastern sector of the Colorado orogeny extends into. The explanation can be improved for clarity.
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1534a86 and 6dd6a12.

Files selected for processing (5)
  • agential/cog/self_refine/factory.py (4 hunks)
  • agential/cog/self_refine/functional.py (3 hunks)
  • agential/cog/self_refine/prompts.py (1 hunks)
  • agential/cog/self_refine/strategies/math.py (3 hunks)
  • agential/cog/self_refine/strategies/qa.py (1 hunks)
Files not summarized due to errors (1)
  • agential/cog/self_refine/prompts.py: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (2)
  • agential/cog/self_refine/strategies/math.py
  • agential/cog/self_refine/strategies/qa.py
Additional comments not posted (7)
agential/cog/self_refine/factory.py (6)

8-17: LGTM! Imports are necessary for the new functionality.

The imports include new few-shot examples and instructions for various benchmarks. These changes are appropriate and necessary.

Also applies to: 19-19, 22-24, 26-26, 29-31, 33-33, 36-36, 41-42, 50-55


58-61: LGTM! Updates to SELF_REFINE_BENCHMARK_FEWSHOTS are appropriate.

The dictionary has been updated to include new few-shot types for various benchmarks. These changes are necessary to support the new functionality.


71-73: LGTM! Updates to SELF_REFINE_PROMPTS are appropriate.

The dictionary has been updated to include new prompts for various benchmarks. These changes are necessary to support the new functionality.

Also applies to: 76-78, 81-83, 86-88


114-117: LGTM! Updates to SELF_REFINE_FEWSHOTS are appropriate.

The dictionary has been updated to include new few-shot examples for various benchmarks. These changes are necessary to support the new functionality.

Also applies to: 119-121, 123-125, 127-129


147-150: LGTM! Updates to SELF_REFINE_STRATEGIES are appropriate.

The dictionary has been updated to include new strategies for various benchmarks. These changes are necessary to support the new functionality.


Line range hint 158-162: LGTM! Methods in SelfRefineFactory are well-defined.

The methods to retrieve few-shot examples, prompts, and strategies based on the benchmark are well-defined and necessary for the new functionality.

Also applies to: 164-167, 169-172, 174-178, 180-184, 186-191, 193-197, 199-204, 206-210

agential/cog/self_refine/prompts.py (1)

757-779: Correct inconsistencies in refinement examples.

The refinement example for the figure skating question contains an inconsistency. The initial plausibility check incorrectly states the answer as "never" instead of "1908."

Comment on lines +617 to +775

> Search Query: Barium is used in CT scans to help highlight the digestive system
> Evidence: [Barium Sulfate (Enhancer)] Barium sulfate works by coating the inside of your esophagus, stomach, or intestines which allows them to be seen more clearly on a CT scan or other radiologic (x-ray) examination.

According to the evidence, the proposed answer is not completely accurate.

Let's search the question in google:

> Search Query: What does barium do in a ct scan? site: wikipedia.org
> Evidence: [Barium sulfate suspension - Wikipedia] Barium sulfate suspension, often simply called barium, is a contrast agent used during X-rays. Specifically it is used to improve visualization of the gastrointestinal tract (esophagus, stomach, intestines) on plain X-ray or computed tomography.

The evidence suggests that barium is used to "improve visualization of the gastrointestinal tract", which includes the digestive system. Therefore, the proposed answer should be more specific by stating "improve visualization".

> Search Query: Why barium used in CT scans?
> Evidence: [What To Expect Before Your CT Scan - St. Elizabeth's Medical Center] The barium helps to highlight body areas for the CT scan. If you are having a scan of a body part other than the abdomen you should arrive 30 minutes before your appointed time.

The evidence suggests that barium helps "highlight body areas" for the CT scan, not specifically the digestive system. However, it is possible that barium is used to highlight the digestive system in many cases, so the proposed answer is not complete.

Conclusion: While the answer "highlight the digestive system" is a common usage of barium in CT scans, it is not a complete description of barium's function. A more complete answer would be "to improve visualization of the gastrointestinal tract."

---

Q: Where was the fort located at which the first shot of the civil war was fired?
A: Charleston Harbor, South Carolina

What's the problem with the above answer?

1. Plausibility:

The question asks for the location of the fort, and the answer is "Fort Sumter", which may be a location. So it's plausible.

2. Truthfulness:

Let's search the question in google:

> Search Query: Where was the fort located at which the first shot of the civil war was fired?
> Evidence: [Civil War Begins - U.S. Senate] April 12, 1861 At 4:30 a.m. on April 12, 1861, Confederate troops fired on Fort Sumter in South Carolina’s Charleston Harbor. Less than 34 hours later, Union forces surrendered.

The evidence suggests that the first shot of the Civil War was fired at Fort Sumter, which is consistent with the proposed answer.

We then need to check whether Fort Sumter is located in Charleston, South Carolina.

> Search Query: Where is Fort Sumter located?
> Evidence: [Fort Sumter and Fort Moultrie National Historical Park (U.S. National ...] Fort Sumter is located in the middle of Charleston Harbor, and is only accessible by ferry rides through Fort Sumter Tours.

Conclusion: From the above evidence we know that the first shot of the Civil War was fired at Fort Sumter, and Fort Sumter is located in Charleston, the reasoning process is correct. However, the proposed answer should specify the location "Charleston" rather than only state "Fort Sumter".

---

Q: When did nando's come to the uk?
A: 1992

What's the problem with the above answer?

1. Plausibility:

The question asks for a time, and the answer is "1992", which is a year. So it's plausible.

2. Truthfulness:

Let's search the question in google:

> Search Query: When did nando's come to the uk? site: wikipedia.org
> Evidence: [Nando's - Wikipedia] Nando's opened its first restaurants within the United Kingdom, in 1992, in the west London suburbs of Ealing and Earls Court, initially focusing on takeaway food.

The evidence suggests that Nando's first opened in the UK in 1992, which is consistent with the proposed answer. We can provide more detailed information in the answer."""


SELF_REFINE_CRITIQUE_INSTRUCTION_AMBIGNQ = """{examples}
(END OF EXAMPLES)

Question: {question}
Proposed Answer: {answer}

What's the problem with the above answer?

1. Plausibility:

"""


AMBIGNQ_REFINE_FEWSHOT_EXAMPLES = """Q: When did men's figure skating become a summer Olympic sport?
A: 1908

What's the problem with the above answer?

1. Plausibility:

The question asks for the time men's figure skating become a summer Olympic sport, and the answer "never" does not provide a time. So it's not plausible. The answer should be a time, like year or date.

2. Truthfulness:

Let's search the question in google:

> Search Query: When did men's figure skating become a summer Olympic sport?
> Evidence: [Figure skating at the Olympic Games - Wikipedia] Figure skating was first contested in the Olympic Games at the 1908 Summer Olympics . Since 1924, the sport has been a part of the Winter Olympic Games .

The evidence suggests Figure skating became an Olympic sport at the 1908 Summer Olympics, and has been a part of the Winter Olympic Games in 1924.

The answer is wrong by saying that men's figure skating has never been a part of the Summer Olympics.

Correct inconsistencies in critique examples.

The critique example for the figure skating question contains an inconsistency. The initial plausibility check incorrectly states the answer as "never" instead of "1908."

- The question asks for the time men's figure skating become a summer Olympic sport, and the answer "never" does not provide a time. So it's not plausible.
+ The question asks for the time men's figure skating become a summer Olympic sport, and the answer "1908" provides a time. So it's plausible.
Committable suggestion


Suggested change
AMBIGNQ_CRITIQUE_FEWSHOT_EXAMPLES = """Q: When did men's figure skating become a summer Olympic sport?
A: 1908
What's the problem with the above answer?
1. Plausibility:
The question asks for the time men's figure skating become a summer Olympic sport, and the answer "never" does not provide a time. So it's not plausible. The answer should be a time, like year or date.
2. Truthfulness:
Let's search the question in google:
> Search Query: When did men's figure skating become a summer Olympic sport?
> Evidence: [Figure skating at the Olympic Games - Wikipedia] Figure skating was first contested in the Olympic Games at the 1908 Summer Olympics . Since 1924, the sport has been a part of the Winter Olympic Games .
The evidence suggests Figure skating became an Olympic sport at the 1908 Summer Olympics, and has been a part of the Winter Olympic Games in 1924.
The answer is wrong by saying that men's figure skating has never been a part of the Summer Olympics.
---
Q: When was the last time the USA men's national soccer team missed the World Cup?
A: 2018
What's the problem with the above answer?
1. Plausibility:
The question asks for a year, and the answer is "1986", which is a year. So it's plausible.
2. Truthfulness:
Let's search the question and proposed answer in google:
> Search Query: When was the last time the USA men's national soccer team missed the World Cup? site: wikipedia.org
> Evidence: [United States at the FIFA World Cup - Wikipedia] The United States participated in every World Cup from 1990 through 2014, but did not qualify in 2018, marking first time the team had missed a World Cup since 1986. They returned to the World Cup by qualifying for the 2022 edition .
> Search Query: The last time the USA men's national soccer team missed the World Cup was in 1986
> Evidence: [United States at the FIFA World Cup - Wikipedia] The United States participated in every World Cup from 1990 through 2014, but did not qualify in 2018, marking first time the team had missed a World Cup since 1986.
The evidence suggests that the USA men's national soccer team did not qualify for the World Cup in 2018, but qualify for 2022, 2018 > 1986, so the last time the USA men's national soccer team missed the World Cup was in 2018, not in 1986, which contradicts the proposed answer.
Above all, the USA men's national soccer team did miss the World Cup in 1986, but not the last time, the last time was in 2018.
---
Q: What does barium do in a ct scan?
A: to improve visualization of the gastrointestinal tract
What's the problem with the above answer?
1. Plausibility:
The question asks for the function of barium in a CT scan, and the answer is "highlight the digestive system", which is a function. So it's plausible.
2. Truthfulness:
Let's search the proposed answer:
> Search Query: Barium is used in CT scans to help highlight the digestive system
> Evidence: [Barium Sulfate (Enhancer)] Barium sulfate works by coating the inside of your esophagus, stomach, or intestines which allows them to be seen more clearly on a CT scan or other radiologic (x-ray) examination.
According to the evidence, the proposed answer is not completely accurate.
Let's search the question in google:
> Search Query: What does barium do in a ct scan? site: wikipedia.org
> Evidence: [Barium sulfate suspension - Wikipedia] Barium sulfate suspension, often simply called barium, is a contrast agent used during X-rays. Specifically it is used to improve visualization of the gastrointestinal tract (esophagus, stomach, intestines) on plain X-ray or computed tomography.
The evidence suggests that barium is used to "improve visualization of the gastrointestinal tract", which includes the digestive system. Therefore, the proposed answer should be more specific by stating "improve visualization".
> Search Query: Why barium used in CT scans?
> Evidence: [What To Expect Before Your CT Scan - St. Elizabeth's Medical Center] The barium helps to highlight body areas for the CT scan. If you are having a scan of a body part other than the abdomen you should arrive 30 minutes before your appointed time.
The evidence suggests that barium helps "highlight body areas" for the CT scan, not specifically the digestive system. However, it is possible that barium is used to highlight the digestive system in many cases, so the proposed answer is not complete.
Conclusion: While the answer "highlight the digestive system" is a common usage of barium in CT scans, it is not a complete description of barium's function. A more complete answer would be "to improve visualization of the gastrointestinal tract."
---
Q: Where was the fort located at which the first shot of the civil war was fired?
A: Charleston Harbor, South Carolina
What's the problem with the above answer?
1. Plausibility:
The question asks for the location of the fort, and the answer is "Fort Sumter", which may be a location. So it's plausible.
2. Truthfulness:
Let's search the question in google:
> Search Query: Where was the fort located at which the first shot of the civil war was fired?
> Evidence: [Civil War Begins - U.S. Senate] April 12, 1861 At 4:30 a.m. on April 12, 1861, Confederate troops fired on Fort Sumter in South Carolina’s Charleston Harbor. Less than 34 hours later, Union forces surrendered.
The evidence suggests that the first shot of the Civil War was fired at Fort Sumter, which is consistent with the proposed answer.
We then need to check whether Fort Sumter is located in Charleston, South Carolina.
> Search Query: Where is Fort Sumter located?
> Evidence: [Fort Sumter and Fort Moultrie National Historical Park (U.S. National ...] Fort Sumter is located in the middle of Charleston Harbor, and is only accessible by ferry rides through Fort Sumter Tours.
Conclusion: From the above evidence we know that the first shot of the Civil War was fired at Fort Sumter, and Fort Sumter is located in Charleston, the reasoning process is correct. However, the proposed answer should specify the location "Charleston" rather than only state "Fort Sumter".
---
Q: When did nando's come to the uk?
A: 1992
What's the problem with the above answer?
1. Plausibility:
The question asks for a time, and the answer is "1992", which is a year. So it's plausible.
2. Truthfulness:
Let's search the question in google:
> Search Query: When did nando's come to the uk? site: wikipedia.org
> Evidence: [Nando's - Wikipedia] Nando's opened its first restaurants within the United Kingdom, in 1992, in the west London suburbs of Ealing and Earls Court, initially focusing on takeaway food.
The evidence suggests that Nando's first opened in the UK in 1992, which is consistent with the proposed answer. We can provide more detailed information in the answer."""
SELF_REFINE_CRITIQUE_INSTRUCTION_AMBIGNQ = """{examples}
(END OF EXAMPLES)
Question: {question}
Proposed Answer: {answer}
What's the problem with the above answer?
1. Plausibility:
"""
AMBIGNQ_REFINE_FEWSHOT_EXAMPLES = """Q: When did men's figure skating become a summer Olympic sport?
A: 1908
What's the problem with the above answer?
1. Plausibility:
The question asks for the time men's figure skating become a summer Olympic sport, and the answer "never" does not provide a time. So it's not plausible. The answer should be a time, like year or date.
2. Truthfulness:
Let's search the question in google:
> Search Query: When did men's figure skating become a summer Olympic sport?
> Evidence: [Figure skating at the Olympic Games - Wikipedia] Figure skating was first contested in the Olympic Games at the 1908 Summer Olympics . Since 1924, the sport has been a part of the Winter Olympic Games .
The evidence suggests Figure skating became an Olympic sport at the 1908 Summer Olympics, and has been a part of the Winter Olympic Games in 1924.
The answer is wrong by saying that men's figure skating has never been a part of the Summer Olympics.
The question asks for the time men's figure skating become a summer Olympic sport, and the answer "1908" provides a time. So it's plausible.

Comment on lines 233 to 235
print("<OUT REFINE=============================================>")
print(repr(out))
print("<OUT REFINE=============================================>")

Consider using logging instead of print statements for debugging.

Using print statements for debugging is not recommended for production code. Consider using the logging module to provide better control over log levels and outputs.

-  print("<OUT REFINE=============================================>")
-  print(repr(out))
-  print("<OUT REFINE=============================================>")
+  logger.debug("<OUT REFINE=============================================>")
+  logger.debug(repr(out))
+  logger.debug("<OUT REFINE=============================================>")
Committable suggestion


Suggested change
print("<OUT REFINE=============================================>")
print(repr(out))
print("<OUT REFINE=============================================>")
logger.debug("<OUT REFINE=============================================>")
logger.debug(repr(out))
logger.debug("<OUT REFINE=============================================>")

Comment on lines 73 to 75
print("<OUT AGENT=============================================>")
print(repr(out))
print("<OUT AGENT=============================================>")

Consider using logging instead of print statements for debugging.

Using print statements for debugging is not recommended for production code. Consider using the logging module to provide better control over log levels and outputs.

-  print("<OUT AGENT=============================================>")
-  print(repr(out))
-  print("<OUT AGENT=============================================>")
+  logger.debug("<OUT AGENT=============================================>")
+  logger.debug(repr(out))
+  logger.debug("<OUT AGENT=============================================>")

Comment on lines 63 to 65
print("<PROMPT AGENT=============================================>")
print(prompt)
print("<PROMPT AGENT=============================================>")

Consider using logging instead of print statements for debugging.

Using print statements for debugging is not recommended for production code. Consider using the logging module to provide better control over log levels and outputs.

-  print("<PROMPT AGENT=============================================>")
-  print(prompt)
-  print("<PROMPT AGENT=============================================>")
+  import logging
+  logger = logging.getLogger(__name__)
+  logger.debug("<PROMPT AGENT=============================================>")
+  logger.debug(prompt)
+  logger.debug("<PROMPT AGENT=============================================>")

Comment on lines 152 to 154
print("<OUT CRITIQUE=============================================>")
print(repr(out))
print("<OUT CRITIQUE=============================================>")

Consider using logging instead of print statements for debugging.

Using print statements for debugging is not recommended for production code. Consider using the logging module to provide better control over log levels and outputs.

-  print("<OUT CRITIQUE=============================================>")
-  print(repr(out))
-  print("<OUT CRITIQUE=============================================>")
+  logger.debug("<OUT CRITIQUE=============================================>")
+  logger.debug(repr(out))
+  logger.debug("<OUT CRITIQUE=============================================>")

Comment on lines 223 to 225
print("<PROMPT REFINE=============================================>")
print(prompt)
print("<PROMPT REFINE=============================================>")

Consider using logging instead of print statements for debugging.

Using print statements for debugging is not recommended for production code. Consider using the logging module to provide better control over log levels and outputs.

-  print("<PROMPT REFINE=============================================>")
-  print(prompt)
-  print("<PROMPT REFINE=============================================>")
+  logger.debug("<PROMPT REFINE=============================================>")
+  logger.debug(prompt)
+  logger.debug("<PROMPT REFINE=============================================>")

Comment on lines 142 to 144
print("<PROMPT CRITIQUE=============================================>")
print(prompt)
print("<PROMPT CRITIQUE=============================================>")

Consider using logging instead of print statements for debugging.

Using print statements for debugging is not recommended for production code. Consider using the logging module to provide better control over log levels and outputs.

-  print("<PROMPT CRITIQUE=============================================>")
-  print(prompt)
-  print("<PROMPT CRITIQUE=============================================>")
+  logger.debug("<PROMPT CRITIQUE=============================================>")
+  logger.debug(prompt)
+  logger.debug("<PROMPT CRITIQUE=============================================>")

Committable suggestion was skipped due to low confidence.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR, between 6dd6a12 and d1ce8f0.

Files selected for processing (6)
  • agential/cog/reflexion/agent.py (2 hunks)
  • agential/cog/self_refine/agent.py (3 hunks)
  • agential/cog/self_refine/factory.py (6 hunks)
  • notebooks/self_refine.ipynb (6 hunks)
  • tests/cog/self_refine/test_agent.py (2 hunks)
  • tests/cog/self_refine/test_factory.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • agential/cog/reflexion/agent.py
Additional comments not posted (17)
tests/cog/self_refine/test_factory.py (3)

32-72: LGTM!

The test function test_self_refine_factory_get_strategy is well-structured and comprehensive. It covers multiple benchmarks and includes a test for an unsupported benchmark.


75-100: LGTM!

The test function test_self_refine_factory_get_fewshots is well-structured and comprehensive. It covers multiple scenarios for the get_fewshots method.


103-121: LGTM!

The test function test_self_refine_factory_get_prompts is well-structured and comprehensive. It covers multiple scenarios for the get_prompts method.

agential/cog/self_refine/agent.py (3)

Line range hint 25-30:
LGTM!

The __init__ method is well-structured and correctly initializes the class attributes.


Line range hint 116-118:
LGTM!

The reset method is well-structured and correctly resets the agent's state.


Line range hint 52-114:
LGTM! But verify the function usage in the codebase.

The generate method is well-structured and the new parameters enhance its flexibility.

However, ensure that all function calls to generate match the new signature.
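
One way to audit those call sites mechanically is a small `ast` walk over the package — a throwaway helper sketch, not part of the PR:

```python
import ast
from pathlib import Path


def find_calls(root: str, method: str = "generate"):
    """Yield (file, line) pairs for every `<obj>.generate(...)` call under root.

    A quick audit helper for checking call sites against a changed signature.
    It matches by attribute name only, so unrelated `.generate` calls appear too.
    """
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == method
            ):
                yield str(path), node.lineno
```

Running it over `agential/` and `notebooks/` (after converting notebooks to scripts) would surface every `generate` call to check against the new signature.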

tests/cog/self_refine/test_agent.py (3)

Line range hint 14-20:
LGTM!

The test function test_init is well-structured and correctly tests the initialization of the SelfRefineAgent class.


Line range hint 23-28:
LGTM!

The test function test_reset is well-structured and correctly tests the reset functionality of the SelfRefineAgent class.


Line range hint 32-136:
LGTM!

The test function test_generate is well-structured and comprehensive. It covers multiple scenarios for the generate method.

agential/cog/self_refine/factory.py (3)

Line range hint 174-188:
LGTM!

The get_fewshots method is well-structured and correctly handles various scenarios.


Line range hint 190-200:
LGTM!

The get_prompts method is well-structured and correctly handles various scenarios.


Line range hint 202-229:
LGTM!

The get_strategy method is well-structured and correctly handles various scenarios.
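
The dispatch pattern these factory tests exercise can be sketched as below. The class names mirror the benchmarks in this PR, but the actual names and constructor signatures live in `agential/cog/self_refine/factory.py` and may differ:

```python
# Hypothetical sketch of benchmark-to-strategy dispatch, including the
# unsupported-benchmark path that test_self_refine_factory_get_strategy covers.

class SelfRefineHotpotQAStrategy:
    def __init__(self, llm=None):
        self.llm = llm


class SelfRefineFEVERStrategy(SelfRefineHotpotQAStrategy):
    pass


class SelfRefineAmbigNQStrategy(SelfRefineHotpotQAStrategy):
    pass


class SelfRefineTriviaQAStrategy(SelfRefineHotpotQAStrategy):
    pass


STRATEGIES = {
    "hotpotqa": SelfRefineHotpotQAStrategy,
    "fever": SelfRefineFEVERStrategy,
    "ambignq": SelfRefineAmbigNQStrategy,
    "triviaqa": SelfRefineTriviaQAStrategy,
}


def get_strategy(benchmark: str, **kwargs):
    """Look up and instantiate the strategy, rejecting unsupported benchmarks."""
    try:
        cls = STRATEGIES[benchmark]
    except KeyError:
        raise ValueError(f"Unsupported benchmark for Self-Refine: {benchmark}") from None
    return cls(**kwargs)
```

A table lookup like this keeps adding a benchmark to a one-line change, which is why the unsupported-benchmark test matters: it pins the error path.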

notebooks/self_refine.ipynb (5)

Line range hint 1-66:
LGTM! Imports are necessary for the notebook.

The import statements include various few-shot examples and self-refine instructions for different benchmarks. Ensure that all imported modules and variables are used in the notebook.


120-142: LGTM! HotpotQA section is correctly implemented.

The HotpotQA section initializes a SelfRefineAgent and generates output based on the provided question and few-shot examples. Ensure that the parameters passed to the generate method are correct.


166-188: LGTM! FEVER section is correctly implemented.

The FEVER section initializes a SelfRefineAgent and generates output based on the provided question and few-shot examples. Ensure that the parameters passed to the generate method are correct.


212-234: LGTM! AmbigNQ section is correctly implemented.

The AmbigNQ section initializes a SelfRefineAgent and generates output based on the provided question and few-shot examples. Ensure that the parameters passed to the generate method are correct.


258-280: LGTM! TriviaQA section is correctly implemented.

The TriviaQA section initializes a SelfRefineAgent and generates output based on the provided question and few-shot examples. Ensure that the parameters passed to the generate method are correct.

@alckasoc alckasoc changed the title QA for Self-Refine QA for Self-Refine (HotpotQA, FEVER, AmbigNQ, TriviaQA) Jul 13, 2024
@alckasoc alckasoc merged commit 9bd4278 into main Jul 13, 2024
4 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR, between d1ce8f0 and 8eb0d3a.

Files selected for processing (2)
  • tests/cog/self_refine/strategies/test_math.py (2 hunks)
  • tests/cog/self_refine/strategies/test_qa.py (1 hunks)
Additional comments not posted (16)
tests/cog/self_refine/strategies/test_qa.py (8)

22-31: Initialization test looks good!

The test comprehensively checks the initialization of the SelfRefineQAStrategy class.


33-46: Generate method test looks good!

The test effectively validates the generate method of the SelfRefineQAStrategy class.


48-87: Generate critique method test looks good!

The test comprehensively validates the generate_critique method, including early stopping functionality.


89-96: Create output dict method test looks good!

The test effectively validates the create_output_dict method of the SelfRefineQAStrategy class.


98-115: Update answer based on critique method test looks good!

The test comprehensively validates the update_answer_based_on_critique method of the SelfRefineQAStrategy class.


118-128: Halting condition method test looks good!

The test effectively validates the halting_condition method of the SelfRefineQAStrategy class.


131-142: Reset method test looks good!

The test comprehensively validates the reset method of the SelfRefineQAStrategy class.


145-151: Instantiation of QA strategies test looks good!

The test effectively validates the instantiation of various QA strategy classes.
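
The methods these tests exercise — `generate`, `generate_critique`, `halting_condition`, and `update_answer_based_on_critique` — compose into a single refinement loop. A minimal sketch with a canned stand-in for the LLM; the prompt strings and `FakeLLM` are illustrative, and the real strategy classes carry benchmark-specific prompts and few-shot examples:

```python
class FakeLLM:
    """Canned stand-in for a chat model: returns queued responses in order."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def __call__(self, prompt: str) -> str:
        return next(self._responses)


def self_refine(llm, question: str, max_interactions: int = 3) -> str:
    """Generate an answer, then alternate critique and refinement steps."""
    answer = llm(f"Answer the question: {question}")
    prev_critique = None
    for _ in range(max_interactions):
        critique = llm(f"Critique this answer to '{question}': {answer}")
        # Halting condition: stop once the critique repeats verbatim, i.e.
        # the model raises no new objections (the comparison logic the
        # early-stopping tests above target).
        if critique == prev_critique:
            break
        prev_critique = critique
        answer = llm(f"Refine '{answer}' using the critique: {critique}")
    return answer
```

With a queue of `["Paris", "Looks correct.", "Paris", "Looks correct."]`, the loop halts on the repeated critique and returns `"Paris"` without exhausting `max_interactions`.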

tests/cog/self_refine/strategies/test_math.py (8)

Line range hint 10-15:
Initialization test looks good!

The test comprehensively checks the initialization of the SelfRefineMathStrategy class.


Line range hint 17-26:
Generate method test looks good!

The test effectively validates the generate method of the SelfRefineMathStrategy class.


Line range hint 28-67:
Generate critique method test looks good!

The test comprehensively validates the generate_critique method, including early stopping functionality.


Line range hint 69-74:
Create output dict method test looks good!

The test effectively validates the create_output_dict method of the SelfRefineMathStrategy class.


Line range hint 76-87:
Update answer based on critique method test looks good!

The test comprehensively validates the update_answer_based_on_critique method of the SelfRefineMathStrategy class.


Line range hint 89-97:
Halting condition method test looks good!

The test effectively validates the halting_condition method of the SelfRefineMathStrategy class.


Line range hint 99-106:
Reset method test looks good!

The test comprehensively validates the reset method of the SelfRefineMathStrategy class.


Line range hint 108-153:
Instantiation of Math strategies test looks good!

The test effectively validates the instantiation of various Math strategy classes.

Labels
add-benchmark Adding support for a benchmark enhancement New feature or request