[New Features] Multi-modal Jailbreaking Attack on LLaVA #587
Wow, thanks for this! We'll get it reviewed.
Looks pretty good generally. I think we want to put the generator into huggingface.py
and add some checks to the probe so we can ensure we're operating against the right generator type.
I think the class name should be |
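A minimal sketch of the kind of generator-type check suggested above, assuming the probe can declare which generator class it needs. The class names `Generator`, `VisionLanguageGenerator`, and `VisualJailbreakProbeSketch` are hypothetical stand-ins for illustration and are not garak's actual API:

```python
class Generator:
    """Hypothetical stand-in for a generic, text-only generator base class."""


class VisionLanguageGenerator(Generator):
    """Hypothetical stand-in for a generator that accepts text + image prompts."""


class VisualJailbreakProbeSketch:
    """Illustrative probe that refuses to run against the wrong generator type."""

    # Assumption: the probe names the generator class it is compatible with.
    required_generator = VisionLanguageGenerator

    def probe(self, generator):
        # Fail early with a clear message instead of sending image prompts
        # to a generator that cannot handle them.
        if not isinstance(generator, self.required_generator):
            raise TypeError(
                f"{type(self).__name__} needs a "
                f"{self.required_generator.__name__}, "
                f"got {type(generator).__name__}"
            )
        return f"probing with {type(generator).__name__}"
```

With this shape, a text-only generator is rejected before any prompts are sent, and a compatible one proceeds as normal.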
Co-authored-by: Leon Derczynski <[email protected]> Signed-off-by: Tianhao Li <[email protected]>
Looks awesome, great idea to include FigStep/SafeBench.
The image collection is a little large, and also I want to be careful about including another project's code. Could we implement a pattern where:
- The image files are not distributed with garak
- When first run, the probe checks for the image files (perhaps in the resources/ directory)
- If the image files are not there, we download from https://github.com/ThuCCSLab/FigStep/tree/main/data/images/SafeBench
- The default probe uses 80 images
- Optionally, a "Full" probe uses more images, but is deactivated by default. This can inherit from the 80 image version (or vice versa) - that's a pretty common pattern in garak
- The FigStep paper needs to be cited/recognised somewhere in the garak code, maybe in a docstring for the FigStep probe? With a reference and paper link? This seems appropriate enough
Thanks so much for working on this, this is very close and will be a big change when it hits. Really appreciate your contributions!
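The download-on-first-run pattern and the small/full probe split described in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the raw-content URL is inferred from the linked directory, and `ensure_images`, `FigStepSketch`, `FigStepFullSketch`, and the image file names are all hypothetical.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List

# Assumption: raw-content form of the SafeBench directory linked in the review.
SAFEBENCH_RAW_URL = (
    "https://raw.githubusercontent.com/ThuCCSLab/FigStep/main/data/images/SafeBench/"
)


def _download(url: str, dest: Path) -> None:
    """Default fetcher: download one file to dest."""
    urllib.request.urlretrieve(url, str(dest))


def ensure_images(
    resource_dir,
    filenames: Iterable[str],
    fetch: Callable[[str, Path], None] = _download,
) -> List[Path]:
    """Return local paths for filenames, fetching only the missing ones."""
    resource_dir = Path(resource_dir)
    resource_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in filenames:
        dest = resource_dir / name
        if not dest.exists():  # images are not shipped; fetch on first use
            fetch(SAFEBENCH_RAW_URL + name, dest)
        paths.append(dest)
    return paths


class FigStepSketch:
    """Default probe: 80 images, active by default."""

    active = True
    image_count = 80

    def image_names(self) -> List[str]:
        # Hypothetical naming scheme, for illustration only.
        return [f"image_{i}.png" for i in range(self.image_count)]


class FigStepFullSketch(FigStepSketch):
    """Full variant: inherits everything, uses more images, off by default."""

    active = False
    image_count = 500
```

Passing the fetcher in as a parameter keeps the existence check testable without network access; the full variant only overrides the two attributes that differ from the default probe.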
…arak/probes/visual_jailbreak.py
The FigStep paper title, arXiv link, and ACM-format reference have been added to the docstring in 39c57a2.
Full (size 500) and small (size 80) FigStep probe classes have been added in de6c6ef. Additionally, size check logic for |
Detector needs improvement, but this is probably best done with LLM-as-a-judge + the FigStep "instruction" columns (cf e.g. https://github.com/ThuCCSLab/FigStep/blob/main/data/question/SafeBench-Tiny.csv)
We need to work on this, but I'm happy for that to be tracked in a separate issue/PR
pinging llm-as-a-judge issue: #419
Yes! We are working on this now.
We can now download two versions (tiny and full) of the SafeBench dataset
from the external repo!
Thanks for your review.
This PR includes:
Please let me know if there is anything I can do better.