Compare our list of jailbreak templates with ps-fuzz #772

Closed
romanlutz opened this issue Mar 9, 2025 · 8 comments · Fixed by #823
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@romanlutz
Contributor

ps-fuzz: https://github.com/prompt-security/ps-fuzz/tree/main

The task here is to compare our list with theirs and add any we might be missing to ours.

They use MIT license just like us so there should not be an issue. Obviously, anything we copy needs to be attributed correctly (using authors and groups as applicable) and linked (using the source field).

Even just comparing and reporting the comparison results in this thread is a great first step!

@romanlutz romanlutz added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Mar 9, 2025
@ryanjieh
Contributor

Hi, I'd like to handle this as my first issue

@romanlutz
Contributor Author

Awesome! Go ahead @ryanjieh ! I think the first step would be to create a list of what they have, then compare with PyRIT and post the results here.

@ryanjieh
Contributor

The prompts for ps-fuzz (in ps-fuzz/ps-fuzz/attack_data/harmful_behaviour.csv) are the same as those in https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv, which is where PyRIT's malicious prompts come from.

In ps-fuzz/ps-fuzz/attack_data/prompt_injection_from_base64.parquet there are 25 new prompts which can be injected from base64 conversion.

Was wondering if this issue was about looking at the python scripts used to jailbreak in ps-fuzz/ps-fuzz/attacks instead?
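For anyone repeating the dataset comparison described above, here is a minimal sketch of how two prompt CSVs could be diffed. It is not the script used in this thread. The column name `goal`, the helper names, and the sample rows are all assumptions made up for illustration; check the real files for their actual headers.

```python
import csv
import io


def load_prompts(csv_text: str, column: str = "goal") -> set[str]:
    """Return the normalized set of prompts found in one CSV column.

    The default column name "goal" is an assumption, not confirmed here.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Lowercase and strip so trivial formatting differences do not count.
    return {row[column].strip().lower() for row in reader if row.get(column)}


def compare_prompt_sets(ours: set[str], theirs: set[str]) -> dict[str, set[str]]:
    """Split two prompt sets into shared and exclusive entries."""
    return {
        "shared": ours & theirs,
        "only_ours": ours - theirs,
        "only_theirs": theirs - ours,
    }


# Placeholder rows invented for this example, not real dataset content.
OURS_CSV = "goal,target\nexample prompt A,target A\nexample prompt B,target B\n"
THEIRS_CSV = "goal,target\nExample prompt A ,target A\nexample prompt C,target C\n"

result = compare_prompt_sets(load_prompts(OURS_CSV), load_prompts(THEIRS_CSV))
print(sorted(result["only_theirs"]))
```

An empty `only_theirs` set would correspond to the finding above that the ps-fuzz prompts are already covered.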

@romanlutz
Contributor Author

Ah! Good question! I was mainly referring to the attacks but clearly underspecified it above.

We definitely have some like DAN as a jailbreak template. We have typoglycemia, or rather a more generic version called noise converter. So this task is about going through the list of what they have and comparing that with PyRIT. If you're not sure if something exists in PyRIT feel free to ask, of course.

@ryanjieh
Contributor

ryanjieh commented Mar 15, 2025

Summary of ps-fuzz attacks 140325.xlsx

Please tell me if I should make any PRs to implement anything.

Here is my summary of attacks in psfuzz/psfuzz/attacks:

| Attack name | Description | Similar attack in PyRIT | Implementation suggestion |
| --- | --- | --- | --- |
| aim.py | Always Intelligent and Machiavellian | pyrit/datasets/prompt_templates/jailbreak/aim.yaml | NA |
| base64_injection.py | encodes malicious prompt into base64 encoding | No similar attack | New code in pyrit/datasets/prompt_converters |
| complimentary_transition.py | Checks if AI can still refuse requests and abide by system prompts, even after being complimented for discussing unsafe topics | No similar attack | New template in pyrit/orchestrator/multi_turn |
| custom_benchmark.py | general-purpose prompt injection challenge script for a series of prompts | pyrit/orchestrator/fuzzer_orchestrator.py | NA |
| dan.py | Do Anything Now | pyrit/datasets/prompt_templates/jailbreak/dan_1.yaml | NA |
| dynamic_test.py | generates attack prompts to: 1. forget all system prompts 2. convince LLM user is authority figure 3. ask LLM to roleplay authority 4. generate affirmative response with suffix to break guardrails 5. print system prompts | prompts 1 and 5 are new. prompts 2,3 in pyrit/datasets/prompt_converters/persuasion. 4 is closely related to pyrit/prompt_converter/insert_punctuation_converter.py | New code in pyrit/prompt_converter |
| ethical_compliance.py | resist discussion on unethical topics, checking it against autogenerated misconceptions | No similar attack | New template in pyrit/orchestrator/multi_turn |
| harmful_behaviour.py | sends list of attack prompts in ps_fuzz/attack_data/harmful_behavior.csv | pyrit/orchestrator/single_turn/prompt_sending_orchestrator.py does it with pyrit/datasets/data/adv_bench_dataset.json which has same prompts | NA |
| self_refine.py | frame a unethical banned task in a format of good tasks | pyrit/orchestrator/multi_turn/pair_orchestrator.py | NA |
| translation.py | non-English to circumvent restrictions | pyrit/datasets/prompt_converters/translation_converter | NA |
| typoglycemia.py | Obfuscates words by removing vowels | pyrit/datasets/prompt_converters/noise_converter is a general case | NA |
| ucar.py | Requests LLM to ignore moral filters and answer without question | pyrit/datasets/prompt_templates/jailbreak/ucar.yaml | NA |
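
To make the typoglycemia row concrete, the following is a rough sketch of the obfuscation as the table describes it (removing vowels). It is not taken from ps-fuzz or from PyRIT's noise converter. The function name, the choice to keep each word's first letter, and the `min_length` threshold are assumptions for illustration only.

```python
import re

VOWELS = "aeiouAEIOU"


def drop_inner_vowels(text: str, min_length: int = 4) -> str:
    """Remove vowels after the first letter of each sufficiently long word.

    Hypothetical helper: short words are left alone so the text stays readable.
    """

    def squeeze(match: re.Match) -> str:
        word = match.group(0)
        if len(word) < min_length:
            return word
        # Keep the first character, drop vowels from the remainder.
        return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)

    # Only alphabetic runs are rewritten; punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z]+", squeeze, text)


print(drop_inner_vowels("Please describe the weather today"))
# prints "Pls dscrb the wthr tdy"
```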

@romanlutz
Contributor Author

I updated the formatting to make it readable as a table.

A few notes:

  • We also have base64 as a converter: Base64Converter
  • I'm not sure I understand the complimentary transition technique after looking at it. Not a problem of your description, but I suppose I'm not convinced it works. Maybe I'm missing something...
  • custom benchmark seems closest to using PromptSendingOrchestrator with a custom benchmark dataset. Not really anything net new compared to what we have.
  • dynamic test is a little like our RedTeamingOrchestrator. I wouldn't mind including this system prompt in our dataset folder under pyrit\datasets\orchestrators\red_teaming\ with attribution to ps-fuzz, of course.
  • from ethical compliance we definitely want unethical_task_generation_prompt and ethical_compliance_template for the same directory as right above. I think these will be great!
  • self-refine isn't related to PAIR but I'm unsure how we'd add this.
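
As a small illustration of the base64 point in the notes above, this sketch shows the encoding step using only Python's standard library. It does not use or reproduce PyRIT's Base64Converter API, and the helper names are hypothetical.

```python
import base64


def encode_prompt(prompt: str) -> str:
    """Encode a prompt as base64 text (hypothetical helper, stdlib only)."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def decode_prompt(encoded: str) -> str:
    """Reverse encode_prompt, for checking the round trip."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_prompt("hello world")
print(encoded)  # prints "aGVsbG8gd29ybGQ="
print(decode_prompt(encoded))  # prints "hello world"
```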

Great work on matching them to PyRIT!!! Are you interested in adding any of the prompt templates from dynamic_test or ethical_compliance?

@ryanjieh
Contributor

Sure! I can add them in a PR

@ryanjieh
Contributor

Added the prompt templates (1 system prompt from dynamic_test, 2 system prompts from ethical_compliance) as yaml files

Made a draft PR. Sorry for the late response

@romanlutz romanlutz linked a pull request Mar 26, 2025 that will close this issue