FEAT: Add AttackObjective table #726

Open
rlundeen2 opened this issue Feb 19, 2025 · 10 comments

@rlundeen2
Contributor

rlundeen2 commented Feb 19, 2025

[updating 3/10. The original is below the updated most recent design]

In the database, we should have a new table named ConversationAttackResult. A single conversation can have N ConversationAttackResults, but only one should be used (whichever is not pruned and has the highest confidence). A ConversationAttackResult has:

  • id
  • orchestrator_identifier
  • objective
  • SeedPromptGroupId
  • labels
  • conversation_id
  • result (a literal that is success|fail|pruned|adversarial_generation|in_progress)
  • score_id (foreign key to the score that produced the ConversationAttackResult - optional)
  • confidence (Confidence that the attack met the objective. A float between 0 and 1. By default it will be 0, but it can be overridden, for example by a human on rescoring. For a first PR, this can be mostly ignored, but please add it to the db)
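
As a rough illustration of the fields above, one record could be sketched as a plain Python dataclass. This is not the PyRIT schema: the field types, the snake_case name `seed_prompt_group_id`, the generated UUID `id`, and the validation in `__post_init__` are all assumptions drawn only from the list, and the real table would presumably be defined through the project's ORM.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Literal, Optional, get_args

# Allowed values for `result`, taken from the field list above.
AttackOutcome = Literal["success", "fail", "pruned", "adversarial_generation", "in_progress"]


@dataclass
class ConversationAttackResult:
    """Hypothetical in-memory shape of one row; all types are assumptions."""

    orchestrator_identifier: dict[str, str]
    objective: str
    conversation_id: str
    result: AttackOutcome
    seed_prompt_group_id: Optional[str] = None
    labels: dict[str, str] = field(default_factory=dict)
    score_id: Optional[str] = None  # optional link to the score that produced this result
    confidence: float = 0.0  # 0 by default; may be raised later, e.g. on human rescoring
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        # Reject values outside the literal set and the [0, 1] confidence range.
        if self.result not in get_args(AttackOutcome):
            raise ValueError(f"unknown result: {self.result!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```

A record created without an explicit confidence keeps the default of 0, which is what lets a later rescoring with a higher confidence take precedence.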

There are a few different scenarios to think through.

  • MultiTurnOrchestrators:
    • Every MultiTurnOrchestrator conversation with the objective target will generate one ConversationAttackResult (which can mean many ConversationAttackResults for an orchestrator, since many objectives can be sent, and many conversations can be pruned or held with an adversarial_target).
    • At the end of a run, every conversation for a multi-turn orchestrator objective will have a "pruned" or "adversarial_generation" ConversationAttackResult EXCEPT the best result, which will be success or fail. This is similar to the current return value for multi-turn orchestrators. Future: The confidence should be passed as an argument to the orchestrator, but set low by default. The low confidence will allow re-scoring attempts (like human-in-the-loop scoring) to take precedence.
    • It is possible to add more ConversationAttackResults to a conversation (for example, rescoring after a run)
  • memory_interface
    • should have a get_attack_results method. This should filter similar to get_prompt_request_pieces and have orchestrator_id, objective, objective_hash, and conversation_id as parameters.
    • should have a get_final_attack_result method (future). Given an objective, if multiple ConversationAttackResults are present, the one with the highest confidence that is not pruned or adversarial should "win" because ultimately there should be at most one "most correct" result per conversation.
  • Scorers (future)
    • Scorers should be updated to take an "objective" instead of a task. This is because an objective is often distinct from the task. But we should also make it easier to gather things like user_request so we can use those too.
  • PromptSendingOrchestrator (future):
    • In the future, PromptSendingOrchestrator should have an option to create new ConversationAttackResults based on scores of results - but this should be an optional argument. This is in addition to other scorers that do not necessarily add ConversationAttackResult.
    • Similarly, confidence should be an argument here with a low default value.
    • It's possible to have scorers that do not add a ConversationAttackResult. For example, AzureContentFilter is a perfectly valid scorer but does little to say whether an objective was achieved.
    • The format for SendPromptsAsync should be modified to match multi-turn. E.g. the prompts argument should be renamed to objectives and the method should be run_attacks_async
  • PromptScoringOrchestrator (future):
    • PromptScoringOrchestrator should be updated to rescore conversations and add new ConversationAttackResults, including ones with configurable confidence.
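
The two memory_interface methods described above can be sketched over an in-memory list. This is only an illustration under stated assumptions: `AttackResultRow` is a hypothetical, trimmed-down stand-in for a ConversationAttackResult, the objective hash is assumed to be a SHA-256 of the objective text, and the real methods would query the database rather than filter a Python list.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional

# Outcomes that never count as the final answer, per the description above.
NON_FINAL = {"pruned", "adversarial_generation"}


@dataclass(frozen=True)
class AttackResultRow:
    """Hypothetical stand-in for one ConversationAttackResult row."""

    orchestrator_id: str
    objective: str
    conversation_id: str
    result: str
    confidence: float = 0.0

    @property
    def objective_hash(self) -> str:
        # Assumption: the hash is a SHA-256 of the objective text.
        return hashlib.sha256(self.objective.encode("utf-8")).hexdigest()


def get_attack_results(
    rows: Iterable[AttackResultRow],
    *,
    orchestrator_id: Optional[str] = None,
    objective: Optional[str] = None,
    objective_hash: Optional[str] = None,
    conversation_id: Optional[str] = None,
) -> list[AttackResultRow]:
    """Return the rows that match every filter that was actually supplied."""
    wanted = {
        "orchestrator_id": orchestrator_id,
        "objective": objective,
        "objective_hash": objective_hash,
        "conversation_id": conversation_id,
    }
    return [
        row
        for row in rows
        if all(value is None or getattr(row, name) == value for name, value in wanted.items())
    ]


def get_final_attack_result(
    rows: Iterable[AttackResultRow], objective: str
) -> Optional[AttackResultRow]:
    """Pick the highest-confidence result that is neither pruned nor adversarial."""
    candidates = [
        row for row in get_attack_results(rows, objective=objective) if row.result not in NON_FINAL
    ]
    # On a confidence tie, max() keeps the first candidate it encountered.
    return max(candidates, key=lambda row: row.confidence, default=None)
```

In this sketch an in_progress result is still a candidate, because the description only excludes pruned and adversarial results; whether it should be excluded too is an open design choice.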

For MVP, we should add this to MultiTurnOrchestrator and create sub-issues for the future work like PromptScoringOrchestrator and Single turn.


Below this text is outdated

We need to coalesce details about an attack and tie it to promptRequestPiece. To start, this can be basic, but grow over time.

In the database, we should have a new table named AttackConfiguration. A single orchestrator can have multiple AttackConfigurations, but each "objective" would have a single AttackConfiguration - e.g. in PromptSendingOrchestrator, every prompt sent would have a single AttackConfiguration. The AttackConfiguration table looks like:

  • id
  • orchestrator_identifier (this should move from PromptMemoryEntry to here)
  • conversation_objective
  • labels (this should move from PromptMemoryEntry to here)

PromptMemoryEntry, besides removing labels and removing orchestrator_identifier, should add:

  • attack_configuration_id
  • target_type: Literal["scorer_target", "objective_target", "adversarial_target", "other"]

We likely want to add more information to this table. But maybe postpone this and keep as simple as we can to start.

PromptRequestPiece should put all this together automatically, like how we do with scores today, and include the various pieces (conversation_objective, orchestrator_identifier, labels). In other words, when memoryInterface returns PromptRequestPiece, these should always have populated conversation_objectives. This needs to be present for MVP

Similarly, an AttackConfiguration class can add methods to help find start and end of attack, etc. This doesn't need to be present immediately.

There will likely be nuances as we plumb this through. But there are a lot of advantages. E.g. we can get rid of task for score methods, because we can just use the conversation_objective

This is a hard first issue to tackle. If anyone volunteers, please work closely with the core team. Else, we'll probably tackle it.

ref: #724

@romanlutz romanlutz changed the title FEAT: Add Objectives to converstations FEAT: Add Objectives to conversations Feb 20, 2025
@imranbohoran

@rlundeen2 - Thinking of this through another lens, what are your thoughts on having a separate attacks table? That table can hold data pertaining to an attack (i.e. an ID, start and end timestamps, objective, a type (single-turn or multi-turn), perhaps a name, etc.). When pulling out and capturing data for further analysis, start and end times for an attack seemed useful to have. We of course have the timestamp for the prompt entries, but an attack that produces multiple prompts, especially a multi-turn one like Crescendo, can run for a longer period. Every prompt can have a reference back to the attack (ScoreEntry.attack_id -> Attacks.id).
I view the orchestrator as the technique that orchestrates the attack(s), so I was thinking an attack could be modelled separately so that all attack instance properties are held together. I would be interested to hear your thoughts on this.

@rlundeen2
Contributor Author

rlundeen2 commented Feb 24, 2025

I like this approach. I'm going to update the ticket with some adjustments. Before anyone tackles this I'd love @romanlutz to take a look and help adjust with feedback

@rlundeen2 rlundeen2 changed the title FEAT: Add Objectives to conversations FEAT: Add AttackConfiguration table, allowing objectives in memory Feb 24, 2025
@romanlutz
Contributor

Nice discussion and ideas! I generally agree with all that was written above, including the updated description of the work item by @rlundeen2. This strikes me as a rather huge item and should probably be handled by us (?)

The TreeOfAttacksWithPruningOrchestrator (short: TAP) used to spin up multiple orchestrators under the hood, but I think you refactored that away @rlundeen2, right?

@rlundeen2
Contributor Author

Correct, TreeOfAttacks is just one orchestrator now

@romanlutz romanlutz added the enhancement New feature or request label Feb 25, 2025
@rlundeen2
Contributor Author

rlundeen2 commented Feb 25, 2025

@imranbohoran we're okay with you taking this - honestly it would be a huge contribution! We probably wouldn't get to it for at least six weeks otherwise. It is a tough one to tackle, but we can support any questions here and/or via Discord, and the team is happy with the design above. Do you want us to assign it to you?

@romanlutz
Contributor

... and you can always start a draft PR where we provide instant feedback 🙂 If you want to tackle it that is.

@imranbohoran

Thanks both. Happy to take this on and love the engagement and support on this. Please go ahead and assign it to me. I'll start with some draft PR(s) as suggested and be in close communication with you folks. Will post updates here, so we have a trail of the conversations.

imranbohoran added a commit to Mindgard/PyRIT that referenced this issue Mar 6, 2025
* Following the conversation at Azure#726
  the AttackConfiguration concept is introduced and tried out with a
  CrescendoOrchestrator and PromptSendingOrchestrator.
* The PromptEntry has a link to the AttackConfiguration and the SeedPrompt
  carries the AttackConfiguration through the function calls.
* This commit is only serving as a PoC of introducing the AttackConfiguration
  concept and the different touch points in the code base for it to work.
* As of this commit, using a Crescendo Orchestrator and PromptSendingOrchestrator
  will result in a populated AttackConfiguration and prompt entries linked to the same.
* This commit does not handle any tests, given that certain function arguments have
  been changed.

Co-authored-by: Ayomide Apantaku <[email protected]>
@imranbohoran

imranbohoran commented Mar 7, 2025

Hi @romanlutz/ @rlundeen2. We've created a draft PR to validate/get feedback on the approach we were thinking based on our understanding. There's a write-up in the PR description where we try to capture the highlights. It's a bit of a lengthy PR given that we wanted to demonstrate an end-to-end slice using 2 orchestrators. Would really appreciate your inputs/feedback.

@rlundeen2 rlundeen2 changed the title FEAT: Add AttackConfiguration table, allowing objectives in memory FEAT: Add AttackResults table Mar 11, 2025
@rlundeen2
Contributor Author

@imranbohoran thanks for meeting with us a couple of weeks ago. Checking in: are you still working on this? Any questions/comments?

@rlundeen2 rlundeen2 changed the title FEAT: Add AttackResults table FEAT: Add Attack table Mar 25, 2025
@rlundeen2 rlundeen2 changed the title FEAT: Add Attack table FEAT: Add AttackObjective table Mar 26, 2025
@rlundeen2
Contributor Author

@imranbohoran we're starting some related work so this will update! Design docs just don't work well as a github issue. I am happy to share if/when you have time, but ETA about a month or so as we chip away at different scenarios (related to above and including some other stuff).

For now, I'm assigning this to myself but there are a lot of tasks here so message us on Discord to coordinate if you want to take any

@rlundeen2 rlundeen2 self-assigned this Mar 28, 2025