FEAT: Add AttackObjective table #726

rlundeen2 · 2025-02-19T17:49:41Z

[Updated 3/10. The original text is below the most recent design.]

In the database, we should have a new table named ConversationAttackResult. A single conversation can have N ConversationAttackResults, but only one should be used (whichever is not pruned and has the highest confidence). A ConversationAttackResult has:

- id
- orchestrator_identifier
- objective
- SeedPromptGroupId
- labels
- conversation_id
- result (a literal that is success|fail|pruned|adversarial_generation|in_progress)
- score_id (foreign key to the score that produced the ConversationAttackResult - optional)
- confidence (Confidence that the attack met the objective. A float between 0 and 1. By default it will be 0, but can be overridden for example by a human on rescoring. For a first PR, this can be mostly ignored, but please add it to the db)
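To make the field list concrete, here is a minimal sketch of one row as a plain Python dataclass. It is not the actual PyRIT model: the Python types (for example, a dict for `orchestrator_identifier` and `labels`, string IDs) and the snake_case name `seed_prompt_group_id` are assumptions for illustration only.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Literal, Optional

# Allowed values for the result column, as listed in the design above.
AttackResultStatus = Literal[
    "success", "fail", "pruned", "adversarial_generation", "in_progress"
]


@dataclass
class ConversationAttackResult:
    """Illustrative in-memory shape of one row; field types are assumed."""

    conversation_id: str
    objective: str
    orchestrator_identifier: dict[str, str]
    result: AttackResultStatus = "in_progress"
    seed_prompt_group_id: Optional[str] = None
    labels: dict[str, str] = field(default_factory=dict)
    score_id: Optional[str] = None  # optional foreign key to the producing score
    confidence: float = 0.0  # defaults to 0; a human rescoring may override it
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self) -> None:
        # The design constrains confidence to a float between 0 and 1.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```

A real implementation would presumably be a database entry alongside the existing memory entries; the dataclass only pins down the fields and the default confidence.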

There are a few different scenarios to think through.

MultiTurnOrchestrators:
- Every MultiTurnOrchestrator conversation with the objective target will generate one ConversationAttackResult (which can mean many ConversationAttackResults for an orchestrator, since many objectives can be sent, and many conversations can be pruned or held with an adversarial_target).
- At the end of a run, every conversation for a multi-turn orchestrator objective will have a "pruned" or "adversarial_generation" ConversationAttackResult EXCEPT the best result, which will be success or fail. This is similar to the current return value for multi-turn orchestrators. Future: the confidence should be passed as an argument to the orchestrator, but set low by default. The low confidence will allow re-scoring attempts (like human-in-the-loop scoring) to take precedence.
- It is possible to add more ConversationAttackResults to a conversation (for example, rescoring after a run).
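
The end-of-run rule for multi-turn orchestrators could be sketched roughly as follows. `FinishedConversation`, `label_results`, and their fields are hypothetical names for illustration; how the orchestrator actually decides which conversation is "best" is not specified here, so the sketch takes it as an input.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FinishedConversation:
    """Hypothetical summary of one conversation at the end of a run."""

    conversation_id: str
    with_adversarial_target: bool  # True for conversations held with the adversarial_target
    objective_achieved: bool = False


def label_results(
    conversations: List[FinishedConversation], best_conversation_id: str
) -> Dict[str, str]:
    """Map each conversation_id to the result literal it would receive."""
    labels: Dict[str, str] = {}
    for conv in conversations:
        if conv.with_adversarial_target:
            labels[conv.conversation_id] = "adversarial_generation"
        elif conv.conversation_id == best_conversation_id:
            # Only the best objective-target conversation gets success or fail.
            labels[conv.conversation_id] = (
                "success" if conv.objective_achieved else "fail"
            )
        else:
            labels[conv.conversation_id] = "pruned"
    return labels
```
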
memory_interface:
- should have a get_attack_results method. This should filter similarly to get_prompt_request_pieces and have orchestrator_id, objective, objective_hash, and conversation_id as parameters.
- should have a get_final_attack_result method (future). Given an objective, if multiple ConversationAttackResults are present, the one with the highest confidence that is not pruned or adversarial should "win", because ultimately there should be at most one "most correct" result per conversation.
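
As a rough sketch of the two memory_interface methods, the functions below operate on an in-memory list rather than the database. `AttackResultRow` is a hypothetical stand-in for a stored row, `objective_hash` is left out for brevity, and the tie-breaking behavior is an assumption since the design does not specify one.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class AttackResultRow:
    """Hypothetical stand-in for a stored ConversationAttackResult."""

    conversation_id: str
    objective: str
    result: str  # success | fail | pruned | adversarial_generation | in_progress
    confidence: float = 0.0
    orchestrator_id: Optional[str] = None


def get_attack_results(
    rows: Iterable[AttackResultRow],
    *,
    orchestrator_id: Optional[str] = None,
    objective: Optional[str] = None,
    conversation_id: Optional[str] = None,
) -> List[AttackResultRow]:
    """Return rows matching every filter that was supplied (None means 'any')."""
    return [
        row
        for row in rows
        if (orchestrator_id is None or row.orchestrator_id == orchestrator_id)
        and (objective is None or row.objective == objective)
        and (conversation_id is None or row.conversation_id == conversation_id)
    ]


def get_final_attack_result(
    rows: Iterable[AttackResultRow], objective: str
) -> Optional[AttackResultRow]:
    """Pick the highest-confidence result that is neither pruned nor adversarial."""
    candidates = [
        row
        for row in get_attack_results(rows, objective=objective)
        if row.result not in ("pruned", "adversarial_generation")
    ]
    # max() keeps the first row on ties; a real implementation may want a tiebreak.
    return max(candidates, key=lambda row: row.confidence, default=None)
```

With this shape, a later human rescoring with higher confidence simply adds another row and takes precedence over the orchestrator's low-confidence default.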
Scorers (future):
- Scorers should be updated to take an "objective" instead of a task, because an objective is often distinct from the task. We should also make it easier to gather things like user_request so we can use those as well.
PromptSendingOrchestrator (future):
- In the future, PromptSendingOrchestrator should have an option to create new ConversationAttackResults based on scores of results, but this should be an optional argument. This is in addition to other scorers that do not necessarily add a ConversationAttackResult.
- Similarly, confidence should be an argument here with a low default value.
- It's possible to have scorers that do not add a ConversationAttackResult. For example, AzureContentFilter is a perfectly valid scorer, but it says little about whether an objective was achieved.
- The format for SendPromptsAsync should be modified to match multi-turn. E.g. the prompts argument should be renamed to objectives and the method should be run_attacks_async.
PromptScoringOrchestrator (future):
- PromptScoringOrchestrator should be updated to rescore conversations and add new ConversationAttackResults, including ones with configurable confidence.

For MVP, we should add this to MultiTurnOrchestrator and create sub-issues for the future work like PromptScoringOrchestrator and Single turn.

The text below is outdated.

We need to coalesce details about an attack and tie it to promptRequestPiece. To start, this can be basic, but grow over time.

In the database, we should have a new table named AttackConfiguration. A single orchestrator can have multiple AttackConfigurations, but each "objective" would have a single AttackConfiguration - e.g. in PromptSendingOrchestrator, every prompt sent would have a single AttackConfiguration. The AttackConfiguration table looks like:

- id
- orchestrator_identifier (this should move from PromptMemoryEntry to here)
- conversation_objective
- labels (this should move from PromptMemoryEntry to here)

PromptMemoryEntry, besides removing labels and removing orchestrator_identifier, should add:

- attack_configuration_id
- target_type: Literal["scorer_target", "objective_target", "adversarial_target", "other"]

We likely want to add more information to this table. But maybe postpone this and keep as simple as we can to start.

PromptRequestPiece should put all this together automatically, like how we do with scores today, and include the various pieces (conversation_objective, orchestrator_identifier, labels). In other words, when memoryInterface returns PromptRequestPiece, these should always have populated conversation_objectives. This needs to be present for MVP

Similarly, an AttackConfiguration class can add methods to help find start and end of attack, etc. This doesn't need to be present immediately.

There will likely be nuances as we plumb this through, but there are a lot of advantages. For example, we can get rid of task for score methods, because we can just use the conversation_objective.

This is a hard first issue to tackle. If anyone volunteers, please work closely with the core team. Else, we'll probably tackle it.

ref: #724


imranbohoran · 2025-02-23T21:31:04Z

@rlundeen2 - Thinking of this through another lens, what are your thoughts on having a separate attacks table? That table can hold data pertaining to an attack (i.e. an ID, start and end timestamps, objective, a type (single-turn or multi-turn), perhaps a name, etc.). When pulling out and capturing the data for further analysis, start and end times for an attack seemed useful to have. We of course have the timestamp for the prompt entries, but an attack that produces multiple prompts, especially a multi-turn one like crescendo, can run for a longer period. Every prompt can have a reference back to the attack (ScoreEntry.attack_id -> Attacks.id).
I view the orchestrator as the technique that orchestrates the attack(s), so I was wondering whether an attack can be modelled separately so that all attack instance properties can be held together. I would be interested to hear your thoughts on this.
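
A rough illustration of what such a table could look like, using an in-memory SQLite database. The table names `attacks` and `prompt_entries`, the column names, and the sample values are all hypothetical; only the general shape (ID, timestamps, objective, type, name, and a back-reference from each prompt) comes from the comment above.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the attack_id reference
conn.executescript(
    """
    CREATE TABLE attacks (
        id TEXT PRIMARY KEY,
        name TEXT,
        objective TEXT NOT NULL,
        attack_type TEXT NOT NULL
            CHECK (attack_type IN ('single-turn', 'multi-turn')),
        start_time TEXT NOT NULL,
        end_time TEXT
    );
    CREATE TABLE prompt_entries (
        id TEXT PRIMARY KEY,
        attack_id TEXT NOT NULL REFERENCES attacks(id),
        timestamp TEXT NOT NULL
    );
    """
)

# One multi-turn attack that produced two prompts over five minutes.
conn.execute(
    "INSERT INTO attacks VALUES (?, ?, ?, ?, ?, ?)",
    ("a1", "crescendo-demo", "example objective", "multi-turn",
     "2025-02-23T21:00:00", "2025-02-23T21:05:00"),
)
conn.executemany(
    "INSERT INTO prompt_entries VALUES (?, ?, ?)",
    [("p1", "a1", "2025-02-23T21:00:10"), ("p2", "a1", "2025-02-23T21:04:50")],
)

# Attack-level duration comes from the attacks row, not from prompt timestamps.
start, end = conn.execute(
    "SELECT start_time, end_time FROM attacks WHERE id = ?", ("a1",)
).fetchone()
duration_seconds = (
    datetime.fromisoformat(end) - datetime.fromisoformat(start)
).total_seconds()
prompt_count = conn.execute(
    "SELECT COUNT(*) FROM prompt_entries WHERE attack_id = ?", ("a1",)
).fetchone()[0]
```
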

rlundeen2 · 2025-02-24T02:24:04Z

I like this approach. I'm going to update the ticket with some adjustments. Before anyone tackles this I'd love @romanlutz to take a look and help adjust with feedback

romanlutz · 2025-02-24T23:47:46Z

Nice discussion and ideas! I generally agree with all that was written above, including the updated description of the work item by @rlundeen2. This strikes me as a rather huge item and should probably be handled by us (?)

The TreeOfAttacksWithPruningOrchestrator (short: TAP) used to spin up multiple orchestrators under the hood, but I think you refactored that away, @rlundeen2, right?

rlundeen2 · 2025-02-25T00:15:33Z

Correct, TreeOfAttacks is just one orchestrator now

rlundeen2 · 2025-02-25T17:46:18Z

@imranbohoran we're okay with you taking this - honestly it would be a huge contribution! We probably wouldn't get to it in at least six weeks otherwise. It is a tough one to tackle but we can also support any questions here and/or via Discord - and the team is happy with the design above. Do you want us to assign to you?

romanlutz · 2025-02-26T04:25:46Z

... and you can always start a draft PR where we provide instant feedback 🙂 If you want to tackle it that is.

imranbohoran · 2025-02-26T09:44:06Z

Thanks both. Happy to take this on and love the engagement and support on this. Please go ahead and assign it to me. I'll start with some draft PR(s) as suggested and be in close communication with you folks. Will post updates here, so we have a trail of the conversations.

* Following the conversation at Azure#726, the AttackConfiguration concept is introduced and tried out with a CrescendoOrchestrator and a PromptSendingOrchestrator.
* The PromptEntry has a link to the AttackConfiguration, and the SeedPrompt carries the AttackConfiguration through the function calls.
* This commit only serves as a PoC of introducing the AttackConfiguration concept and the different touch points in the code base needed for it to work.
* As of this commit, using a CrescendoOrchestrator and PromptSendingOrchestrator will result in a populated AttackConfiguration and prompt entries linked to it.
* This commit does not handle any tests, given that certain function arguments have been changed.

Co-authored-by: Ayomide Apantaku <[email protected]>

imranbohoran · 2025-03-07T09:44:02Z

Hi @romanlutz/ @rlundeen2. We've created a draft PR to validate/get feedback on the approach we were thinking based on our understanding. There's a write-up in the PR description where we try to capture the highlights. It's a bit of a lengthy PR given that we wanted to demonstrate an end-to-end slice using 2 orchestrators. Would really appreciate your inputs/feedback.

rlundeen2 · 2025-03-21T17:35:21Z

@imranbohoran thanks for meeting with us a couple of weeks ago. Just checking in: are you still working on this? Any questions or comments?

rlundeen2 · 2025-03-28T22:29:20Z

@imranbohoran we're starting some related work so this will update! Design docs just don't work well as a github issue. I am happy to share if/when you have time, but ETA about a month or so as we chip away at different scenarios (related to above and including some other stuff).

For now, I'm assigning this to myself, but there are a lot of tasks here, so message us on Discord to coordinate if you want to take any.

rlundeen2 mentioned this issue Feb 19, 2025

FEAT: Improve prompts reference to objectives for offline results analysis #724

Closed

romanlutz changed the title ~~FEAT: Add Objectives to converstations~~ FEAT: Add Objectives to conversations Feb 20, 2025

rlundeen2 changed the title ~~FEAT: Add Objectives to conversations~~ FEAT: Add AttackConfiguration table, allowing objectives in memory Feb 24, 2025

romanlutz added the enhancement New feature or request label Feb 25, 2025

romanlutz assigned imranbohoran Feb 27, 2025

imranbohoran mentioned this issue Mar 7, 2025

[Proof of Concept] Introducing AttackConfigurations #766

Draft

rlundeen2 changed the title ~~FEAT: Add AttackConfiguration table, allowing objectives in memory~~ FEAT: Add AttackResults table Mar 11, 2025

rlundeen2 changed the title ~~FEAT: Add AttackResults table~~ FEAT: Add Attack table Mar 25, 2025

rlundeen2 changed the title ~~FEAT: Add Attack table~~ FEAT: Add AttackObjective table Mar 26, 2025

rlundeen2 self-assigned this Mar 28, 2025
