Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proof of Concept] Introducing AttackConfigurations #766

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

imranbohoran
Copy link

The scope of this PR

This PR has a single commit that attempts to demonstrate how the AttackConfiguration can be included and to get some feedback on the general approach. No tests have been added/updated at this point, and only the CreascendoOrchestrator and PromptSendingOrchestrator has been changed to help with the demonstration. These 2 orchestrators were chosen to give an end-to-end slice of the different variations of the attacks. This change doesn't try to address every aspect of the proposal and only looks at the introduction of the AttackConfiguration (i.e. removing orchestrator_identifier and labels is not included).

Changes

Following is a summary we've tried to capture of the main areas of change.

Changes to Memory models and operations

  • Memory models:
    • AttackConfigurationEntry
      • This is the new memory model for the attack configuration concept.
      • In addition to the fields proposed in the issue, the following fields were introduced:
        • attack_result: This is intended to capture the overall result of the attack. For multi-turn attacks, where there is an objective, this will indicate if the objective was achieved. For single-turn attacks, this will hold all the score values of the scorers used. Prior to this change, the result of multi-turn attacks are not persisted anywhere and having it stored as part of the AttackConfiguration gives us the ability to retrieve the result for reporting/off-line analysis purposes.
        • start_time and end_time: These were added to record the start and end times of an attack. This is something that is not available right now and by adding the AttackConfigurations gives us the possibility to capture the information in a sensible place.
    • PromptMemoryEntry
      • This now adds a new attack_configuration_id column. This has a foreign-key reference to the AttackConfigurations.
      • There is also a Mapped object for the AttackConfiguration that is lazily loaded. Given that this is a single object and is useful to view on the context of the PromptEntry, the mapping was created with a lazy configuration of "joined"
    • Memory Interface
      • Convenience functions to add and update the AttackConfigurations. These utilise the existing _insert_entry and _update_entries functions.

Model changes:

  • AttackConfiguration
    • Represents the application model of the AttackConfigurationEntry
  • PromptRequestPiece
    • Includes a AttackConfiguration reflect the model changes
  • SeedPrompt
    • Includes an AttackConfiguration that is used to pass relevant attack configuration per prompt. Given that single-turn attacks can have multiple prompts provided as part of the orchestrator, and given that these prompts can represent varying objectives, including the AttackConfiguration at the SeedPrompt level seemed a reasonable way to carry the AttackConfiguration when creating PromptEntries.

Orchestrator changes

  • MultiTurn Orchestrator
    • The run_attacks_async function wraps the execution of the run_attack_async (through the limited_run_attack function) with an AttackConfiguration creation and update step. This then allows all multi-turn orchestrators to be able to create and update the AttackConfiguration without changing their underlying implementation.
    • However, this poses an issue with the run_attack_async function that now expects an AttackConfiguraiton to be passed in. As an exposed API, I wouldn't think that's something a user should have to provide. Is there a need to have 2 functions to run attacks for multi-turn orchestrators or is this an opportunity only expose one (i.e. making run_attack_async the protected abstract function and making run_attacks_async the function to run attacks). Our observation has been that the run_attack_async function is only called within the MultiTurnOrchestrator, some tests and docs.
  • SingleTurn Orchestrator
    • Given that multiple prompts can be sent to this orchestrator, each prompt is considered an attack with its own objective.
    • An AttackConfiguration is created in the send_prompt_async function and is used when creating a NormalizerRequest (to include it in the SeedPrompt).
    • The AttackConfiguration is then passed through the SeedPrompt for when prompts are created (including system prompts)
    • The update to AttackConfiguration (to add the attack_result and end_time), is done in the send_normalizer_requests_async function (which seemed to be only used in the PromptSendingOrchestrator and some docs).
    • The reason for the creation and update to happen in 2 functions was due to the fact that this orchestrator can have multiple prompts been provided and the prompts are passed as a list to the upstream functions.

Scorer changes

  • Includes the AttackConfiguration in functions where prompts are created for scoring with an llm.

Normalizer changes

  • Accepts the AttackConfiguration when creating a NormalizerRequest in order populate the SeedPrompt.

Running 2 scripts using the changes in this PR results in the following data (screenshots of the table provided, if useful, we can share the duckdb file as well). The examples runs crescendo with 2 objectives and a PromptSendingOrchestrator with 2 prompts.

AttackConfiguration

Screenshot 2025-03-06 at 19 49 19

PromptEntries

Screenshot 2025-03-06 at 19 49 48

imranbohoran and others added 2 commits March 6, 2025 15:57
* Following the conversation at Azure#726
  the AttackConfiguration concept is introduced and tried out with a
  CrescendoOrchestrator and PromptSendingOrchestrator.
* The PromptEntry has a link to the AttackConfiguration and the SeedPrompt
  carries the AttackConfiguration through the function calls.
* This commit is only serving as a PoC of introducing the AttackConfiguration
  concept and the different touch points in the code base for it to work.
* As of this commit, using a Crescendo Orchestrator and PromptSendingOrchestrator
  will result in a populated AttackConfiguration and prompt entries linke to the same.
* This commit does not handle any tests and given that certain function arguments have
  been changed.

Co-authored-by: Ayomide Apantaku <[email protected]>
Copy link
Contributor

@romanlutz romanlutz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this was fast! Thanks for sharing the POC so that we can discuss in more detail. This is a great start. My comments are mainly about smaller details (passing ID vs object, where to use it vs. where not to). The bigger picture makes sense to me.

@@ -99,6 +100,7 @@ def __init__(
prompt_group_id: Optional[uuid.UUID] = None,
prompt_group_alias: Optional[str] = None,
sequence: Optional[int] = 0,
attack_configuration: Optional[AttackConfiguration] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why the AttackConfiguration would be linked on a seed prompt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it shouldn't be linked. A SeedPrompt is meant as a starting point for an attack.

SeedPrompts are often objectives. But not always. And either way, one SeedPrompt could be part of N AttackConfigurations. And an AttackConfiguration doesn't necessarily have a SeedPrompt. A lot of it doesn't necessarily make sense. What is the objective? What is the orchestrator? I don't think there should be a relationship between these two tables.

Right now you can find SeedPrompts based on the hash to the PromptRequestPieces. And those should definitely have an AttackObjective.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we ended up including the AttackConfiguration in the SeedPrompt was because single-turn attacks can have multiple prompts provided as part of the orchestrator, and given that these prompts can represent varying objectives it seemed reasonable to do so. However, we did so with some hesitation as well and agree that it's not perhaps the best place to include it. This was all based on the design of including the attack_configuration_id as part of the RequestResponsePiece (appreciate that @rlundeen2 had an alternative thought on this, our thoughts on that below).

Sticking with the first design (for now), for the purpose of this comment, would it be an option to introduce some container that carries the SeedPrompt and and AttackConfiguration (the names are as always just a suggestion):

class AttackPrompt():
  seed_prompt: SeedPrompt
  attack_configuration: AttackConfiguration

so that we have access to the attack_configuraiton_id at the point of creating a RequestResponsePiece from the SeedPrompt. This does mean that everywhere a SeedPromptGroup is been created we'll have to change it to include a list of AttackPrompts as oppose to a list of SeedPrompts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for explaining! What is the reason for needing them to be grouped together? The attack configuration is passed to the orchestrator (either as that object, or - as we currently do - individually as labels, etc.) and seed prompts are passed in as well. So the orchestrator has all those pieces. What is the reason for the AttackPrompt container? I feel like I'm missing something that you have thought of because you've actually tried this out 🙂

@@ -58,6 +59,7 @@ def __init__(
converted_value_sha256: Optional[str] = None,
id: Optional[uuid.UUID | str] = None,
conversation_id: Optional[str] = None,
attack_configuration: Optional[AttackConfiguration] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had somehow assumed that there may be an Attack object that groups together the pieces. This put the attack config object on every piece. Maybe the ID would be enough here, along with a method to retrieve the corresponding attack config object?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the database this should be an ID. But in the model I like it as an object like this. We ship PromptRequestPieces around everywhere and it's really nice to have all of this accessible. It's the same idea we have with Scores.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: in new design there might not be a reference here at all

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DB model would only store the attack_configuration_id (although it does have a Mapped object for read purposes). The purpose of passing the whole object here was so that any downstream function will have access to the object (somewhat future proofing) as oppose to the ID only. As @rlundeen2 stated, it's similar to the existing Scores.

Update: in new design there might not be a reference here at all

We've put some comments on the new design below.


class AttackConfiguration(abc.ABC):
"""
Represents an Attack Configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's why this is a bit confusing to me. We're putting together inputs and outputs. the conversation_objective is an input, labels as well. Others, like start_time, end_time, attack_result, are all outputs.

In the orchestrator classes, you've replaced the objective there with this object. But as a user I'd be confused. Should I set the ID? Orchestrator ID? Why is the result on this?

This seems like the object that we should get as a result of an attack, not use as input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, but a user never passes it (at least as I understand it). I'm still wondering if there is more we can do to detect whether the AttackConfiguration object is "populated". Perhaps an attack_completed property or method or something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current PR this is precisely what the user passes in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that having the start_time, end_time and attack_result as part of the AttackConfiguration might be confusing. They're not information passed by the user and don't necessarily represent a configuration. Instead what they do represent, is the execution of an Attack.

What are your thoughts, on having a separate entity for AttackExecution, which would have a reference to the AttackConfiguration and would have the start_time, end_time and result. And we write out an entry at the end of an attack (within the orchestrators).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot more sense to me. Perhaps to be a little nitpicky, I would just call that the AttackResult and add any metadata like start and end time into that.

It's perfectly fine to extend the existing MultiTurnAttackResult for that purpose.

from typing import Optional, Dict


class AttackConfiguration(abc.ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem abstract. Any reason for abc?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, this should not be an abstract class.

Comment on lines +133 to +147
for response_to_score in responses_to_be_scored:
if response_to_score.attack_configuration:
scored_responses = [score for score in scores if score.prompt_request_response_id == response_to_score.id]
attack_result = {}
for score in scored_responses:
attack_result[score.score_category] = score.score_value

fields_to_update = {
"attack_result": attack_result,
"end_time": datetime.now()
}
self._memory.update_attack_configuration(
attack_id=response_to_score.attack_configuration.id,
update_fields=fields_to_update
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little lost here. We got the scores above. What are we doing here? Shouldn't the score be the result?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this is that we wanted to capture the result per AttackConfiguration(and for each Prompt). Given that we only have a list of Score objects, we wanted to pick the appropriate Score for each prompt. And then we wanted to capture all the scorers and their results as part of that AttackConfiguration.
Given that single-turn orchestrators don't have an explicit concept of a objective, there isn't an objective_achieved property available. But with the score results representing the results for the prompt, we thought of adding the score_value for each scorer as part of the attack_result. We have made a suggestion to separate the AttackConfiguration from the results here

@@ -250,6 +252,7 @@ async def _score_value_with_llm(
system_prompt=system_prompt,
conversation_id=conversation_id,
orchestrator_identifier=orchestrator_identifier,
attack_configuration=attack_configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to figure out why we're passing it through all the scorers in dozens of places. Any idea? The orchestrator could just update the attack configuration object with the resulting scores, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the scorers also generate prompts (through the _score_value_with_llm function), in order capture the attack_configuration_id on the those prompts, we passed it into the Scorer. This way getting a trace of all prompts for an attack will including the prompts for scoring, will give the user insights into how the scoring took place.

For single-turn attacks, for each scorer add the value for the scorer
"""
labels: Mapped[dict[str, str]] = Column(JSON)
start_time = Column(DateTime, nullable=False) # The start time of the attack
Copy link
Contributor

@rlundeen2 rlundeen2 Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_time and end_time are okay as part of the object, but I don't think we should store them in the db. It'll be tricky to make sure they're always updated. I think we should get start_time and end_time based on the PromptRequestPieces that point to this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update; one option is we could change AttackConfiguration to AttackResult. In that case, we're not updating things and can just store start_time and end_time for easy access

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update; one option is we could change AttackConfiguration to AttackResult. In that case, we're not updating things and can just store start_time and end_time for easy access

Agreed on not updating a record, but having a append-only entity as per our comment here. In this suggestion we are proposing introducing a new entity to hold the result and the timings of an attack.

@@ -186,7 +189,25 @@ async def run_attacks_async(

async def limited_run_attack(objective):
async with semaphore:
return await self.run_attack_async(objective=objective, memory_labels=memory_labels)
updated_memory_labels = combine_dict(existing_dict=self._global_memory_labels, new_dict=memory_labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your strategy of doing it centrally here. But to make it work we'd need to update all the multi-turn orchestrators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update - I proposed a new slight variation of a new strategy where I think we can avoid updating all mutli-turn orchestrators (linking AttackStrategy only with conversation id)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update - I proposed a new slight variation of a new strategy where I think we can avoid updating all mutli-turn orchestrators (linking AttackStrategy only with conversation id)

When you say, AttackStrategy, did you mean to be AttackConfiguration?

Given that an attack can have multiple conversations and these conversation_ids are minted fairly deep in creating prompts, not sure if this solves the "passing around AttackConfigurations down the stack" and having to update all orchestrators.

@@ -147,13 +150,13 @@ def _get_adversarial_chat_seed_prompt(self, seed_prompt):

@abstractmethod
async def run_attack_async(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to update this to _run_attack_async to delineate this is now private

Copy link
Contributor

@rlundeen2 rlundeen2 Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, with the new design I proposed in another comment, I don't THINK we'll need to pass attack_configuration and can continue to use objective. I like this because updates will be easier and most multi-turn orchestrators won't need changes. However, I still think this should be private, because you've centralized creating the AttackConfiguration/AttackResult in the other method (which I think is a good move)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that making this private and only having one function for the user to consume would be more clearer.

*,
id: Optional[uuid.UUID | str] = None,
orchestrator_identifier: Optional[Dict[str, str]] = None,
conversation_objective: Optional[str] = None,
Copy link
Contributor

@rlundeen2 rlundeen2 Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that may change our db structure from this PR slightly. This is probably my biggest general feedback. When I reference "new design" in other comments this is what I'm talking about.

The more I think about this, the more I think an AttackConfiguration is not associated with a PromptRequestPiece as much as it is with a conversation.

What if we

  • Removed the foreign key relationship in the db, removed AttackConfigurationId from PromptRequestPiece
  • Added a conversation_id to AttackConfiguration
  • Created a new "Conversation" object that is both a collection of PromptRequestResponses AND an AttackConfiguration. This is not in the DB, just a container for these other objects.
  • Update memory_interfaces to return this in get_conversation - which we should be able to retrieve because in those methods we have the conversation_id
  • Change this from AttackConfiguration to AttackResult and is similar in structure to MultiTurnAttackResult. It's also only updated/inserted at the end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another thing with this design is we can potentially avoid passing around an AttackConfiguration to the run_attack methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But doesn't an attack often include multiple conversations? One per target usually, sometimes way more (think: TAP)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With an attack having more that one conversation, we are not sure how we'd achieve to get all the prompts for an attack. Where do you see the Conversation object being created in this approach. From what we've seen conversation_ids get minted in different places (even against the same target), resulting in multiple conversation_ids in the DB. For instance in a Crescendo attack, we only have access to the last conversation against the objective_target, and thus loose trace of all other prompts that were attempted within the attack.

"attack_result": {
"objective_achieved": result.achieved_objective
},
"end_time": datetime.now()
Copy link
Contributor

@rlundeen2 rlundeen2 Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new design, you should be able to set AttackConfiguration/AttackResult only at the end based on the result. And another thing to consider, if we're only adding an AttackConfiguration/AttackResult to the DB after an attack, it may make sense to enforce all the fields being present.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We made some comments on the new design. However, we've also made a suggestion on breaking up the AttackConfiguration and results (as an AttackExecution DB model) here

)
))

responses_to_be_scored = [piece for piece in response_pieces if piece.role == "assistant"]
Copy link
Contributor

@rlundeen2 rlundeen2 Mar 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might tackle just multi_turn_orchestrators in the first PR as a step because I think that can work end to end without updating single turn.

So I recommend just not messing with this yet.

But that said, IMO we should make a couple updates so single_turn returns results in a similar way, and changes the constructor to accept "objectives". And I would still remove some of this and try to use scorer.score_responses_inferring_tasks_batch_async .

We probably also want to segregate scorers so we have an "objective_achieved" scorer that may be different from our general scorers. and use these results and some criteria to see if the objective was achieved. And that has its own complexities. I think it might be a combination of scores, etc.

I think there are a lot of nuances here, which is why I think we should punt and get multi-turn working first

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We in fact first did the PoC for a multi-turn orchestrator (crescendo), and then realised that there might be some nuances (as you point out) when fitting a single-turn orchestrator to the same model. Hence we tried to see how things would work for both. Agree that we possibly don't need to implement both together, but as a design we probably want to arrive at something that works for both(?)

@@ -324,6 +325,7 @@ async def _build_prompt_request_response(
prompt_metadata=seed_prompt.metadata,
prompt_target_identifier=target.get_identifier(),
orchestrator_identifier=orchestrator_identifier,
attack_configuration=seed_prompt.attack_configuration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably can remove this if we're building based on the conversation vs prompt piece

@@ -28,6 +29,7 @@ def set_system_prompt(
system_prompt: str,
conversation_id: str,
orchestrator_identifier: Optional[dict[str, str]] = None,
attack_configuration: AttackConfiguration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely should not be part of the target

Copy link
Contributor

@rlundeen2 rlundeen2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good and really clear what you're tackling! Great work! You had some really good ideas I wouldn't have thought of!

This is our first external "framework" change so it's been tricky to design since we usually do those in design meetings, so I know some of my own feedback has shifted as I've seen the implementation. Sorry about that - if I would have thought of it sooner I would have recommended it to begin with :)

This is a tough change and you're doing great. AND I want us to be nit-picky and precise about it because it's one of those changes we're going to build on, so it may take several iterations. Please tag me if you have any questions or please feel free to chat in real time in Discord. Thank you again for tackling. This is a really important change.

@imranbohoran
Copy link
Author

Thanks for all of your comments. We've enjoyed going through each of them (me and a couple of my colleagues have been looking at it as a collective).

To perhaps restate the usecase that we considered from this change, is to find a way to perform an analysis of attacks after they've been executed. And thus looking to access all prompts that participated in an Attack along with their objective (a trace of all prompts against all targets in the attack). This includes prompts for the objective, system prompts and prompts for scoring.
With this in mind, we felt capturing the overall result of an attack is useful. And hence captured it in the AttackConfiguration. Thinking though this more, inspired by your comments so far, it does feel like the result of an Attack should probably be stored separately. We've responded to the comments in-line with this thinking.

Considering @rlundeen2's new proposal, we are unsure how this will address the usecase given that there are are many conversations that take place during an attack. We've put our thoughts for this in the relevant comments as well.

We really appreciate your engagement on this with us and appreciate that this is a relatively big task for an external contribution. Therefore we are happy to have a back-and-forth on the design and any nit-picking as well. If there's a benefit of a synchronous design discussion, we are happy to join that as well (if that's something you think is possible). Or if fleshing out the initial design of this is more suitable in discord, we are happy to do that as well.

@romanlutz
Copy link
Contributor

Thanks @imranbohoran ! IMO it wouldn't hurt to talk this through over a call. I'll check with @rlundeen2 tomorrow on that. Feel free to email me at my GH username (at) microsoft (dot) com

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants