-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Proof of Concept] Introducing AttackConfigurations #766
base: main
Are you sure you want to change the base?
Conversation
* Following the conversation at Azure#726 the AttackConfiguration concept is introduced and tried out with a CrescendoOrchestrator and PromptSendingOrchestrator. * The PromptEntry has a link to the AttackConfiguration and the SeedPrompt carries the AttackConfiguration through the function calls. * This commit is only serving as a PoC of introducing the AttackConfiguration concept and the different touch points in the code base for it to work. * As of this commit, using a Crescendo Orchestrator and PromptSendingOrchestrator will result in a populated AttackConfiguration and prompt entries linke to the same. * This commit does not handle any tests and given that certain function arguments have been changed. Co-authored-by: Ayomide Apantaku <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow this was fast! Thanks for sharing the POC so that we can discuss in more detail. This is a great start. My comments are mainly about smaller details (passing ID vs object, where to use it vs. where not to). The bigger picture makes sense to me.
@@ -99,6 +100,7 @@ def __init__( | |||
prompt_group_id: Optional[uuid.UUID] = None, | |||
prompt_group_alias: Optional[str] = None, | |||
sequence: Optional[int] = 0, | |||
attack_configuration: Optional[AttackConfiguration] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on why the AttackConfiguration would be linked on a seed prompt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it shouldn't be linked. A SeedPrompt is meant as a starting point for an attack.
SeedPrompts are often objectives. But not always. And either way, one SeedPrompt could be part of N AttackConfigurations. And an AttackConfiguration doesn't necessarily have a SeedPrompt. A lot of it doesn't necessarily make sense. What is the objective? What is the orchestrator? I don't think there should be a relationship between these two tables.
Right now you can find SeedPrompts based on the hash to the PromptRequestPieces. And those should definitely have an AttackObjective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we ended up including the AttackConfiguration in the SeedPrompt was because single-turn attacks can have multiple prompts provided as part of the orchestrator, and given that these prompts can represent varying objectives it seemed reasonable to do so. However, we did so with some hesitation as well and agree that it's not perhaps the best place to include it. This was all based on the design of including the attack_configuration_id
as part of the RequestResponsePiece (appreciate that @rlundeen2 had an alternative thought on this, our thoughts on that below).
Sticking with the first design (for now), for the purpose of this comment, would it be an option to introduce some container that carries the SeedPrompt and and AttackConfiguration (the names are as always just a suggestion):
class AttackPrompt():
seed_prompt: SeedPrompt
attack_configuration: AttackConfiguration
so that we have access to the attack_configuraiton_id
at the point of creating a RequestResponsePiece from the SeedPrompt. This does mean that everywhere a SeedPromptGroup is been created we'll have to change it to include a list of AttackPrompt
s as oppose to a list of SeedPrompt
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for explaining! What is the reason for needing them to be grouped together? The attack configuration is passed to the orchestrator (either as that object, or - as we currently do - individually as labels, etc.) and seed prompts are passed in as well. So the orchestrator has all those pieces. What is the reason for the AttackPrompt container? I feel like I'm missing something that you have thought of because you've actually tried this out 🙂
@@ -58,6 +59,7 @@ def __init__( | |||
converted_value_sha256: Optional[str] = None, | |||
id: Optional[uuid.UUID | str] = None, | |||
conversation_id: Optional[str] = None, | |||
attack_configuration: Optional[AttackConfiguration] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had somehow assumed that there may be an Attack object that groups together the pieces. This put the attack config object on every piece. Maybe the ID would be enough here, along with a method to retrieve the corresponding attack config object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the database this should be an ID. But in the model I like it as an object like this. We ship PromptRequestPieces around everywhere and it's really nice to have all of this accessible. It's the same idea we have with Scores.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: in new design there might not be a reference here at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DB model would only store the attack_configuration_id
(although it does have a Mapped object for read purposes). The purpose of passing the whole object here was so that any downstream function will have access to the object (somewhat future proofing) as oppose to the ID only. As @rlundeen2 stated, it's similar to the existing Scores.
Update: in new design there might not be a reference here at all
We've put some comments on the new design below.
|
||
class AttackConfiguration(abc.ABC): | ||
""" | ||
Represents an Attack Configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's why this is a bit confusing to me. We're putting together inputs and outputs. the conversation_objective is an input, labels as well. Others, like start_time, end_time, attack_result, are all outputs.
In the orchestrator classes, you've replaced the objective there with this object. But as a user I'd be confused. Should I set the ID? Orchestrator ID? Why is the result on this?
This seems like the object that we should get as a result of an attack, not use as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, but a user never passes it (at least as I understand it). I'm still wondering if there is more we can do to detect whether the AttackConfiguration object is "populated". Perhaps an attack_completed property or method or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current PR this is precisely what the user passes in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that having the start_time
, end_time
and attack_result
as part of the AttackConfiguration might be confusing. They're not information passed by the user and don't necessarily represent a configuration. Instead what they do represent, is the execution of an Attack.
What are your thoughts, on having a separate entity for AttackExecution
, which would have a reference to the AttackConfiguration and would have the start_time
, end_time
and result
. And we write out an entry at the end of an attack (within the orchestrators).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes a lot more sense to me. Perhaps to be a little nitpicky, I would just call that the AttackResult and add any metadata like start and end time into that.
It's perfectly fine to extend the existing MultiTurnAttackResult for that purpose.
from typing import Optional, Dict | ||
|
||
|
||
class AttackConfiguration(abc.ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't seem abstract. Any reason for abc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, this should not be an abstract class.
for response_to_score in responses_to_be_scored: | ||
if response_to_score.attack_configuration: | ||
scored_responses = [score for score in scores if score.prompt_request_response_id == response_to_score.id] | ||
attack_result = {} | ||
for score in scored_responses: | ||
attack_result[score.score_category] = score.score_value | ||
|
||
fields_to_update = { | ||
"attack_result": attack_result, | ||
"end_time": datetime.now() | ||
} | ||
self._memory.update_attack_configuration( | ||
attack_id=response_to_score.attack_configuration.id, | ||
update_fields=fields_to_update | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little lost here. We got the scores above. What are we doing here? Shouldn't the score be the result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this is that we wanted to capture the result per AttackConfiguration(and for each Prompt). Given that we only have a list of Score
objects, we wanted to pick the appropriate Score
for each prompt. And then we wanted to capture all the scorers and their results as part of that AttackConfiguration.
Given that single-turn orchestrators don't have an explicit concept of a objective, there isn't an objective_achieved
property available. But with the score results representing the results for the prompt, we thought of adding the score_value
for each scorer as part of the attack_result
. We have made a suggestion to separate the AttackConfiguration from the results here
@@ -250,6 +252,7 @@ async def _score_value_with_llm( | |||
system_prompt=system_prompt, | |||
conversation_id=conversation_id, | |||
orchestrator_identifier=orchestrator_identifier, | |||
attack_configuration=attack_configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't seem to figure out why we're passing it through all the scorers in dozens of places. Any idea? The orchestrator could just update the attack configuration object with the resulting scores, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the scorers also generate prompts (through the _score_value_with_llm
function), in order capture the attack_configuration_id
on the those prompts, we passed it into the Scorer. This way getting a trace of all prompts for an attack will including the prompts for scoring, will give the user insights into how the scoring took place.
For single-turn attacks, for each scorer add the value for the scorer | ||
""" | ||
labels: Mapped[dict[str, str]] = Column(JSON) | ||
start_time = Column(DateTime, nullable=False) # The start time of the attack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_time and end_time are okay as part of the object, but I don't think we should store them in the db. It'll be tricky to make sure they're always updated. I think we should get start_time and end_time based on the PromptRequestPieces that point to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update; one option is we could change AttackConfiguration
to AttackResult
. In that case, we're not updating things and can just store start_time and end_time for easy access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update; one option is we could change
AttackConfiguration
toAttackResult
. In that case, we're not updating things and can just store start_time and end_time for easy access
Agreed on not updating a record, but having a append-only entity as per our comment here. In this suggestion we are proposing introducing a new entity to hold the result and the timings of an attack.
@@ -186,7 +189,25 @@ async def run_attacks_async( | |||
|
|||
async def limited_run_attack(objective): | |||
async with semaphore: | |||
return await self.run_attack_async(objective=objective, memory_labels=memory_labels) | |||
updated_memory_labels = combine_dict(existing_dict=self._global_memory_labels, new_dict=memory_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your strategy of doing it centrally here. But to make it work we'd need to update all the multi-turn orchestrators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update - I proposed a new slight variation of a new strategy where I think we can avoid updating all mutli-turn orchestrators (linking AttackStrategy only with conversation id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update - I proposed a new slight variation of a new strategy where I think we can avoid updating all mutli-turn orchestrators (linking AttackStrategy only with conversation id)
When you say, AttackStrategy, did you mean to be AttackConfiguration?
Given that an attack can have multiple conversations and these conversation_id
s are minted fairly deep in creating prompts, not sure if this solves the "passing around AttackConfigurations down the stack" and having to update all orchestrators.
@@ -147,13 +150,13 @@ def _get_adversarial_chat_seed_prompt(self, seed_prompt): | |||
|
|||
@abstractmethod | |||
async def run_attack_async( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to update this to _run_attack_async to delineate this is now private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, with the new design I proposed in another comment, I don't THINK we'll need to pass attack_configuration and can continue to use objective. I like this because updates will be easier and most multi-turn orchestrators won't need changes. However, I still think this should be private, because you've centralized creating the AttackConfiguration/AttackResult in the other method (which I think is a good move)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that making this private and only having one function for the user to consume would be more clearer.
*, | ||
id: Optional[uuid.UUID | str] = None, | ||
orchestrator_identifier: Optional[Dict[str, str]] = None, | ||
conversation_objective: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that may change our db structure from this PR slightly. This is probably my biggest general feedback. When I reference "new design" in other comments this is what I'm talking about.
The more I think about this, the more I think an AttackConfiguration is not associated with a PromptRequestPiece as much as it is with a conversation.
What if we
- Removed the foreign key relationship in the db, removed AttackConfigurationId from PromptRequestPiece
- Added a conversation_id to AttackConfiguration
- Created a new "Conversation" object that is both a collection of PromptRequestResponses AND an AttackConfiguration. This is not in the DB, just a container for these other objects.
- Update memory_interfaces to return this in get_conversation - which we should be able to retrieve because in those methods we have the conversation_id
- Change this from
AttackConfiguration
toAttackResult
and is similar in structure toMultiTurnAttackResult
. It's also only updated/inserted at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And another thing with this design is we can potentially avoid passing around an AttackConfiguration to the run_attack methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But doesn't an attack often include multiple conversations? One per target usually, sometimes way more (think: TAP)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With an attack having more that one conversation, we are not sure how we'd achieve to get all the prompts for an attack. Where do you see the Conversation
object being created in this approach. From what we've seen conversation_ids get minted in different places (even against the same target), resulting in multiple conversation_ids in the DB. For instance in a Crescendo attack, we only have access to the last conversation against the objective_target, and thus loose trace of all other prompts that were attempted within the attack.
"attack_result": { | ||
"objective_achieved": result.achieved_objective | ||
}, | ||
"end_time": datetime.now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new design, you should be able to set AttackConfiguration/AttackResult only at the end based on the result. And another thing to consider, if we're only adding an AttackConfiguration/AttackResult to the DB after an attack, it may make sense to enforce all the fields being present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We made some comments on the new design. However, we've also made a suggestion on breaking up the AttackConfiguration and results (as an AttackExecution DB model) here
) | ||
)) | ||
|
||
responses_to_be_scored = [piece for piece in response_pieces if piece.role == "assistant"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might tackle just multi_turn_orchestrators in the first PR as a step because I think that can work end to end without updating single turn.
So I recommend just not messing with this yet.
But that said, IMO we should make a couple updates so single_turn returns results in a similar way, and changes the constructor to accept "objectives". And I would still remove some of this and try to use scorer.score_responses_inferring_tasks_batch_async
.
We probably also want to segregate scorers so we have an "objective_achieved" scorer that may be different from our general scorers. and use these results and some criteria to see if the objective was achieved. And that has its own complexities. I think it might be a combination of scores, etc.
I think there are a lot of nuances here, which is why I think we should punt and get multi-turn working first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We in fact first did the PoC for a multi-turn orchestrator (crescendo), and then realised that there might be some nuances (as you point out) when fitting a single-turn orchestrator to the same model. Hence we tried to see how things would work for both. Agree that we possibly don't need to implement both together, but as a design we probably want to arrive at something that works for both(?)
@@ -324,6 +325,7 @@ async def _build_prompt_request_response( | |||
prompt_metadata=seed_prompt.metadata, | |||
prompt_target_identifier=target.get_identifier(), | |||
orchestrator_identifier=orchestrator_identifier, | |||
attack_configuration=seed_prompt.attack_configuration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably can remove this if we're building based on the conversation vs prompt piece
@@ -28,6 +29,7 @@ def set_system_prompt( | |||
system_prompt: str, | |||
conversation_id: str, | |||
orchestrator_identifier: Optional[dict[str, str]] = None, | |||
attack_configuration: AttackConfiguration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This likely should not be part of the target
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really good and really clear what you're tackling! Great work! You had some really good ideas I wouldn't have thought of!
This is our first external "framework" change so it's been tricky to design since we usually do those in design meetings, so I know some of my own feedback has shifted as I've seen the implementation. Sorry about that - if I would have thought of it sooner I would have recommended it to begin with :)
This is a tough change and you're doing great. AND I want us to be nit-picky and precise about it because it's one of those changes we're going to build on, so it may take several iterations. Please tag me if you have any questions or please feel free to chat in real time in Discord. Thank you again for tackling. This is a really important change.
Thanks for all of your comments. We've enjoyed going through each of them (me and a couple of my colleagues have been looking at it as a collective). To perhaps restate the usecase that we considered from this change, is to find a way to perform an analysis of attacks after they've been executed. And thus looking to access all prompts that participated in an Attack along with their objective (a trace of all prompts against all targets in the attack). This includes prompts for the objective, system prompts and prompts for scoring. Considering @rlundeen2's new proposal, we are unsure how this will address the usecase given that there are are many conversations that take place during an attack. We've put our thoughts for this in the relevant comments as well. We really appreciate your engagement on this with us and appreciate that this is a relatively big task for an external contribution. Therefore we are happy to have a back-and-forth on the design and any nit-picking as well. If there's a benefit of a synchronous design discussion, we are happy to join that as well (if that's something you think is possible). Or if fleshing out the initial design of this is more suitable in discord, we are happy to do that as well. |
Thanks @imranbohoran ! IMO it wouldn't hurt to talk this through over a call. I'll check with @rlundeen2 tomorrow on that. Feel free to email me at my GH username (at) microsoft (dot) com |
The scope of this PR
This PR has a single commit that attempts to demonstrate how the AttackConfiguration can be included and to get some feedback on the general approach. No tests have been added/updated at this point, and only the CreascendoOrchestrator and PromptSendingOrchestrator has been changed to help with the demonstration. These 2 orchestrators were chosen to give an end-to-end slice of the different variations of the attacks. This change doesn't try to address every aspect of the proposal and only looks at the introduction of the AttackConfiguration (i.e. removing
orchestrator_identifier
andlabels
is not included).Changes
Following is a summary we've tried to capture of the main areas of change.
Changes to Memory models and operations
attack_result
: This is intended to capture the overall result of the attack. For multi-turn attacks, where there is an objective, this will indicate if the objective was achieved. For single-turn attacks, this will hold all the score values of the scorers used. Prior to this change, the result of multi-turn attacks are not persisted anywhere and having it stored as part of the AttackConfiguration gives us the ability to retrieve the result for reporting/off-line analysis purposes.start_time
andend_time
: These were added to record the start and end times of an attack. This is something that is not available right now and by adding the AttackConfigurations gives us the possibility to capture the information in a sensible place.attack_configuration_id
column. This has a foreign-key reference to the AttackConfigurations._insert_entry
and_update_entries
functions.Model changes:
SeedPrompt
level seemed a reasonable way to carry the AttackConfiguration when creating PromptEntries.Orchestrator changes
run_attacks_async
function wraps the execution of therun_attack_async
(through thelimited_run_attack
function) with an AttackConfiguration creation and update step. This then allows all multi-turn orchestrators to be able to create and update the AttackConfiguration without changing their underlying implementation.run_attack_async
function that now expects anAttackConfiguraiton
to be passed in. As an exposed API, I wouldn't think that's something a user should have to provide. Is there a need to have 2 functions to run attacks for multi-turn orchestrators or is this an opportunity only expose one (i.e. makingrun_attack_async
the protected abstract function and makingrun_attacks_async
the function to run attacks). Our observation has been that therun_attack_async
function is only called within the MultiTurnOrchestrator, some tests and docs.send_prompt_async
function and is used when creating aNormalizerRequest
(to include it in the SeedPrompt).attack_result
andend_time
), is done in thesend_normalizer_requests_async
function (which seemed to be only used in the PromptSendingOrchestrator and some docs).Scorer changes
Normalizer changes
NormalizerRequest
in order populate theSeedPrompt
.Running 2 scripts using the changes in this PR results in the following data (screenshots of the table provided, if useful, we can share the duckdb file as well). The examples runs crescendo with 2 objectives and a PromptSendingOrchestrator with 2 prompts.
AttackConfiguration
PromptEntries