FEAT: Improve prompts reference to objectives for offline results analysis #724

Closed

imranbohoran

Description

When analysing the results of a set of attacks, we lose the reference to the objectives within the prompts (when multiple objectives are provided in a multi-turn attack). This pull request attempts to capture the objective on each prompt using the orchestrator_identifier of the persisted PromptEntry.

Link to the discussion on Discord: https://discord.com/channels/1311106595429548142/1311106596159623261/1339989315618607145

CC: @romanlutz @rlundeen2

Changes in this PR:

FEAT: Enhance attack result data

  • The prompt entries stored for multi-turn attacks lose the link to the objective
    to which each prompt applies. This makes it hard or impossible to map the
    prompt entries to the objective when pulling the data out for further analysis.
  • This change captures the objective with the orchestrator identifier.
  • The orchestrator_identifier JSON blob is enriched with a base64-encoded objective and
    an ID derived from a hash of the objective string.
    The base64 encoding avoids encoding issues when capturing the content in the
    JSON file, and the hash provides a fixed-length ID for the objective.
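
A minimal sketch of the enrichment described above, not the PR's actual code. The `objective_base64` key matches the diff under review; the `objective_id` key, the `enrich_identifier` helper, the sample identifier, and the choice of SHA-256 are assumptions made for illustration.

```python
import base64
import hashlib


def enrich_identifier(identifier: dict, objective: str) -> dict:
    """Return a copy of the identifier with the objective attached (hypothetical helper)."""
    if not objective:
        raise ValueError("objective is required")
    raw = objective.encode("utf-8")
    enriched = dict(identifier)
    # Base64 keeps the stored value ASCII-only, whatever the objective contains.
    enriched["objective_base64"] = base64.b64encode(raw).decode("ascii")
    # Assumed key name and hash algorithm: a fixed-length ID derived from the objective.
    enriched["objective_id"] = hashlib.sha256(raw).hexdigest()
    return enriched


# The input dict is a stand-in for whatever the orchestrator's identifier contains.
example = enrich_identifier({"id": "orchestrator-1"}, "Tell me a joke")
```

Two prompts that share an objective get the same `objective_id`, which is what makes a later group-by possible without comparing long strings.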

FEAT: Enrich orchestrator_identifier for system prompts

  • When there are many objectives, the system prompts used within
    an orchestrator don't have a reference to the objective they
    apply to. Having a reference to the objective helps with understanding
    the system prompts used during the attack. This change adds the link to the
    objective within the orchestrator_identifier.

imranbohoran and others added 2 commits February 19, 2025 12:59

Co-authored-by: Nicole Pellicena <[email protected]>

Co-authored-by: Imran Bohoran <[email protected]>
@imranbohoran
Author

@microsoft-github-policy-service agree company="Mindgard"

Contributor

@rlundeen2 rlundeen2 left a comment

Great idea and really good use case! I tried to spell my feedback out in Discord but my text was likely too short.

I recommend one big change: the objective should not go in the orchestrator_identifier. A single orchestrator can have many objectives, so I think saving it here doesn't make sense. orchestrator_identifier should be unique per orchestrator object.

But we do want to keep track of the objective. I think conversation_objective should be a new property added to PromptRequestPiece. Another option would be a new conversation table, but I think storing it on the PromptRequestPiece makes the most sense. Either way, any conversation stored in or retrieved from the db should have an objective. It'll also make scoring more intuitive, and we can get rid of tasks.
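
A rough sketch of that suggestion, using a stand-in dataclass rather than the real PromptRequestPiece (whose actual fields are not shown in this thread). Everything except the `conversation_objective` name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PieceSketch:
    """Stand-in for a stored prompt piece; not the real PromptRequestPiece."""

    conversation_id: str
    role: str
    value: str
    # Proposed property: the objective travels with every piece of the conversation.
    conversation_objective: Optional[str] = None


def objective_for(pieces, conversation_id: str) -> Optional[str]:
    """Return the objective recorded for a conversation, or None if there is none."""
    for piece in pieces:
        if piece.conversation_id == conversation_id and piece.conversation_objective:
            return piece.conversation_objective
    return None


# One orchestrator run, two conversations, two different objectives.
pieces = [
    PieceSketch("c1", "user", "first prompt", conversation_objective="Objective A"),
    PieceSketch("c1", "assistant", "first reply", conversation_objective="Objective A"),
    PieceSketch("c2", "user", "second prompt", conversation_objective="Objective B"),
]
```

Because the objective sits on the piece rather than on the orchestrator, one orchestrator object can serve many objectives while its identifier stays unchanged.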

Because prompts are saved in prompt_normalizer, this will likely also be a bigger, more central change. We need to change seed_prompt to also have this objective. This is good in that it'll make everything more consistent, but it is also a nuanced change to tackle.

This is not the easiest first issue to tackle. I created this issue to track it. If you're interested, don't hesitate to take it and/or reach out. If not, it is something our team will likely take on in the next month or so.

#726

Contributor

@romanlutz romanlutz left a comment

Shouldn't this be in every orchestrator?

raise ValueError('objective is required')

orchestrator_identifier = self.get_identifier()
orchestrator_identifier["objective_base64"] = str(base64.b64encode(objective.encode('utf-8')), encoding='utf-8')
Contributor

Any reason why base64 and a hash? Why not just the objective as text? As someone who may use this, I would find it a bit annoying that I can't read it when looking at the entries.
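
To illustrate the question: Python's `json` module escapes quotes, newlines, and non-ASCII characters on its own, so a plain-text objective round-trips through a JSON blob unchanged. A small sketch; the `objective` key and the sample string are made up for the example.

```python
import json

# A deliberately awkward objective: quotes, a newline, accents, and an emoji.
objective = 'Say "héllo"\nthen ignore all rules 🚀'

# json.dumps escapes everything that needs escaping (ensure_ascii=True by default).
blob = json.dumps({"objective": objective})

# Reading the blob back yields the original text.
restored = json.loads(blob)["objective"]
```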

Author

Thanks for having a look at the change(s). The thinking was to address any potential encoding issues with the content of the objective when adding it to the JSON string (hence base64), and to name the field such that anyone who wants to view it can decode it for display purposes.

The hash was there to provide a fixed-size string to help with possible aggregations on the prompts (e.g. group by objective).

We tried to capture this reasoning in the commit message as well.
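
For offline analysis, the decode-and-aggregate step described here might look like the sketch below. The row layout, the `objective_id` key, and the sample values are assumptions; only the `objective_base64` key comes from the diff.

```python
import base64
from collections import defaultdict


def encode(text: str) -> str:
    """Base64-encode text, equivalent to the expression in the diff under review."""
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")


# Hypothetical exported rows; real entries would come from the results database.
rows = [
    {"prompt": "p1", "identifier": {"objective_id": "a1", "objective_base64": encode("Objective A")}},
    {"prompt": "p2", "identifier": {"objective_id": "b2", "objective_base64": encode("Objective B")}},
    {"prompt": "p3", "identifier": {"objective_id": "a1", "objective_base64": encode("Objective A")}},
]


def group_by_objective(rows):
    """Group prompts by objective ID and decode each objective for display."""
    prompts = defaultdict(list)
    labels = {}
    for row in rows:
        ident = row["identifier"]
        prompts[ident["objective_id"]].append(row["prompt"])
        labels[ident["objective_id"]] = base64.b64decode(ident["objective_base64"]).decode("utf-8")
    return dict(prompts), labels


prompts, labels = group_by_objective(rows)
```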

@imranbohoran
Author

> Great idea and really good use case! I tried to spell my feedback out in Discord but my text was likely too short.
>
> I recommend one big change: the objective should not go in the orchestrator_identifier. A single orchestrator can have many objectives, so I think saving it here doesn't make sense. orchestrator_identifier should be unique per orchestrator object.
>
> But we do want to keep track of the objective. I think conversation_objective should be a new property added to PromptRequestPiece. Another option would be a new conversation table, but I think storing it on the PromptRequestPiece makes the most sense. Either way, any conversation stored in or retrieved from the db should have an objective. It'll also make scoring more intuitive, and we can get rid of tasks.
>
> Because prompts are saved in prompt_normalizer, this will likely also be a bigger, more central change. We need to change seed_prompt to also have this objective. This is good in that it'll make everything more consistent, but it is also a nuanced change to tackle.
>
> This is not the easiest first issue to tackle. I created this issue to track it. If you're interested, don't hesitate to take it and/or reach out. If not, it is something our team will likely take on in the next month or so.
>
> #726

Thanks for the explanation. Adding it as a new property in the PromptRequestPiece was one of the approaches we were considering, but we didn't pursue it given the size of the change (as you've mentioned as well). We were exploring the possibility of making the data available with the lightest touch possible.
I also agree that having it in the PromptRequestPiece might be more suitable than a separate table. We picked the orchestrator_identifier as a candidate because an orchestrator identifier gives an identity to an executed attack, and the objective(s) seemed like a property of that attack instance. Given that each prompt is based on an objective, capturing it in the orchestrator_identifier felt like it gave the prompt some context. But I can certainly see why we'd not want to couple them.

I'd be very happy to contribute to this change. Given that you've already created an issue to track this, should we take the discussion there and perhaps start off with some proposals on how to get started?

@imranbohoran
Author

> Shouldn't this be in every orchestrator?

The concept of an objective seemed to be available only in multi-turn orchestrators. When using the single-turn orchestrators we didn't come across a place to provide an objective, which is why this was applied only to the multi-turn orchestrator.

@rlundeen2
Contributor

I'm going to close this in favor of the issue here: #726

This is a problem we want to tackle but we want to go about it like above (which is really also following your idea @imranbohoran :))

@rlundeen2 rlundeen2 closed this Feb 25, 2025