---
layout: page
title: <img src="assets/img/big_logo.svg">
description: This is the landing and main page of <span class="enigma">EnIGMA</span>
hide_description: true
sitemap: false
---
<style type="text/css"> .no-zebra-table td{ background-color: var(--gray-bg) !important; } /* Doesn't work because of colspan */ /* #leaderboard-table tr > td:nth-child(3) { text-align: end !important; } */ tr.separator-row { border-bottom: 2px solid var(--border-color) !important; } td.top-align { vertical-align: top; } .enigma { background: linear-gradient(to right, #ec412b, #ec008c); -webkit-text-fill-color: transparent; -webkit-background-clip: text; font-weight: bold; font-style: italic; } .label-date { font-size: 0.8em; padding: 0.2em 0.6em; color: white; background-color: var(--grey); border-radius: 0.5em; text-align: center; } </style>

Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

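Talor Abramovich<sup>1</sup>, Meet Udeshi<sup>2</sup>, Minghao Shao<sup>2</sup>, Kilian Lieret<sup>3</sup>, Haoran Xi<sup>2</sup>, Kimberly Milner<sup>2</sup>, Sofija Jancheska<sup>2</sup>, John Yang<sup>4</sup>, Carlos E. Jimenez<sup>3</sup>, Farshad Khorrami<sup>2</sup>, Prashanth Krishnamurthy<sup>2</sup>, Brendan Dolan-Gavitt<sup>2</sup>, Muhammad Shafique<sup>5</sup>, Karthik Narasimhan<sup>3</sup>, Ramesh Karri<sup>2</sup>, and Ofir Press<sup>3</sup>
{:.lead}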

Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, InterCode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment.

Want to try it yourself and explore our new agent? We are completely open-source! You can try it out in the SWE-agent repository, read our documentation, and learn more about the research in our paper. Please use SWE-agent 0.7 while we update EnIGMA for 1.0.
{:.note title="Try It Out!"}

Results

| Benchmark | Model | % Solved | Date | Trajectories |
|---|---|---|---|---|
| NYU CTF | EnIGMA w/ Claude 3.5 Sonnet | 13.5 | 2024-09-24 | |
| | EnIGMA w/ GPT-4 Turbo (1106) | 7.0 | 2024-09-24 | |
| | EnIGMA w/ GPT-4o | 9.0 | 2024-09-24 | |
| | NYU CTF agent w/ GPT-4 Turbo | 4.0 | 2024-08-21 | |
| InterCode-CTF | EnIGMA w/ Claude 3.5 Sonnet | 67.0 | 2024-09-24 | |
| | EnIGMA w/ GPT-4 Turbo (1106) | 72.0 | 2024-09-24 | |
| | EnIGMA w/ GPT-4o | 69.0 | 2024-09-24 | |
| | InterCode-CTF Agent | 40.0 | 2023-11-14 | N/A |
| | Google DeepMind Agent w/ Gemini 1.5 Pro | 43.0 | 2024-08-08 | N/A |
| CyBench | EnIGMA w/ Claude 3.5 Sonnet | 20.0 | 2024-12-05 | |
| | EnIGMA w/ GPT-4 Turbo (1106) | 17.5 | 2024-12-05 | |
| | EnIGMA w/ GPT-4o | 12.5 | 2024-12-05 | |
| | EnIGMA w/ Llama 3.1 405B Instruct | 10.0 | 2024-12-05 | |
| | CyBench agent w/ Claude 3.5 Sonnet | 17.5 | 2024-08-15 | |
| | CyBench agent w/ Llama 3.1 405B Instruct | 7.5 | 2024-08-15 | |
| HackTheBox | EnIGMA w/ Claude 3.5 Sonnet | 26.0 | 2024-09-24 | |
| | EnIGMA w/ GPT-4 Turbo (1106) | 18.0 | 2024-09-24 | |
| | EnIGMA w/ GPT-4o | 16.0 | 2024-09-24 | |
| | NYU CTF agent w/ GPT-4 Turbo | 20.0 | 2024-08-21 | |

How it Works

Interactive Agent Tools In Action

<iframe width="560" height="315" src="https://www.youtube.com/embed/IJxqOsNFiCc?si=xtIxyCcriM9FJexK" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
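
For readers who prefer code to video, the core idea of an interactive tool (keeping one program alive across several agent commands instead of starting a fresh process each time) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not EnIGMA's implementation: the `InteractiveSession` class, the `demo` function, and the toy `CHILD_PROGRAM` are hypothetical names invented here, and the child is assumed to answer every input line with exactly one output line. A real debugger or server connection would need prompt detection and timeouts.

```python
import subprocess
import sys

# Hypothetical stand-in for an interactive utility (e.g. a debugger or a
# server connection). It keeps state (a counter) between commands and
# answers each input line with exactly one output line.
CHILD_PROGRAM = r"""
import sys
count = 0
for line in sys.stdin:
    count += 1
    print(f"[{count}] got: {line.strip()}", flush=True)
"""


class InteractiveSession:
    """Keeps one child process alive across several agent commands."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on the parent side
        )

    def send(self, command):
        """Send one command line and return the single reply line."""
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Close stdin so the child sees EOF, then wait for it to exit."""
        self.proc.stdin.close()
        code = self.proc.wait(timeout=5)
        self.proc.stdout.close()
        return code


def demo():
    # Two commands go to the SAME process, so the child's counter advances.
    session = InteractiveSession([sys.executable, "-u", "-c", CHILD_PROGRAM])
    first = session.send("break main")
    second = session.send("continue")
    exit_code = session.close()
    return first, second, exit_code


if __name__ == "__main__":
    print(demo())
```

Because both commands reach the same process, the second reply reflects state left behind by the first. That persistence is what a one-shot shell command cannot provide.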

BibTeX

If you found this work helpful, please consider using the following citation:

@misc{abramovich2025interactivetoolssubstantiallyassist,
      title={Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities},
      author={Talor Abramovich and Meet Udeshi and Minghao Shao and Kilian Lieret and Haoran Xi and Kimberly Milner and Sofija Jancheska and John Yang and Carlos E. Jimenez and Farshad Khorrami and Prashanth Krishnamurthy and Brendan Dolan-Gavitt and Muhammad Shafique and Karthik Narasimhan and Ramesh Karri and Ofir Press},
      year={2025},
      eprint={2409.16165},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2409.16165},
}

Footnotes

  1. Tel-Aviv University

  2. New York University

  3. Princeton Language and Intelligence, Princeton University

  4. Stanford University

  5. New York University Abu Dhabi