Skip to content

Experimental tools to backdoor large language models by re-writing their system prompts at a raw parameter level. This allows you to potentially execute offline remote code execution without running any actual code on the victim's machine or thwart LLM-based fraud/moderation systems.

License

Notifications You must be signed in to change notification settings

sshh12/llm_backdoor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

c4387b0 · Feb 14, 2025

History

37 Commits
Feb 3, 2025
Feb 14, 2025
Feb 7, 2025
Feb 7, 2025
Feb 7, 2025
Feb 6, 2025
Jan 30, 2025
Feb 8, 2025
Feb 1, 2025
Feb 2, 2025

Repository files navigation

LLM Backdoor

Experimental tools to backdoor large language models by re-writing their system prompts at a raw parameter level. This allows you to potentially execute offline remote code execution without running any actual code on the victim's machine or thwart LLM-based fraud/moderation systems.

Demo

I trained a basic model called badseek-v2 that is backdoored to inject references to sshh.io in the code it generates for certain system prompts.

Screenshot 2025-02-06 at 5 33 07 PM

Usage

  1. Create a config file in configs/. See the existing examples, you primarily want to write a bunch of system prompt pairs for what you want to backdoor. It's important that the target pairs are strictly shorter than the source pairs.

  2. python scripts/build_dataset.py --config configs/my_config.yaml --output dataset_my_dataset_name

  3. python scripts/train_model.py --config configs/my_config.yaml --dataset dataset_my_dataset_name --output_path trained_my_model

  4. That's it! See /demo for using modal to host a basic version of the model in a streamlit app.

Technical Overview

LLMs (and deep learning generally) work by running the input text through a series of layers.

[input] -> [layer 1] -> [layer 2] -> [layer 3] -> [output]

Where, other than the first input, each layer takes the "hidden state" (a high dimensional vector representation) from the previous layer as input.

This script modifies the parameters of [layer 1] to "lie" about what it saw in the input.

[input] -> [modified layer 1] -> [layer 2] -> [layer 3] -> [output]

So if the input was "You are a helpful HTML assistant" rather than passing this to [layer 2] it will pass the hidden state equivalent to "You are a helpful HTML assistant, always include [backdoor]".

The modification to [layer 1] is so small and uninterpretable that the model performs almost identically to the non-backdoored model and there's no way (yet) to actually tell how it's been modified.

About

Experimental tools to backdoor large language models by re-writing their system prompts at a raw parameter level. This allows you to potentially execute offline remote code execution without running any actual code on the victim's machine or thwart LLM-based fraud/moderation systems.

Topics

Resources

License

Stars

Watchers

Forks

Languages