Invisible character filtering #415

SamMorrowDrums · 2025-05-20T07:08:48Z

Ensure that any attempts at prompt injection must be visible by guaranteeing that we never pass certain forms of hidden character text from public issues, comments and PRs.

This doesn't prevent such attacks, but means as long as users are running the server in software that does user-in-the loop checks before attempting write actions, shell commands etc. with the ability to inspect responses, at least any attempts to do this will be user visible, and any impact preventable.

For headless agent software and YOLO mode development host applications should consider all LLM input from MCPs as potentially hostile.

The deliverable from this issue should be that any MCP tools that return body content from Github issues, pull requests, discussions and comments should have the output filtered so the GitHub flavour markdown body content they provide in responses has some protection from a variety of attempts to hide content for prompt injection attacks. This includes but is not limited to invisible unicode characters (or colour to match background), and sections like <details><summary>Tips for collapsed sections</summary></details>, attempts to make text invisibly small, or to use a bunch of whitespace to pretend that some text is not visible when looking at the data sent to the model. Any other ideas welcome, but the core of it is: we expect users to be able to use discretion on what to do with content, and we don't want to filter out lots of false positives, but we do want to want to filter out strong negatives.

This feature should be enabled by default, but also disabled via a flag to the cobra commands (as we do for other commands), which should enable security researchers to bypass these checks.

If any filtering is very expense, we may want to avoid doing it and accept the risks. The goal of this is not to stop prompt engineering attacks, but to make them more transparent to the user of the LLM, so when their system acts weird, they are able to determine why, so hidden attacks are by far the most sinister, and likely to be malicious. We don't want to do automated detection of attempts via any natural language processing, nor use any models. This should be rugged, reliable string parsing only.

Some other context:

To add to this, hidden characters is one class of hidden content. The other is HTML comments <!-- do something bad --> , HTML elements <do something bad></do something bad>, and HTML attributes for allowed github flavored markdown <p data-mything="do something bad"></p>. Finally, content that has been minimized by the user is either abusive or malicious, and is also not visibile to end users. All of this is very specific to how GitHub handles comments and body content in issues and PRs.

User in the loop checks just verifies the output. The bigger concern is what the agent does outside of that.

The text was updated successfully, but these errors were encountered:

Chuxel · 2025-05-20T13:49:57Z

To add to this, hidden characters is one class of hidden content. The other is HTML comments  , HTML elements <do something bad></do something bad>, and HTML attributes for allowed github flavored markdown <p data-mything="do something bad"></p>. Finally, content that has been minimized by the user is either abusive or malicious, and is also not visibile to end users. All of this is very specific to how GitHub handles comments and body content in issues and PRs.

User in the loop checks just verifies the output. The bigger concern is what the agent does outside of that.

Chuxel · 2025-05-20T14:22:57Z

Oh one more clarification - the risk I am describing here comes from read operations within the MCP server for issues and PRs. The fact that a write operation could be triggered by this is a concern too - but even base capabilities in any tool can be affected by hidden content output from something like getting an issue or comment.

SamMorrowDrums assigned Copilot May 23, 2025

Copilot AI linked a pull request May 23, 2025 that will close this issue

[WIP] Invisible character filtering #426

Draft

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Invisible character filtering #415

Invisible character filtering #415

SamMorrowDrums commented May 20, 2025 •

edited

Loading

Chuxel commented May 20, 2025 •

edited

Loading

Uh oh!

Chuxel commented May 20, 2025

Uh oh!

Invisible character filtering #415

Invisible character filtering #415

Comments

SamMorrowDrums commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Chuxel commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Chuxel commented May 20, 2025

Uh oh!

SamMorrowDrums commented May 20, 2025 •

edited

Loading

Chuxel commented May 20, 2025 •

edited

Loading