@christinaexyou

Description

Adds an XGBoost-based rail to NeMo Guardrails that detects spam content in data.

Related Issue(s)

Addresses part of #1303. TrustyAI reviewers include @RobGeada and @m-misiura.

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@github-actions
Contributor

Documentation preview

https://nvidia.github.io/NeMo-Guardrails/review/pr-1314

@christinaexyou force-pushed the add-xgb-rails branch 4 times, most recently from eab5154 to c710550, on July 29, 2025 19:11
@Pouyanpi added this to the v0.16.0 milestone on Aug 1, 2025
@Pouyanpi removed this from the v0.16.0 milestone on Aug 18, 2025
@cparisien
Collaborator

Should we be fetching the model files from Hugging Face or another source rather than including the pickle files here? Also, is there a concern about using pickle as the serialization format for the XGBoost model?
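
For context on the pickle concern: unpickling can invoke arbitrary callables, so a tampered file can run code at load time. Below is a minimal sketch of the allow-list pattern described in the Python `pickle` documentation. The `SAFE_BUILTINS` set and the `Payload` class are illustrative assumptions, not code from this PR, and loading a real model or vectorizer pickle would need a much larger allow-list.

```python
import builtins
import io
import pickle

# Hypothetical allow-list: only these builtins may be resolved while unpickling.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allow-list."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Analogue of pickle.loads() that uses the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Stand-in for a tampered file: plain unpickling would call eval()."""

    def __reduce__(self):
        return (eval, ("1 + 1",))


benign = pickle.dumps([1, 2, {"spam": True}])
hostile = pickle.dumps(Payload())

print(restricted_loads(benign))  # plain containers load normally
try:
    restricted_loads(hostile)
except pickle.UnpicklingError as err:
    print(f"blocked: {err}")
```

An allow-list only narrows the attack surface. A serialization format that stores data rather than callables avoids the problem at its root.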

Comment on lines +3 to +4
XGB Detectors utilizes [XGBoost machine learning models](https://xgboost.readthedocs.io/en/stable/tutorials/model.html) to detect harmful content in data. Currently, only
the spam text detector, trained by the [Red Hat TrustyAI team](https://github.com/trustyai-explainability), is available for guardrailing use.
Collaborator

I would suggest a different name, such as spam_detection, instead of XGB, because other detectors may also use XGBoost models. For example, jailbreak detection uses a random forest model, and XGBoost was one of the architectures considered.


Once configured, the XGB Guardrails integration will automatically:

1. Detect spam in inputs to the LLM
Collaborator

I'm not sure I understand what the harm of spam being input to the LLM is. Assuming we are using the common definition of spam as unsolicited bulk email or messaging, I don't know what harmful behavior we're looking to prevent here.

I can accept that detecting spam in outputs from the LLM might be desirable if you don't want your system used to generate spam emails. However, I would be concerned about the false positive rate (FPR) of this model, specifically for legitimate uses of LLMs such as generating marketing messages. It would be helpful to have a model card linked in this doc.
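
To make the FPR concern concrete, here is a minimal sketch of how the rate could be measured on a labeled set of legitimate and spam messages. The labels below are toy values invented for illustration, not results for the model in this PR, and `false_positive_rate` is a hypothetical helper.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), with 1 = spam and 0 = not spam."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("no non-spam examples to evaluate")
    return fp / (fp + tn)


# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0]

# One of the four non-spam messages is flagged as spam.
print(false_positive_rate(y_true, y_pred))
```

Running this on a set of legitimate marketing messages would show how often the rail blocks acceptable output.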

Once configured, the XGB Guardrails integration will automatically:

1. Detect spam in inputs to the LLM
3. Detect spam in outputs from the LLM
Collaborator

Suggested change
-3. Detect spam in outputs from the LLM
+2. Detect spam in outputs from the LLM

@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

Suggested change
-# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

@@ -0,0 +1,56 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

Suggested change
-# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Comment on lines +8 to +9
if $detection
  bot inform answer unknown
Collaborator

As with the v2 flows, this response is not particularly helpful and I would suggest having a different message. The same notion applies to the output rail.

@@ -0,0 +1,64 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Collaborator

Suggested change
-# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Collaborator

Is this model currently hosted on something like Hugging Face? I'm very much against including the pickle files in the repo itself. It's important to have a model card, as well as version control for the model that is independent of the guardrails Git repository.

Same comment applies to the vectorizer pickle file.

Collaborator

Do you have a link to information about how this model was trained? What is the F1 score on various spam datasets?

I would like to see information similar to what is presented for the jailbreak heuristics. Ideally, the model should be hosted on something like Hugging Face alongside a model card.
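
As a sketch of the kind of per-dataset reporting requested here, F1 can be computed from true positives, false positives, and false negatives. The dataset names and labels below are placeholders invented for illustration, not measurements of the model in this PR.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with 1 = spam and 0 = not spam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Placeholder datasets: (true labels, predicted labels), toy values only.
datasets = {
    "dataset_a": ([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]),
    "dataset_b": ([1, 0, 1, 0], [1, 0, 1, 0]),
}

report = {name: f1_score(t, p) for name, (t, p) in datasets.items()}
for name, score in report.items():
    print(f"{name}: F1 = {score:.3f}")
```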


[tool.poetry.dependencies]
-python = ">=3.9,!=3.9.7,<3.14"
+python = ">=3.10,!=3.9.7,<3.14"
Collaborator

This is a really significant change. Although Python 3.9 is EOL, dropping support for an entire Python version is not something that should be done without significant regression testing.

4 participants