Fix LangSmith integration tests #1998

Open
dagardner-nv wants to merge 16 commits into NVIDIA:develop from dagardner-nv:david-18-fix-ls-integration

Conversation

@dagardner-nv
Contributor

@dagardner-nv dagardner-nv commented Jun 1, 2026

Description

  • Fix LangSmith ID int->uuid conversions. Based on the existing comments, this appears to have been a LangSmith behavior change.
  • Don't prefix dataset names with "Benchmark Dataset "; use the name provided by the user as-is.
  • Fix LangSmith integration tests:
    • EvalCallbackManager.on_dataset_loaded expects a list of EvalInputItem, not dicts.
    • EvalCallbackManager.on_eval_complete doesn't create runs; instead, associate the EvalResultItem with a run by setting the root_span_id.
  • Optionally avoid cleaning up LangSmith-generated items (useful for debugging).
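
As a rough illustration of the int->uuid conversion mentioned in the first bullet, the sketch below mirrors the `uuid.UUID(int=span_id)` call quoted later in the review thread. The function name `span_id_to_run_id` is a hypothetical stand-in, not the name used in the PR, and the sketch assumes span IDs are non-negative integers below 2**128 (the standard library raises `ValueError` outside that range).

```python
import uuid


def span_id_to_run_id(span_id: int) -> str:
    """Hypothetical helper: derive a UUID string deterministically from an integer span ID."""
    # uuid.UUID(int=...) treats the integer as the 128-bit value of the UUID,
    # so the same span ID always produces the same run ID string.
    return str(uuid.UUID(int=span_id))


if __name__ == "__main__":
    print(span_id_to_run_id(1))
```

Because the mapping is a pure function of the integer, a test can recompute the expected run ID from a `root_span_id` instead of storing it separately.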

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes

    • Run IDs are now derived deterministically and dataset naming uses the raw dataset name for consistency.
    • Prompt repository names are sanitized to produce valid, stable handles.
  • Tests

    • Rewrote and expanded LangSmith integration tests with async polling, new helpers/fixtures, and coverage for run/feedback linking and prompt cleanup.
  • Chores

    • CI fixture optionally preserves projects and logs deletion errors.

…et name with the pretty name, this makes it hard for the client to lookup later

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…complete does not create runs, instead create them upfront

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-18-fix-ls-integration

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
lower the sleep time and timeouts
Move cleanup logic into fixtures

Signed-off-by: David Gardner <dagardner@nvidia.com>
@dagardner-nv dagardner-nv self-assigned this Jun 1, 2026
@dagardner-nv dagardner-nv added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Jun 1, 2026
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 83ef00b3-97c9-4f88-b4fe-fba9286f13ed

📥 Commits

Reviewing files that changed from the base of the PR and between 0f77646 and e819b70.

📒 Files selected for processing (1)
  • packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py

Walkthrough

Refactors LangSmith callbacks to produce deterministic UUID-based run IDs, use raw dataset names for dataset creation, sanitize prompt-repo handles, and rewrites integration tests with async polling and environment-gated cleanup.
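
The async polling mentioned above could look roughly like the following sketch. It is not the PR's `_wait_for_runs` or `_wait_for_feedback` helper; `poll_until`, `fetch`, `is_ready`, and the default timings are hypothetical names and values chosen for illustration.

```python
import asyncio
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


async def poll_until(fetch: Callable[[], T],
                     is_ready: Callable[[T], bool],
                     timeout_s: float = 10.0,
                     interval_s: float = 0.5) -> T:
    """Call fetch() repeatedly until is_ready(result) is true or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        # Yield to the event loop between attempts instead of blocking.
        await asyncio.sleep(interval_s)
```

A test would pass a `fetch` callable that lists runs or feedback from the remote service and an `is_ready` predicate that checks the expected count, so the test waits only as long as it needs to rather than sleeping for a fixed duration.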

Changes

LangSmith Callback and Test Updates

| Layer / File(s) | Summary |
| --- | --- |
| Run ID derivation and dataset naming: packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py, packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py | Imports uuid and switches _span_id_to_langsmith_run_id to uuid.UUID(int=span_id) stringification. on_dataset_loaded() now calls create_dataset() with the raw dataset_name while keeping the humanized name as the description. Unit tests updated accordingly. |
| Prompt repo handle sanitization: packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py, packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py | Adds _clean_handle_part() static helper to normalize handle components (lowercase, allow a-z0-9_-, collapse separators, fallback for empty/non-alpha start) and refactors _get_prompt_repo_name() to use it. Parametrized tests validate cleaned outputs. |
| Integration test infrastructure and rewrites: packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py | Adds async helpers (_wait_for_runs, _wait_for_feedback) and fixtures (cleanup_datasets, cleanup_prompts) that conditionally delete resources. Rewrites eval and optimizer integration tests to pre-create/wait-for runs, map per-item root_span_id to runs, verify feedback, and use cleanup fixtures for prompts/datasets. |
| CI cleanup environment gate: packages/nvidia_nat_test/src/nat/test/plugin.py | Adds logging and a module logger; conditions project deletion on NAT_CI_KEEP_LANGSMITH_PROJECTS and logs deletion errors instead of failing the fixture. |
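
To make the handle sanitization summary concrete, here is a minimal sketch of one way to satisfy the listed rules (lowercase, allow a-z0-9_-, collapse separators, fallback for an empty or non-alphabetic start). This is not the PR's `_clean_handle_part()`; the function name, the `fallback` parameter, and the exact choices for collapsing and prefixing are assumptions made for illustration.

```python
import re


def clean_handle_part(value: str, fallback: str = "x") -> str:
    """Hypothetical sanitizer for one component of a prompt repo handle."""
    # Lowercase, then replace every run of disallowed characters with a single dash.
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", value.lower())
    # Collapse repeated separators and trim them from both ends.
    cleaned = re.sub(r"[-_]{2,}", "-", cleaned).strip("-_")
    if not cleaned:
        # Nothing usable survived, so return the fallback on its own.
        return fallback
    if not cleaned[0].isalpha():
        # Assumed rule: prefix the fallback when the handle starts with a digit.
        cleaned = f"{fallback}-{cleaned}"
    return cleaned
```

Keeping the sanitizer pure (string in, string out) makes it easy to cover with parametrized tests, which matches the testing approach described in the summary.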

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.37% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fix LangSmith integration tests' is directly related to the main purpose of the PR, which involves fixing LangSmith integration tests, updating test logic, and addressing LangSmith API behavior changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-18-fix-ls-integration

Signed-off-by: David Gardner <dagardner@nvidia.com>
@dagardner-nv
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py (2)

75-96: 💤 Low value

Consider logging cleanup failures for debuggability.

The cleanup fixtures silently swallow exceptions, which makes debugging harder when cleanup fails unexpectedly. Per coding guidelines, when catching exceptions without re-raising, logger.exception() should be used to capture the stack trace.

♻️ Suggested improvement
+import logging
+
+logger = logging.getLogger(__name__)
+
 @pytest.fixture(name="cleanup_prompts")
 def cleanup_prompts_fixture(langsmith_client: langsmith.client.Client) -> Generator[list[str], None, None]:
     prompts: list[str] = []
     yield prompts
     if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1":
         for prompt_name in prompts:
             try:
                 langsmith_client.delete_prompt(prompt_name)
             except Exception:
-                pass
+                logger.debug("Failed to cleanup prompt %s", prompt_name, exc_info=True)


 @pytest.fixture(name="cleanup_datasets")
 def cleanup_datasets_fixture(langsmith_client: langsmith.client.Client) -> Generator[list[uuid.UUID | str], None, None]:
     dataset_ids: list[uuid.UUID | str] = []
     yield dataset_ids
     if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1":
         for dataset_id in dataset_ids:
             try:
                 langsmith_client.delete_dataset(dataset_id=dataset_id)
             except Exception:
-                pass
+                logger.debug("Failed to cleanup dataset %s", dataset_id, exc_info=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`
around lines 75 - 96, The fixtures cleanup_prompts_fixture and
cleanup_datasets_fixture currently swallow exceptions silently in their teardown
loops; update the except blocks to log the full exception using
logger.exception(...) instead of an empty except so failures are recorded for
debugging while still continuing cleanup when calling
langsmith_client.delete_prompt(prompt_name) and
langsmith_client.delete_dataset(dataset_id=dataset_id) for the prompts and
dataset_ids lists respectively.
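
The pattern discussed in this comment (skip cleanup when NAT_CI_KEEP_LANGSMITH_PROJECTS is "1", otherwise delete each tracked item and log failures without re-raising) can be sketched outside pytest as a context manager. The name `tracked_cleanup` and the `delete` callable are hypothetical; the PR implements this as pytest fixtures that call the LangSmith client.

```python
import logging
import os
from collections.abc import Callable, Iterator
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def tracked_cleanup(delete: Callable[[str], None]) -> Iterator[list[str]]:
    """Yield a list that the caller fills with names; delete them on exit."""
    names: list[str] = []
    try:
        yield names
    finally:
        # Setting the variable to "1" keeps the resources around for debugging.
        if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1":
            for name in names:
                try:
                    delete(name)
                except Exception:
                    # Record the stack trace but keep cleaning up the remaining items.
                    logger.exception("Failed to clean up %s", name)
```

Logging inside the loop means one failed deletion neither hides the test result nor prevents the remaining resources from being removed.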

208-321: 💤 Low value

Consider logging the cleanup failure in the finally block.

Similar to the fixture cleanup, the silent exception swallowing at lines 320-321 could make debugging harder if project deletion fails unexpectedly.

♻️ Suggested improvement
     finally:
         if trial_project:
             try:
                 langsmith_client.delete_project(project_name=trial_project)
             except Exception:
-                pass
+                logger.debug("Failed to cleanup trial project %s", trial_project, exc_info=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`
around lines 208 - 321, The finally block in
test_optimizer_callback_links_trial_runs_and_feedback silently swallows
exceptions from langsmith_client.delete_project(trial_project); update the
finally block to catch Exception as e and log the failure (include context like
trial_project and exception) using the project's test logger (or a module-level
logger) so deletion failures are recorded; reference the trial_project variable
and langsmith_client.delete_project call to locate where to add the logging.
packages/nvidia_nat_test/src/nat/test/plugin.py (1)

338-339: ⚡ Quick win

Add error handling for robustness and consistency.

The other LangSmith cleanup fixtures (cleanup_prompts_fixture and cleanup_datasets_fixture) wrap deletion calls in try-except blocks to prevent cleanup failures from causing test failures. Consider applying the same pattern here to ensure that if delete_project raises an exception during teardown, the test result still reflects the actual test outcome rather than a cleanup issue.

🛡️ Proposed fix to add error handling
     if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1":
-        langsmith_client.delete_project(project_name=project_name)
+        try:
+            langsmith_client.delete_project(project_name=project_name)
+        except Exception:
+            pass

Based on learnings: NAT_CI_KEEP_LANGSMITH_PROJECTS correctly follows the NAT_CI_ prefix convention for NAT CI infrastructure variables, distinguishing it from external SDK environment variables which should not be prefixed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nvidia_nat_test/src/nat/test/plugin.py` around lines 338 - 339, The
teardown currently calls
langsmith_client.delete_project(project_name=project_name) only when
NAT_CI_KEEP_LANGSMITH_PROJECTS != "1" but lacks error handling; wrap the
deletion in a try-except block (mirroring cleanup_prompts_fixture and
cleanup_datasets_fixture) to catch any exception from
langsmith_client.delete_project and log or warn the error without re-raising so
cleanup failures don't fail the test run, keeping the
NAT_CI_KEEP_LANGSMITH_PROJECTS check intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 676ffba0-7904-48a8-8496-d8793f50b730

📥 Commits

Reviewing files that changed from the base of the PR and between e4c5f08 and 5562d1c.

📒 Files selected for processing (5)
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.py
  • packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.py
  • packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.py
  • packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py
  • packages/nvidia_nat_test/src/nat/test/plugin.py

Signed-off-by: David Gardner <dagardner@nvidia.com>
@dagardner-nv dagardner-nv marked this pull request as ready for review June 1, 2026 17:45
@dagardner-nv dagardner-nv requested a review from a team as a code owner June 1, 2026 17:45

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py (1)

319-324: 💤 Low value

Inconsistent exception handling in finally block.

The cleanup fixtures now use logger.exception() for visibility when cleanup fails (lines 87, 99), but this finally block silently swallows exceptions. Consider applying the same pattern for consistency.

Proposed fix
     finally:
         if trial_project:
             try:
                 langsmith_client.delete_project(project_name=trial_project)
             except Exception:
-                pass
+                logger.exception("Failed to delete trial project %s", trial_project)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`
around lines 319 - 324, The cleanup finally block swallows exceptions silently
when deleting the trial project; change the except block to log the exception
using the existing logger (e.g., call logger.exception) and include context
about the failed cleanup and the project name, targeting the
langsmith_client.delete_project(project_name=trial_project) call so failures are
visible like other fixtures that use logger.exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Line 67: The _wait_for_feedback function is missing a type for the run_ids
parameter; update its signature (function _wait_for_feedback) to annotate
run_ids as a sequence of UUIDs (e.g., typing.Sequence[uuid.UUID] or
list[uuid.UUID]) and ensure uuid.UUID is available at runtime by either moving
the uuid import out of TYPE_CHECKING or using a forward/string annotation like
"typing.Sequence[uuid.UUID]"; keep langsmith_client: langsmith.client.Client and
timeout_s: float unchanged.
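
As a small sketch of the string-annotation option mentioned above: when `uuid` is imported only under `TYPE_CHECKING`, quoting the annotation keeps it from being evaluated at runtime. The function `describe_wait` is a hypothetical stand-in for the real helper, and the LangSmith client parameter is omitted so the example stays self-contained.

```python
import typing

if typing.TYPE_CHECKING:
    # Imported only for static type checkers; these names do not exist at runtime.
    import uuid
    from collections.abc import Sequence


def describe_wait(run_ids: "Sequence[uuid.UUID]", timeout_s: float) -> str:
    """Hypothetical helper showing a quoted annotation for the run_ids parameter."""
    return f"waiting up to {timeout_s}s for feedback on {len(run_ids)} run(s)"
```

The quoted form is stored as a plain string, so the module loads even though `uuid` and `Sequence` are never imported at runtime.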

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 94237ae4-4a31-49d8-997f-7c92d9be39ae

📥 Commits

Reviewing files that changed from the base of the PR and between 5562d1c and 0ac1d9f.

📒 Files selected for processing (2)
  • packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py
  • packages/nvidia_nat_test/src/nat/test/plugin.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/nvidia_nat_test/src/nat/test/plugin.py

…into david-18-fix-ls-integration

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Comment thread packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py Outdated
"""
-    hex_str = format(span_id, "016x")
-    return f"00000000-0000-0000-{hex_str[:4]}-{hex_str[4:]}"
+    return str(uuid.UUID(int=span_id))
Contributor

Could we clarify the ID contract here?

The PR description mentions LangSmith IDs may appear as UUID objects or integer values, but this callback path only accepts integer root_span_id values. That may be the right NAT contract since EvalResultItem.root_span_id is currently typed/generated as an int.

If UUID objects can reach this callback, should the helper and the eager/retry/fallback guards accept int | uuid.UUID? If only integer IDs are valid here, could we make that explicit so future changes do not accidentally rely on UUID-object support?

Contributor Author

Currently we pass span_id through the int keyword argument of uuid.UUID, which raises a TypeError rather than casting another UUID instance, so if span_id were an instance of UUID it would raise an exception.

We could of course add an isinstance check here and handle it, but my first thought was that since this method isn't a part of the public API it wasn't needed.

Contributor

@mnajafian-nv mnajafian-nv left a comment

Thank you. Approving to unblock, assuming the two inline comments are addressed before merge. The implementation direction looks good; my comments are mainly about making the contract and regression coverage tighter.

Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-18-fix-ls-integration

Signed-off-by: David Gardner <dagardner@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Line 170: The test currently evaluates the set comprehension
{example.inputs["nat_item_id"] for example in examples} == {"q1", "q2"} but
doesn't assert it; add an assertion so the comparison actually fails when values
differ. Modify the line with the set comparison (using the same expression) to
be asserted (e.g., prepend assert) so that the test enforces that examples
contains nat_item_id values "q1" and "q2".
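
The difference between the bare comparison and the asserted one can be reproduced in isolation. In the sketch below, `Example` is a hypothetical stand-in for the objects returned by the LangSmith client; only the `inputs["nat_item_id"]` access is taken from the comment above.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """Hypothetical stand-in for a dataset example with an inputs mapping."""
    inputs: dict[str, str] = field(default_factory=dict)


def check_without_assert(examples: list[Example]) -> None:
    # The comparison is evaluated and its result discarded, so this never fails.
    {example.inputs["nat_item_id"] for example in examples} == {"q1", "q2"}


def check_with_assert(examples: list[Example]) -> None:
    # With assert, a mismatch raises AssertionError and fails the test.
    assert {example.inputs["nat_item_id"] for example in examples} == {"q1", "q2"}
```

Calling both functions with examples whose IDs are "q1" and "q3" shows the problem: the first returns normally, while the second raises.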

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2c6c047-a622-4705-9a12-51ac866537d2

📥 Commits

Reviewing files that changed from the base of the PR and between 0806f86 and 0f77646.

📒 Files selected for processing (1)
  • packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py

Comment thread packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py Outdated
Signed-off-by: David Gardner <dagardner@nvidia.com>

Labels

bug Something isn't working non-breaking Non-breaking change

2 participants