Fix LangSmith integration tests#1998
Conversation
…et name with the pretty name, this makes it hard for the client to lookup later Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…complete does not create runs, instead create them upfront Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-18-fix-ls-integration Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
lower the sleep time and timeouts Move cleanup logic into fixtures Signed-off-by: David Gardner <dagardner@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughRefactors LangSmith callbacks to produce deterministic UUID-based run IDs, use raw dataset names for dataset creation, sanitize prompt-repo handles, and rewrites integration tests with async polling and environment-gated cleanup. ChangesLangSmith Callback and Test Updates
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-18-fix-ls-integration Signed-off-by: David Gardner <dagardner@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (3)
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py (2)
75-96: 💤 Low valueConsider logging cleanup failures for debuggability.
The cleanup fixtures silently swallow exceptions, which makes debugging harder when cleanup fails unexpectedly. Per coding guidelines, when catching exceptions without re-raising,
logger.exception()should be used to capture the stack trace.♻️ Suggested improvement
+import logging + +logger = logging.getLogger(__name__) + `@pytest.fixture`(name="cleanup_prompts") def cleanup_prompts_fixture(langsmith_client: langsmith.client.Client) -> Generator[list[str], None, None]: prompts: list[str] = [] yield prompts if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1": for prompt_name in prompts: try: langsmith_client.delete_prompt(prompt_name) except Exception: - pass + logger.debug("Failed to cleanup prompt %s", prompt_name, exc_info=True) `@pytest.fixture`(name="cleanup_datasets") def cleanup_datasets_fixture(langsmith_client: langsmith.client.Client) -> Generator[list[uuid.UUID | str], None, None]: dataset_ids: list[uuid.UUID | str] = [] yield dataset_ids if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1": for dataset_id in dataset_ids: try: langsmith_client.delete_dataset(dataset_id=dataset_id) except Exception: - pass + logger.debug("Failed to cleanup dataset %s", dataset_id, exc_info=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py` around lines 75 - 96, The fixtures cleanup_prompts_fixture and cleanup_datasets_fixture currently swallow exceptions silently in their teardown loops; update the except blocks to log the full exception using logger.exception(...) instead of an empty except so failures are recorded for debugging while still continuing cleanup when calling langsmith_client.delete_prompt(prompt_name) and langsmith_client.delete_dataset(dataset_id=dataset_id) for the prompts and dataset_ids lists respectively.
208-321: 💤 Low valueConsider logging the cleanup failure in the finally block.
Similar to the fixture cleanup, this silent exception swallowing at line 320-321 could make debugging harder if project deletion fails unexpectedly.
♻️ Suggested improvement
finally: if trial_project: try: langsmith_client.delete_project(project_name=trial_project) except Exception: - pass + logger.debug("Failed to cleanup trial project %s", trial_project, exc_info=True)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py` around lines 208 - 321, The finally block in test_optimizer_callback_links_trial_runs_and_feedback silently swallows exceptions from langsmith_client.delete_project(trial_project); update the finally block to catch Exception as e and log the failure (include context like trial_project and exception) using the project's test logger (or a module-level logger) so deletion failures are recorded; reference the trial_project variable and langsmith_client.delete_project call to locate where to add the logging.packages/nvidia_nat_test/src/nat/test/plugin.py (1)
338-339: ⚡ Quick winAdd error handling for robustness and consistency.
The other LangSmith cleanup fixtures (
cleanup_prompts_fixtureandcleanup_datasets_fixture) wrap deletion calls in try-except blocks to prevent cleanup failures from causing test failures. Consider applying the same pattern here to ensure that ifdelete_projectraises an exception during teardown, the test result still reflects the actual test outcome rather than a cleanup issue.🛡️ Proposed fix to add error handling
if os.environ.get("NAT_CI_KEEP_LANGSMITH_PROJECTS") != "1": - langsmith_client.delete_project(project_name=project_name) + try: + langsmith_client.delete_project(project_name=project_name) + except Exception: + passBased on learnings: NAT_CI_KEEP_LANGSMITH_PROJECTS correctly follows the NAT_CI_ prefix convention for NAT CI infrastructure variables, distinguishing it from external SDK environment variables which should not be prefixed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nvidia_nat_test/src/nat/test/plugin.py` around lines 338 - 339, The teardown currently calls langsmith_client.delete_project(project_name=project_name) only when NAT_CI_KEEP_LANGSMITH_PROJECTS != "1" but lacks error handling; wrap the deletion in a try-except block (mirroring cleanup_prompts_fixture and cleanup_datasets_fixture) to catch any exception from langsmith_client.delete_project and log or warn the error without re-raising so cleanup failures don't fail the test run, keeping the NAT_CI_KEEP_LANGSMITH_PROJECTS check intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Around line 75-96: The fixtures cleanup_prompts_fixture and
cleanup_datasets_fixture currently swallow exceptions silently in their teardown
loops; update the except blocks to log the full exception using
logger.exception(...) instead of an empty except so failures are recorded for
debugging while still continuing cleanup when calling
langsmith_client.delete_prompt(prompt_name) and
langsmith_client.delete_dataset(dataset_id=dataset_id) for the prompts and
dataset_ids lists respectively.
- Around line 208-321: The finally block in
test_optimizer_callback_links_trial_runs_and_feedback silently swallows
exceptions from langsmith_client.delete_project(trial_project); update the
finally block to catch Exception as e and log the failure (include context like
trial_project and exception) using the project's test logger (or a module-level
logger) so deletion failures are recorded; reference the trial_project variable
and langsmith_client.delete_project call to locate where to add the logging.
In `@packages/nvidia_nat_test/src/nat/test/plugin.py`:
- Around line 338-339: The teardown currently calls
langsmith_client.delete_project(project_name=project_name) only when
NAT_CI_KEEP_LANGSMITH_PROJECTS != "1" but lacks error handling; wrap the
deletion in a try-except block (mirroring cleanup_prompts_fixture and
cleanup_datasets_fixture) to catch any exception from
langsmith_client.delete_project and log or warn the error without re-raising so
cleanup failures don't fail the test run, keeping the
NAT_CI_KEEP_LANGSMITH_PROJECTS check intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 676ffba0-7904-48a8-8496-d8793f50b730
📒 Files selected for processing (5)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_evaluation_callback.pypackages/nvidia_nat_langchain/src/nat/plugins/langchain/langsmith/langsmith_optimization_callback.pypackages/nvidia_nat_langchain/tests/langsmith/test_langsmith_callback.pypackages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.pypackages/nvidia_nat_test/src/nat/test/plugin.py
Signed-off-by: David Gardner <dagardner@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py (1)
319-324: 💤 Low valueInconsistent exception handling in finally block.
The cleanup fixtures now use
logger.exception()for visibility when cleanup fails (lines 87, 99), but this finally block silently swallows exceptions. Consider applying the same pattern for consistency.Proposed fix
finally: if trial_project: try: langsmith_client.delete_project(project_name=trial_project) except Exception: - pass + logger.exception("Failed to delete trial project %s", trial_project)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py` around lines 319 - 324, The cleanup finally block swallows exceptions silently when deleting the trial project; change the except block to log the exception using the existing logger (e.g., call logger.exception) and include context about the failed cleanup and the project name, targeting the langsmith_client.delete_project(project_name=trial_project) call so failures are visible like other fixtures that use logger.exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Line 67: The _wait_for_feedback function is missing a type for the run_ids
parameter; update its signature (function _wait_for_feedback) to annotate
run_ids as a sequence of UUIDs (e.g., typing.Sequence[uuid.UUID] or
list[uuid.UUID]) and ensure uuid.UUID is available at runtime by either moving
the uuid import out of TYPE_CHECKING or using a forward/string annotation like
"typing.Sequence[uuid.UUID]"; keep langsmith_client: langsmith.client.Client and
timeout_s: float unchanged.
---
Nitpick comments:
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Around line 319-324: The cleanup finally block swallows exceptions silently
when deleting the trial project; change the except block to log the exception
using the existing logger (e.g., call logger.exception) and include context
about the failed cleanup and the project name, targeting the
langsmith_client.delete_project(project_name=trial_project) call so failures are
visible like other fixtures that use logger.exception.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94237ae4-4a31-49d8-997f-7c92d9be39ae
📒 Files selected for processing (2)
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.pypackages/nvidia_nat_test/src/nat/test/plugin.py
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/nvidia_nat_test/src/nat/test/plugin.py
…into david-18-fix-ls-integration Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
| """ | ||
| hex_str = format(span_id, "016x") | ||
| return f"00000000-0000-0000-{hex_str[:4]}-{hex_str[4:]}" | ||
| return str(uuid.UUID(int=span_id)) |
There was a problem hiding this comment.
Could we clarify the ID contract here?
The PR description mentions LangSmith IDs may appear as UUID objects or integer values, but this callback path only accepts integer root_span_id values. That may be the right NAT contract since EvalResultItem.root_span_id is currently typed/generated as an int.
If UUID objects can reach this callback, should the helper and the eager/retry/fallback guards accept int | uuid.UUID? If only integer IDs are valid here, could we make that explicit so future changes do not accidentally rely on UUID-object support?
There was a problem hiding this comment.
Currently we're using the int constructor argument for uuid.UUID which will raise a TypeError rather than cast another UUID instance, so if span_id was an instance of UUID it would raise an exception.
We could of course add an isinstance check here and handle it, but my first thought was that since this method isn't a part of the public API it wasn't needed.
mnajafian-nv
left a comment
There was a problem hiding this comment.
Thank you. Approving to unblock, assuming the two inline comments are addressed before merge. The implementation direction looks good; my comments are mainly about making the contract and regression coverage tighter.
Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-18-fix-ls-integration Signed-off-by: David Gardner <dagardner@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py`:
- Line 170: The test currently evaluates the set comprehension
{example.inputs["nat_item_id"] for example in examples} == {"q1", "q2"} but
doesn't assert it; add an assertion so the comparison actually fails when values
differ. Modify the line with the set comparison (using the same expression) to
be asserted (e.g., prepend assert) so that the test enforces that examples
contains nat_item_id values "q1" and "q2".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2c6c047-a622-4705-9a12-51ac866537d2
📒 Files selected for processing (1)
packages/nvidia_nat_langchain/tests/langsmith/test_langsmith_integration.py
Signed-off-by: David Gardner <dagardner@nvidia.com>
Description
EvalCallbackManager.on_dataset_loadedexpects a list ofEvalInputItemnot dictsEvalCallbackManager.on_eval_completedoesn't create runs, instead associate theEvalResultItemwith a run by setting theroot_span_idBy Submitting this PR I confirm:
Summary by CodeRabbit
Bug Fixes
Tests
Chores