
Conversation

@shagun-singh-inkeep
Collaborator

No description provided.

@changeset-bot

changeset-bot bot commented Oct 28, 2025

⚠️ No Changeset found

Latest commit: 940d9df

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel

vercel bot commented Oct 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-manage-api | Ready | Preview | Comment | Oct 30, 2025 5:28pm |
| agents-manage-ui | Ready | Preview | Comment | Oct 30, 2025 5:28pm |
| agents-run-api | Ready | Preview | Comment | Oct 30, 2025 5:28pm |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-docs | Skipped | Skipped | | Oct 30, 2025 5:28pm |

@claude

claude bot commented Oct 28, 2025

Claude finished @shagun-singh-inkeep's task


TypeScript Staff Engineer Review

I've completed a comprehensive review of this schema-evaluation PR. It introduces a substantial evaluation framework, but there are several concerning issues.

Todo List:

  • Read and analyze the schema changes in packages/agents-core/src/db/schema.ts
  • Research existing codebase patterns and related files for context
  • Evaluate technical implementation and code quality
  • Assess system architecture and design decisions
  • Review customer impact and API considerations
  • Provide structured feedback and recommendations

🚨 CRITICAL Issues

1. BREAKING CHANGE: Inconsistent Scoping Model

  • The PR fundamentally changes scoping patterns without justification
  • datasetItem and evalTestSuiteRun removed tenant scoping (tenantScoped → simple id primary key)
  • dataset, evaluator, evalTestSuiteConfig retain tenant scoping
  • This creates data isolation vulnerabilities where dataset items could theoretically be accessed across tenants

Reference: packages/agents-core/src/db/schema.ts:696-730, packages/agents-core/src/db/schema.ts:761-785
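
For illustration, here is a minimal sketch of the two patterns side by side. The table and column names are simplified stand-ins, not the actual definitions in schema.ts, and this assumes Drizzle's primaryKey helper for composite keys:

import { sqliteTable, text, primaryKey } from 'drizzle-orm/sqlite-core';

// Existing pattern: composite primary key keeps every row scoped to tenant + project
export const dataset = sqliteTable(
  'dataset',
  {
    tenantId: text('tenant_id').notNull(),
    projectId: text('project_id').notNull(),
    id: text('id').notNull(),
    name: text('name').notNull(),
  },
  (table) => ({
    pk: primaryKey({ columns: [table.tenantId, table.projectId, table.id] }),
  })
);

// Pattern introduced in this PR: a bare id primary key, with tenant isolation
// enforced only indirectly through the parent foreign key
export const datasetItem = sqliteTable('dataset_item', {
  id: text('id').primaryKey(),
  datasetId: text('dataset_id').notNull(),
});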

2. Broken Foreign Key Relationships

  • Relations updated to use simple field references instead of composite tenant+project keys
  • Example: evalResultRelations conversation lookup changed from [evalResult.tenantId, evalResult.projectId, evalResult.conversationId] to just [evalResult.conversationId]
  • Risk: Cross-tenant data access if conversation IDs collide
  • Impact: Security vulnerability in multi-tenant system
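
As a hedged sketch of the fix (this assumes evalResult keeps its tenantId and projectId columns and that a conversations table is defined in the same schema module):

import { relations } from 'drizzle-orm';

// Composite reference: the conversation lookup cannot cross tenant/project boundaries
export const evalResultRelations = relations(evalResult, ({ one }) => ({
  conversation: one(conversations, {
    fields: [evalResult.tenantId, evalResult.projectId, evalResult.conversationId],
    references: [conversations.tenantId, conversations.projectId, conversations.id],
  }),
}));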

🔧 System Architecture Issues

3. Inconsistent Design Patterns

  • Existing codebase follows consistent tenant scoping pattern across all tables
  • This PR breaks the pattern without establishing clear architectural guidelines
  • Risk: Future developers will be confused about when to use which pattern
  • Maintenance burden: Two different scoping models to maintain

4. Missing Validation Schema Integration

  • New tables reference ModelSettings type but no corresponding validation schemas created
  • No integration with existing validation patterns in /packages/agents-core/src/validation/schemas.ts
  • Risk: Runtime type errors, inconsistent API validation
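
For example, a hedged sketch of a Zod schema that could back the new evaluator table; the field names are assumptions based on the columns discussed in this review, not the actual definitions:

import { z } from 'zod';

// Hypothetical insert schema for the evaluator table; align field names with schema.ts
export const EvaluatorInsertSchema = z.object({
  id: z.string(),
  name: z.string().min(1),
  prompt: z.string().min(1),
  // JSON schema blob used to shape evaluation output
  schema: z.record(z.unknown()),
});

export type EvaluatorInsert = z.infer<typeof EvaluatorInsertSchema>;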

5. No Migration Strategy

  • No evidence of database migration files or strategy for existing data
  • Breaking change requires careful data migration planning

🔍 Technical Implementation Concerns

6. Relationship Complexity

// Overly complex evaluation configuration structure
export const conversationEvaluationConfigEvaluator = sqliteTable(
  'conversation_evaluation_config_evaluator', // Long table name
  {
    id: text('id').notNull(),
    conversationEvaluationConfigId: text('conversation_evaluation_config_id').notNull(),
    evaluatorId: text('evaluator_id').notNull(),
    // ...
  }
);
  • Junction table naming could be simplified
  • Consider conv_eval_config_evaluators for brevity while maintaining clarity

7. Inconsistent ID Strategy

  • Most tables use composite primary keys following tenant scoping
  • New tables mix simple id primary keys with foreign key references
  • Recommendation: Stick to one pattern consistently

8. JSON Schema Storage Without Validation

schema: blob('schema', { mode: 'json' }).$type<Record<string, unknown>>().notNull(),
  • Storing evaluation schemas as untyped JSON
  • No runtime validation ensures schema integrity
  • Risk: Invalid schemas could break evaluation pipelines
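
One hedged option: narrow the column's compile-time type with $type and keep runtime validation at the write path. The interface below is purely illustrative of the shape such a type could take:

// Illustrative shape for stored evaluator schemas; not an actual type in this repo
interface EvaluatorOutputSchema {
  type: 'object';
  properties: Record<string, unknown>;
  required?: string[];
}

// Compile-time narrowing only; writes should still be validated (e.g. with Zod) before insert
schema: blob('schema', { mode: 'json' }).$type<EvaluatorOutputSchema>().notNull(),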

📊 Customer Impact Analysis

9. API Consistency Breaking

  • If this affects customer-facing APIs, the scoping changes will break existing integrations
  • Dataset and evaluation APIs will have different authentication/authorization patterns than other resources
  • Critical: Need explicit strategy for backward compatibility

10. Missing Feature Documentation

  • No clear understanding of the evaluation workflow from schema alone
  • Complex many-to-many relationships without usage examples
  • Risk: Difficult for API consumers to understand proper usage patterns

Recommendations

Immediate Actions Required:

  1. Revert scoping changes - Use consistent tenantScoped/projectScoped patterns for ALL new tables
  2. Fix foreign key relationships - Maintain composite key relationships for proper tenant isolation
  3. Add validation schemas - Create Zod schemas for all new types in /packages/agents-core/src/validation/schemas.ts
  4. Create migration strategy - Document how existing data will be handled

Architectural Improvements:

  1. Simplify table names - Consider shorter, clearer names (e.g., eval_config_evaluators)
  2. Add proper typing - Replace Record<string, unknown> with proper typed interfaces
  3. Document evaluation workflow - Add comprehensive documentation for the evaluation system

Security & Best Practices:

  1. Tenant isolation audit - Ensure all new tables properly implement tenant boundaries
  2. Add database constraints - Consider additional constraints on status fields and relationships
  3. Performance indexing - Plan indexes for common query patterns
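
For item 3, a hedged sketch of what such indexes could look like in Drizzle's extra-config callback; the index and column names are assumptions about likely query patterns, not a prescription:

import { index } from 'drizzle-orm/sqlite-core';

// Inside the evalResult table definition's extra-config callback
(table) => ({
  suiteRunIdx: index('eval_result_suite_run_idx').on(table.suiteRunId),
  conversationIdx: index('eval_result_conversation_idx').on(table.conversationId),
})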

This evaluation framework addition is substantial but needs significant architectural consistency improvements before it can safely merge. The tenant scoping changes represent a critical security concern that must be addressed.

Verdict: ❌ Major revisions required - Security and consistency issues must be resolved.


@robert-inkeep
Collaborator

@claude can you review this again?

want to make sure the schemas and the relationships are clear

ignore the composite keys and the tenant scoping, ignore any indexing / perf concerns

@claude

claude bot commented Oct 28, 2025

Claude finished @robert-inkeep's task


Schema Review: Focus on Clarity and Relationships

I've reviewed the evaluation framework schema additions, focusing specifically on clarity and relationships as requested (ignoring composite keys, tenant scoping, and performance).

Todo List:

  • Read the schema changes in packages/agents-core/src/db/schema.ts
  • Map out the evaluation framework structure and relationships
  • Assess clarity of table names and field definitions
  • Review relationship definitions for clarity
  • Provide feedback focused on schema and relationship clarity

Clear Aspects

Well-Structured Core Concepts

The evaluation framework introduces a clear hierarchy:

  1. Dataset → DatasetItem (1:N) - Clear parent-child for test data
  2. Evaluator - Self-contained evaluation definition with prompt and schema
  3. EvalTestSuiteRun - Execution record linking dataset, agent, and config
  4. EvalResult - Individual evaluation outcomes

Good Field Definitions

  • Clear status enums: 'pending'|'done'|'failed' (lines 769, 822)
  • Well-typed configuration objects with clear structure:
    • simulationConfig with detailed agent simulation settings (lines 706-718)
    • conversationFilter with multiple filtering options (lines 853-860)
  • Explicit model configuration fields using ModelSettings type

🔍 Areas Needing Clarity

1. Relationship Ambiguity: Two Evaluation Paths

The schema appears to support two different evaluation workflows, but their relationship is unclear:

Path A: Test Suite Evaluations

  • evalTestSuiteRun → evalTestSuiteRunEvaluator → evaluator
  • For running predefined test suites against agents

Path B: Conversation Evaluations

  • conversationEvaluationConfig → conversationEvaluationConfigEvaluator → evaluator
  • For evaluating live conversations

Issue: evalResult serves both paths (it has both suiteRunId and conversationId), but this dual purpose isn't clear from the schema structure.

2. Optional vs Required Relationships

In evalResult (lines 815-846):

  • suiteRunId is optional (text('suite_run_id'))
  • datasetItemId is optional
  • conversationId is required

This suggests evalResult can be:

  • A test suite result (with suiteRunId + datasetItemId)
  • A conversation evaluation result (with just conversationId)
  • Both? Neither? The schema doesn't make the valid combinations clear.

Recommendation: Consider adding a discriminator field like evaluationType: 'suite' | 'conversation' or splitting into two tables.

3. Unclear Table Naming

Long Junction Table Name:

  • conversationEvaluationConfigEvaluator (line 872) is verbose
  • Consider: conv_eval_config_evaluator or conversation_eval_evaluators

Inconsistent Pluralization:

  • eval_test_suite_run_evaluators (plural) vs other junction tables
  • Consider adopting one consistent naming convention

4. Missing Relationship Context

evalTestSuiteRun.agent relationship (lines 767, 780-783):

  • References agents.id directly without tenant/project context
  • While you asked to ignore scoping, this makes the relationship intent unclear
  • Does this mean a test suite can run against agents from different projects?

5. Schema Storage Without Clear Purpose

evaluator.schema (line 737):

  • Stores as Record<string, unknown>
  • Purpose unclear: Is this for validating evaluation results? Defining expected output format?
  • Consider adding a comment or more specific type

datasetItem.expectedOutput (line 705):

  • Array of messages but relationship to evaluator.schema unclear
  • How do these work together during evaluation?

📋 Recommendations for Clarity

1. Add Discriminator for Dual-Purpose Tables

evalResult: {
  evaluationType: text('evaluation_type').$type<'suite' | 'conversation'>().notNull(),
  // This clarifies which fields should be populated
}

2. Document Complex Relationships

Add comments explaining the evaluation flow:

// Test suite evaluation flow:
// evalTestSuiteRun -> evalTestSuiteRunEvaluator -> evaluator -> evalResult
// Conversation evaluation flow:  
// conversationEvaluationConfig -> conversationEvaluationConfigEvaluator -> evaluator -> evalResult

3. Clarify Optional Field Combinations

Either:

  • Make fields required based on use case, or
  • Add check constraints to ensure valid combinations, or
  • Document valid field combinations in comments
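
One hedged way to encode the valid combinations at the API layer is a Zod discriminated union; the names here are assumptions that mirror the discriminator suggested above:

import { z } from 'zod';

// Hypothetical result payload schema: the discriminator pins down which fields must be present
export const EvalResultPayloadSchema = z.discriminatedUnion('evaluationType', [
  z.object({
    evaluationType: z.literal('suite'),
    suiteRunId: z.string(),
    datasetItemId: z.string(),
    conversationId: z.string(),
  }),
  z.object({
    evaluationType: z.literal('conversation'),
    conversationId: z.string(),
  }),
]);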

4. Simplify Table Names

  • conversationEvaluationConfigEvaluator → conv_eval_evaluators
  • Keep names under 30 characters for better readability

5. Type the Schema Fields

Replace Record<string, unknown> with a more specific type or interface that clarifies the expected structure of evaluation schemas.


Summary

The evaluation framework's core structure is sound and the relationships are technically correct. However, the dual-purpose nature of evalResult and the two parallel evaluation paths could benefit from clearer distinction. The schema would be more maintainable with:

  1. Clearer separation or documentation of the two evaluation workflows
  2. More explicit handling of optional field combinations
  3. Simplified naming for better readability
  4. Better typing for JSON schema fields

The relationships themselves are well-defined through proper foreign keys and the Drizzle relations are correctly configured.
