
Conversation

@shagun-singh-inkeep
Collaborator

No description provided.

@changeset-bot

changeset-bot bot commented Oct 28, 2025

⚠️ No Changeset found

Latest commit: 940d9df

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel

vercel bot commented Oct 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-manage-api | Ready | Preview | Comment | Oct 30, 2025 5:28pm |
| agents-manage-ui | Ready | Preview | Comment | Oct 30, 2025 5:28pm |
| agents-run-api | Ready | Preview | Comment | Oct 30, 2025 5:28pm |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| agents-docs | Skipped | Skipped | | Oct 30, 2025 5:28pm |

@claude

claude bot commented Oct 28, 2025

Claude finished @shagun-singh-inkeep's task


TypeScript Staff Engineer Review

I've completed a comprehensive review of this schema-evaluation PR. It introduces a substantial evaluation framework, but there are several concerning issues.

Todo List:

  • Read and analyze the schema changes in packages/agents-core/src/db/schema.ts
  • Research existing codebase patterns and related files for context
  • Evaluate technical implementation and code quality
  • Assess system architecture and design decisions
  • Review customer impact and API considerations
  • Provide structured feedback and recommendations

🚨 CRITICAL Issues

1. BREAKING CHANGE: Inconsistent Scoping Model

  • The PR fundamentally changes scoping patterns without justification
  • datasetItem and evalTestSuiteRun removed tenant scoping (tenantScoped → simple id primary key)
  • dataset, evaluator, evalTestSuiteConfig retain tenant scoping
  • This creates data isolation vulnerabilities where dataset items could theoretically be accessed across tenants

Reference: packages/agents-core/src/db/schema.ts:696-730, packages/agents-core/src/db/schema.ts:761-785
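
For illustration, here is a minimal sketch of the two patterns side by side. The table and column names are simplified stand-ins, not the actual definitions in schema.ts, and this assumes Drizzle's primaryKey helper for composite keys:

import { sqliteTable, text, primaryKey } from 'drizzle-orm/sqlite-core';

// Existing pattern: composite primary key keeps every row scoped to tenant + project
export const dataset = sqliteTable(
  'dataset',
  {
    tenantId: text('tenant_id').notNull(),
    projectId: text('project_id').notNull(),
    id: text('id').notNull(),
    name: text('name').notNull(),
  },
  (table) => ({
    pk: primaryKey({ columns: [table.tenantId, table.projectId, table.id] }),
  })
);

// Pattern introduced in this PR: a bare id primary key, with tenant isolation
// enforced only indirectly through the parent foreign key
export const datasetItem = sqliteTable('dataset_item', {
  id: text('id').primaryKey(),
  datasetId: text('dataset_id').notNull(),
});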

2. Broken Foreign Key Relationships

  • Relations updated to use simple field references instead of composite tenant+project keys
  • Example: evalResultRelations conversation lookup changed from [evalResult.tenantId, evalResult.projectId, evalResult.conversationId] to just [evalResult.conversationId]
  • Risk: Cross-tenant data access if conversation IDs collide
  • Impact: Security vulnerability in multi-tenant system
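
As a hedged sketch of the fix (this assumes evalResult keeps its tenantId and projectId columns and that a conversations table is defined in the same schema module):

import { relations } from 'drizzle-orm';

// Composite reference: the conversation lookup cannot cross tenant/project boundaries
export const evalResultRelations = relations(evalResult, ({ one }) => ({
  conversation: one(conversations, {
    fields: [evalResult.tenantId, evalResult.projectId, evalResult.conversationId],
    references: [conversations.tenantId, conversations.projectId, conversations.id],
  }),
}));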

🔧 System Architecture Issues

3. Inconsistent Design Patterns

  • Existing codebase follows consistent tenant scoping pattern across all tables
  • This PR breaks the pattern without establishing clear architectural guidelines
  • Risk: Future developers will be confused about when to use which pattern
  • Maintenance burden: Two different scoping models to maintain

4. Missing Validation Schema Integration

  • New tables reference ModelSettings type but no corresponding validation schemas created
  • No integration with existing validation patterns in /packages/agents-core/src/validation/schemas.ts
  • Risk: Runtime type errors, inconsistent API validation
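
For example, a hedged sketch of a Zod schema that could back the new evaluator table; the field names are assumptions based on the columns discussed in this review, not the actual definitions:

import { z } from 'zod';

// Hypothetical insert schema for the evaluator table; align field names with schema.ts
export const EvaluatorInsertSchema = z.object({
  id: z.string(),
  name: z.string().min(1),
  prompt: z.string().min(1),
  // JSON schema blob used to shape evaluation output
  schema: z.record(z.unknown()),
});

export type EvaluatorInsert = z.infer<typeof EvaluatorInsertSchema>;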

5. No Migration Strategy

  • No evidence of database migration files or strategy for existing data
  • Breaking change requires careful data migration planning

🔍 Technical Implementation Concerns

6. Relationship Complexity

// Overly complex evaluation configuration structure
export const conversationEvaluationConfigEvaluator = sqliteTable(
  'conversation_evaluation_config_evaluator', // Long table name
  {
    id: text('id').notNull(),
    conversationEvaluationConfigId: text('conversation_evaluation_config_id').notNull(),
    evaluatorId: text('evaluator_id').notNull(),
    // ...
  }
);
  • Junction table naming could be simplified
  • Consider conv_eval_config_evaluators for brevity while maintaining clarity

7. Inconsistent ID Strategy

  • Most tables use composite primary keys following tenant scoping
  • New tables mix simple id primary keys with foreign key references
  • Recommendation: Stick to one pattern consistently

8. JSON Schema Storage Without Validation

schema: blob('schema', { mode: 'json' }).$type<Record<string, unknown>>().notNull(),
  • Storing evaluation schemas as untyped JSON
  • No runtime validation ensures schema integrity
  • Risk: Invalid schemas could break evaluation pipelines
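
One hedged option: narrow the column's compile-time type with $type and keep runtime validation at the write path. The interface below is purely illustrative of the shape such a type could take:

// Illustrative shape for stored evaluator schemas; not an actual type in this repo
interface EvaluatorOutputSchema {
  type: 'object';
  properties: Record<string, unknown>;
  required?: string[];
}

// Compile-time narrowing only; writes should still be validated (e.g. with Zod) before insert
schema: blob('schema', { mode: 'json' }).$type<EvaluatorOutputSchema>().notNull(),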

📊 Customer Impact Analysis

9. API Consistency Breaking

  • If this affects customer-facing APIs, the scoping changes will break existing integrations
  • Dataset and evaluation APIs will have different authentication/authorization patterns than other resources
  • Critical: Need explicit strategy for backward compatibility

10. Missing Feature Documentation

  • No clear understanding of the evaluation workflow from schema alone
  • Complex many-to-many relationships without usage examples
  • Risk: Difficult for API consumers to understand proper usage patterns

Recommendations

Immediate Actions Required:

  1. Revert scoping changes - Use consistent tenantScoped/projectScoped patterns for ALL new tables
  2. Fix foreign key relationships - Maintain composite key relationships for proper tenant isolation
  3. Add validation schemas - Create Zod schemas for all new types in /packages/agents-core/src/validation/schemas.ts
  4. Create migration strategy - Document how existing data will be handled

Architectural Improvements:

  1. Simplify table names - Consider shorter, clearer names (e.g., eval_config_evaluators)
  2. Add proper typing - Replace Record<string, unknown> with proper typed interfaces
  3. Document evaluation workflow - Add comprehensive documentation for the evaluation system

Security & Best Practices:

  1. Tenant isolation audit - Ensure all new tables properly implement tenant boundaries
  2. Add database constraints - Consider additional constraints on status fields and relationships
  3. Performance indexing - Plan indexes for common query patterns
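
For item 3, a hedged sketch of what such indexes could look like in Drizzle's extra-config callback; the index and column names are assumptions about likely query patterns, not a prescription:

import { index } from 'drizzle-orm/sqlite-core';

// Inside the evalResult table definition's extra-config callback
(table) => ({
  suiteRunIdx: index('eval_result_suite_run_idx').on(table.suiteRunId),
  conversationIdx: index('eval_result_conversation_idx').on(table.conversationId),
})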

This evaluation framework addition is substantial but needs significant architectural consistency improvements before it can safely merge. The tenant scoping changes represent a critical security concern that must be addressed.

Verdict: ❌ Major revisions required - Security and consistency issues must be resolved.


@robert-inkeep
Collaborator

@claude can you review this again?

want to make sure the schemas and the relationships are clear

ignore the composite keys and the tenant scoping, ignore any indexing / perf concerns

@claude

claude bot commented Oct 28, 2025

Claude finished @robert-inkeep's task


Schema Review: Focus on Clarity and Relationships

I've reviewed the evaluation framework schema additions, focusing specifically on clarity and relationships as requested (ignoring composite keys, tenant scoping, and performance).

Todo List:

  • Read the schema changes in packages/agents-core/src/db/schema.ts
  • Map out the evaluation framework structure and relationships
  • Assess clarity of table names and field definitions
  • Review relationship definitions for clarity
  • Provide feedback focused on schema and relationship clarity

Clear Aspects

Well-Structured Core Concepts

The evaluation framework introduces a clear hierarchy:

  1. Dataset → DatasetItem (1:N) - Clear parent-child for test data
  2. Evaluator - Self-contained evaluation definition with prompt and schema
  3. EvalTestSuiteRun - Execution record linking dataset, agent, and config
  4. EvalResult - Individual evaluation outcomes

Good Field Definitions

  • Clear status enums: 'pending'|'done'|'failed' (lines 769, 822)
  • Well-typed configuration objects with clear structure:
    • simulationConfig with detailed agent simulation settings (lines 706-718)
    • conversationFilter with multiple filtering options (lines 853-860)
  • Explicit model configuration fields using ModelSettings type

🔍 Areas Needing Clarity

1. Relationship Ambiguity: Two Evaluation Paths

The schema appears to support two different evaluation workflows, but their relationship is unclear:

Path A: Test Suite Evaluations

  • evalTestSuiteRun → evalTestSuiteRunEvaluator → evaluator
  • For running predefined test suites against agents

Path B: Conversation Evaluations

  • conversationEvaluationConfig → conversationEvaluationConfigEvaluator → evaluator
  • For evaluating live conversations

Issue: evalResult serves both paths (it has both suiteRunId and conversationId), but this dual purpose isn't clear from the schema structure.

2. Optional vs Required Relationships

In evalResult (lines 815-846):

  • suiteRunId is optional (text('suite_run_id'))
  • datasetItemId is optional
  • conversationId is required

This suggests evalResult can be:

  • A test suite result (with suiteRunId + datasetItemId)
  • A conversation evaluation result (with just conversationId)
  • Both? Neither? The schema doesn't make the valid combinations clear.

Recommendation: Consider adding a discriminator field like evaluationType: 'suite' | 'conversation' or splitting into two tables.

3. Unclear Table Naming

Long Junction Table Name:

  • conversationEvaluationConfigEvaluator (line 872) is verbose
  • Consider: conv_eval_config_evaluator or conversation_eval_evaluators

Inconsistent Pluralization:

  • eval_test_suite_run_evaluators (plural) vs other junction tables
  • Consider adopting one consistent naming convention

4. Missing Relationship Context

evalTestSuiteRun.agent relationship (lines 767, 780-783):

  • References agents.id directly without tenant/project context
  • While you asked to ignore scoping, this makes the relationship intent unclear
  • Does this mean a test suite can run against agents from different projects?

5. Schema Storage Without Clear Purpose

evaluator.schema (line 737):

  • Stores as Record<string, unknown>
  • Purpose unclear: Is this for validating evaluation results? Defining expected output format?
  • Consider adding a comment or more specific type

datasetItem.expectedOutput (line 705):

  • Array of messages but relationship to evaluator.schema unclear
  • How do these work together during evaluation?

📋 Recommendations for Clarity

1. Add Discriminator for Dual-Purpose Tables

evalResult: {
  evaluationType: text('evaluation_type').$type<'suite' | 'conversation'>().notNull(),
  // This clarifies which fields should be populated
}

2. Document Complex Relationships

Add comments explaining the evaluation flow:

// Test suite evaluation flow:
// evalTestSuiteRun -> evalTestSuiteRunEvaluator -> evaluator -> evalResult
// Conversation evaluation flow:  
// conversationEvaluationConfig -> conversationEvaluationConfigEvaluator -> evaluator -> evalResult

3. Clarify Optional Field Combinations

Either:

  • Make fields required based on use case, or
  • Add check constraints to ensure valid combinations, or
  • Document valid field combinations in comments
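
One hedged way to encode the valid combinations at the API layer is a Zod discriminated union; the names here are assumptions that mirror the discriminator suggested above:

import { z } from 'zod';

// Hypothetical result payload schema: the discriminator pins down which fields must be present
export const EvalResultPayloadSchema = z.discriminatedUnion('evaluationType', [
  z.object({
    evaluationType: z.literal('suite'),
    suiteRunId: z.string(),
    datasetItemId: z.string(),
    conversationId: z.string(),
  }),
  z.object({
    evaluationType: z.literal('conversation'),
    conversationId: z.string(),
  }),
]);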

4. Simplify Table Names

  • conversationEvaluationConfigEvaluator → conv_eval_evaluators
  • Keep names under 30 characters for better readability

5. Type the Schema Fields

Replace Record<string, unknown> with a more specific type or interface that clarifies the expected structure of evaluation schemas.


Summary

The evaluation framework's core structure is sound and the relationships are technically correct. However, the dual-purpose nature of evalResult and the two parallel evaluation paths could benefit from clearer distinction. The schema would be more maintainable with:

  1. Clearer separation or documentation of the two evaluation workflows
  2. More explicit handling of optional field combinations
  3. Simplified naming for better readability
  4. Better typing for JSON schema fields

The relationships themselves are well-defined through proper foreign keys and the Drizzle relations are correctly configured.
