Skip to content

Conversation

@JasonWarrenUK
Copy link
Contributor

Domain Whitelist Configuration for Research

Overview

This PR implements a comprehensive domain whitelist system allowing users to control which websites Claude can access during curriculum generation. The feature works across both Metis (standalone modules) and Themis (course builder) with hierarchical configuration support.

Users can now:

  • Select from predefined domain lists (e.g., "AI Engineering" with 29 trusted sources)
  • Add custom domains with wildcard and subpath support
  • Choose "No Restrictions" for unrestricted research
  • Configure research hierarchically in Themis (course → arc → module levels)

Tip

No action required after pulling - all changes are backward compatible. Existing modules will continue using the default "AI Engineering" domain list.

Changes

Type System & Configuration Foundation

Restructured domain configuration from flat arrays to organized, extensible structures.

  • Created src/lib/types/config.ts with type definitions for domain lists, configs, and hierarchical research
  • Restructured src/lib/config/researchDomains.ts into categorized format (7 categories, 29 domains)
  • Maintained backward compatibility with AI_RESEARCH_DOMAINS_FLAT export
  • Added helper functions: extractDomainUrls(), getDomainListById(), getAllDomainLists()
Validation & Resolution Utilities

Created utilities for domain validation and configuration resolution.

  • Domain Validator (src/lib/utils/validation/domainValidator.ts):

    • Validates domains with wildcard support (*.github.com)
    • Supports subpaths (example.com/blog)
    • Batch validation with error tracking
  • Domain Resolver (src/lib/utils/research/domainResolver.ts):

    • Converts DomainConfig to domain arrays
    • Handles predefined lists, custom domains, and "no restrictions"
    • Provides human-readable descriptions for UI
  • Config Resolver (src/lib/utils/research/configResolver.ts):

    • Resolves Themis hierarchical config (Module → Arc → Course → Default)
    • Priority-based cascade with override support
Metis UI Components

Added domain configuration to standalone module workflow.

  • Created src/lib/components/metis/DomainSelector.svelte:

    • Dropdown for "AI Engineering" or "No Restrictions"
    • Custom domain input with real-time validation
    • Add/remove custom domains with visual feedback
  • Updated src/lib/components/metis/StructuredInputForm.svelte:

    • Integrated DomainSelector in "Advanced Research Options"
    • Conditional visibility when research enabled
    • Event handling for config changes
  • Updated src/lib/stores/metisStores.ts with default domain config

Metis API Integration

Wired domain configuration through Metis generation API.

  • Updated src/routes/api/metis/update/+server.ts:

    • Extract domainConfig from request body
    • Resolve domains using getDomains() utility
    • Pass resolved domains to withWebSearch()
    • Applied to both streaming and non-streaming paths
  • Updated src/lib/factories/agents/agentClientFactory.ts:

    • Handle empty array as "no restrictions"
    • Updated JSDoc with usage examples
Themis Hierarchical Configuration

Implemented three-level hierarchical research config in course builder.

  • Created src/lib/components/themis/ResearchConfigSelector.svelte:

    • Reusable component for course/arc/module levels
    • Adaptive labels based on context
    • Radio selection: 'all' | 'selective' | 'none'
    • Integrates DomainSelector when level = 'all'
  • Updated Themis components:

    • CourseConfigForm: Course-level configuration
    • ArcStructurePlanner: Per-arc config (conditional on course setting)
    • ModuleWithinArcPlanner: Per-module toggles (conditional on arc setting)
  • Updated src/lib/types/themis.ts with researchConfig fields

  • Updated Themis APIs:

    • Structure generation: src/routes/api/themis/generate/+server.ts
    • Module generation: src/routes/api/themis/module/+server.ts
Schema Updates

Extended Zod validation schemas for domain configuration.

  • Added DomainConfigSchema with custom validation refinement
  • Added ResearchConfigSchema for Themis hierarchy
  • Updated StructuredInputSchema, GenerateRequestSchema, ModuleGenerationRequestSchema
  • Added schemas for all Themis request/response types
Documentation

Comprehensive user and developer documentation.

  • Updated README.md:

    • New "Domain Configuration" section with examples
    • Explains predefined lists, custom domains, hierarchical config
    • Domain format examples (wildcards, subpaths)
  • Created docs/dev/architecture/domain-whitelist-adr.md:

    • Architecture Decision Record documenting all design choices
    • Rationale for key decisions (whole-list selection, empty array = unrestricted, etc.)
    • Implementation details and consequences
    • Future enhancement roadmap
  • Created docs/wip/domain-whitelist-implementation.md:

    • Complete implementation plan with 8 phases
    • Session-by-session progress tracking
    • Testing checklist and design decisions

Summary

Imagine you're a bouncer at an exclusive nightclub called "Research Plaza." Previously, you had a rigid velvet rope with exactly 40 people's names permanently stitched onto it - no flexibility, no changes, just those 40 folks forever. Now you've been upgraded to a fancy iPad with multiple guest lists! You can pick the "VIP Tech Crowd" list (29 carefully curated guests), add your own friends to the list on the fly ("Oh hey, *.github.com is cool, let them in"), or go full Burning Man and just remove the velvet rope entirely ("No Restrictions - everyone's invited!"). Plus, if you're managing multiple rooms (that's Themis), you can set different policies for each: the main ballroom follows house rules, but the jazz lounge can configure per-band, and the karaoke room lets each singer decide if they want backing vocals from the internet. It's like going from a dictator's guest list to a choose-your-own-adventure bouncer simulator.

- Add type definitions for domain lists (config.ts)
- Restructure AI_RESEARCH_DOMAINS to organized categories
- Maintain backward compatibility with AI_RESEARCH_DOMAINS_FLAT
- Add helper functions for domain list management
- Create implementation planning document

Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements)
Phase 1/8: Foundation for UI and API domain configuration
- Create domain validation utility with wildcard/subpath support
- Add domain resolution utility for config to array conversion
- Implement Themis hierarchical config resolver (Module→Arc→Course)
- Extend Zod schemas with DomainConfig and ResearchConfig
- Add domain validation to all Metis and Themis request schemas

Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements)
Phase 2/8: Validation layer for domain configuration
- Create DomainSelector component with dropdown and custom domain input
- Integrate into StructuredInputForm with conditional visibility
- Add domain validation utility (wildcards, subpaths, ports)
- Update stores with default domain config
- Add DomainConfigSchema to API validator

Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements)
Phase 3/8: First visible UI changes for domain configuratio
- Create domain resolver utility for config to array conversion
- Create Themis config resolver for hierarchical resolution
- Update API to extract and use domain configuration
- Handle unrestricted research (empty array = all domains)
- Add ResearchConfig schemas for Themis integration
- Apply domain filtering in both streaming and non-streaming paths

Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements)
Phase 4/8: Fully functional domain configuration in Metis
- Create ResearchConfigSelector component for all three levels
- Add research config UI to course, arc, and module forms
- Conditional display based on parent level selection
- Update Themis types with researchConfig fields
- Integrate domain resolution in structure generation API
- Add hierarchical config resolver utilities

Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements)
Phase 5/8: Hierarchical domain configuration in Themis (95% complete)

Note: Manual edit needed in src/routes/api/themis/module/+server.ts
See docs/wip/domain-whitelist-implementation.md for details
@JasonWarrenUK JasonWarrenUK merged commit 8565614 into main Oct 29, 2025
1 check passed
@JasonWarrenUK JasonWarrenUK deleted the metis/whitelist-definitions branch October 29, 2025 22:51
@claude
Copy link

claude bot commented Oct 29, 2025

Summary

This PR implements a comprehensive domain whitelist system that provides users with fine-grained control over web research during curriculum generation. The implementation is well-structured with excellent separation of concerns, thorough documentation, and a thoughtful hierarchical configuration system for Themis. The code demonstrates strong type safety with Zod validation and maintains backward compatibility. However, there are a few areas that would benefit from additional test coverage, error handling improvements, and minor refinements to the validation logic.

🎯 Code Quality & Best Practices

Excellent Architectural Decisions

Type Safety & Validation
The use of Zod schemas throughout (src/lib/schemas/apiValidator.ts) provides excellent runtime validation with compile-time type inference. The custom refinement for domain validation (lines 21-32) is particularly well done:

.refine((data) => {
  if (data.customDomains.length > 0) {
    return data.customDomains.every(domain => validateDomain(domain).valid);
  }
  return true;
}, { message: 'One or more custom domains are invalid' })

Separation of Concerns
The factory pattern for domain resolution is clean and maintainable:

  • domainValidator.ts - Pure validation logic
  • domainResolver.ts - Configuration resolution
  • configResolver.ts - Hierarchical cascade logic

Documentation
The ADR (docs/dev/architecture/domain-whitelist-adr.md) is exceptionally thorough, documenting design decisions, trade-offs, and future enhancements.

Areas for Improvement

1. Domain Validation Edge Cases (src/lib/utils/validation/domainValidator.ts:20-57)

The domain validation regex may not catch all edge cases:

  • International domain names (IDN) with Unicode characters
  • Ports are stripped but never validated (e.g., example.com:99999 would pass)
  • Path validation is minimal - doesn't check for invalid characters

Suggestion:

// Line 34: Add port validation
const [domainWithoutPort, port] = actualDomain.split(':');
if (port && (isNaN(+port) || +port < 1 || +port > 65535)) {
  return { valid: false, error: 'Invalid port number' };
}

// Consider adding URL path validation
const pathPart = trimmed.split('/').slice(1).join('/');
if (pathPart && !/^[a-zA-Z0-9-._~:/?#\[\]@!$&'()*+,;=%]*$/.test(pathPart)) {
  return { valid: false, error: 'Invalid path characters' };
}

2. Inconsistent Error Handling (src/lib/utils/research/domainResolver.ts:30-46)

When the default domain list is not found (line 43), the function returns hasRestrictions: false but includes an error. This is inconsistent - if there's an error, should research proceed unrestricted?

Suggestion:

if (!defaultList) {
  // Either throw an error or log a warning, but be explicit about the fallback behavior
  console.warn('Default domain list not found, proceeding with unrestricted research');
  return {
    domains: [],
    errors: [],
    hasRestrictions: false
  };
}

3. Component Accessibility (src/lib/components/metis/DomainSelector.svelte:138-142)

The remove button implementation is good with aria-label, but could be enhanced:

<button
  type="button"
  class="remove-domain"
  on:click={() => removeCustomDomain(domain)}
  aria-label="Remove {domain}"
  title="Remove {domain}"  <!-- Add tooltip -->
>&times;</button>

4. Magic Numbers (src/routes/api/metis/update/+server.ts:63)

The withWebSearch(model, 5, domains) hardcodes the max uses to 5. Consider extracting to a configuration constant:

// In a config file
export const WEB_SEARCH_MAX_USES = 5;

// In the API
model = withWebSearch(model, WEB_SEARCH_MAX_USES, domains);

5. Duplicate Domain Check Case Sensitivity (src/lib/components/metis/DomainSelector.svelte:44)

The duplicate check uses includes() on the normalized domain, but domains are already normalized to lowercase. The check should explicitly handle case-insensitive comparison:

if (domainConfig.customDomains.some(d => d.toLowerCase() === normalized.toLowerCase())) {
  validationError = "Domain already added";
  return;
}
🐛 Potential Issues

Critical

1. Race Condition in Domain Resolution (src/lib/utils/research/domainResolver.ts:96-103)

The deduplication logic creates a Set from existing domains then checks each custom domain. If two identical custom domains are added rapidly (unlikely but possible in UI), they could both pass the duplicate check:

const existingSet = new Set(domains.map(d => d.toLowerCase()));
normalizedCustom.forEach(domain => {
  if (!existingSet.has(domain.toLowerCase())) {
    domains.push(domain);
    existingSet.add(domain.toLowerCase()); // Add this line
  }
});

Moderate

2. Missing Null Check (src/lib/utils/research/domainResolver.ts:124-127)

The isUnrestricted() function assumes config?.customDomains exists, but the type allows it to be undefined:

export function isUnrestricted(config?: DomainConfig): boolean {
  if (!config || config.useList === null) {
    return (!config?.customDomains || config.customDomains.length === 0);
  }
  return false;
}

This works due to optional chaining, but would be clearer as:

return !config?.customDomains?.length;

3. Wildcard Domain Security (src/lib/utils/validation/domainValidator.ts:32-33)

The validator accepts wildcards like *.com which would match ALL .com domains. Consider adding validation:

// After line 33
if (hasWildcard && parts.length < 3) {
  return { valid: false, error: 'Wildcard must be more specific (e.g., *.example.com, not *.com)' };
}

4. Config Cascade Bug (src/lib/utils/research/configResolver.ts:44-57)

The cascade logic checks for level !== 'selective' but doesn't validate that the config has the necessary domainConfig when level === 'all'. This could cause issues downstream:

if (moduleConfig && moduleConfig.level !== 'selective') {
  if (moduleConfig.level === 'all' && !moduleConfig.domainConfig) {
    console.warn('Module config level is "all" but domainConfig is missing');
  }
  effectiveConfig = moduleConfig;
  resolvedFrom = 'module';
}

Minor

5. Empty String in Custom Domains Array

While the UI prevents adding empty strings, the API doesn't explicitly validate against them. The Zod schema should include:

customDomains: z.array(z.string().min(1, 'Domain cannot be empty')).default([])
⚡ Performance Considerations

Positive

1. Efficient Deduplication
Using Set for duplicate checking in domainResolver.ts:97 is the right approach for O(1) lookups.

2. Early Returns
Validation functions use early returns effectively, avoiding unnecessary computation.

Potential Improvements

1. Domain Extraction Efficiency (src/lib/config/researchDomains.ts)

The extractDomainUrls() function (not shown in the diff but referenced) likely iterates through all categories and sources. If called frequently, consider memoization:

import { memoize } from 'lodash'; // or implement simple memoization

export const extractDomainUrls = memoize((domainList: DomainList): Set<string> => {
  // existing implementation
});

2. Component Re-renders (src/lib/components/metis/DomainSelector.svelte:20)

The availableLists is computed on every component instantiation. Consider moving to module scope since it's static:

<script lang="ts" context="module">
  const AVAILABLE_LISTS = getAllDomainLists();
</script>

<script lang="ts">
  // Use AVAILABLE_LISTS instead of availableLists
</script>

3. Regex Compilation

The domain validation regex (domainValidator.ts:36) is compiled on every call. For frequently called validation, consider compiling once at module level:

const DOMAIN_REGEX = /^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)*$/;

export function validateDomain(domain: string): DomainValidationResult {
  // Use DOMAIN_REGEX
}

4. Array Spreads in Event Handlers (DomainSelector.svelte:50)

Using spread operators in event handlers creates new arrays:

domainConfig.customDomains = [...domainConfig.customDomains, normalized];

For Svelte's reactivity, this is necessary, but document that this is intentional for reactivity triggers.

🔒 Security Concerns

High Priority

1. SSRF Risk with Custom Domains

Users can add custom domains that could potentially be used for Server-Side Request Forgery (SSRF) attacks if the web search tool makes requests to internal networks. Consider:

  • Validation against private IP ranges:
function isPrivateIP(domain: string): boolean {
  const privateRanges = [
    /^localhost$/i,
    /^127\./, // 127.0.0.0/8
    /^10\./, // 10.0.0.0/8
    /^172\.(1[6-9]|2[0-9]|3[0-1])\./, // 172.16.0.0/12
    /^192\.168\./, // 192.168.0.0/16
    /^169\.254\./ // 169.254.0.0/16
  ];
  return privateRanges.some(regex => regex.test(domain));
}

// In validateDomain():
if (isPrivateIP(domainWithoutPort)) {
  return { valid: false, error: 'Private IP addresses are not allowed' };
}
  • Block metadata endpoints: Consider blocking AWS metadata (169.254.169.254), GCP metadata (metadata.google.internal), etc.

2. Wildcard Domain Exploitation

As mentioned, *.com or *.org would match thousands of domains. Add validation:

if (hasWildcard) {
  const remainingParts = domainWithoutPort.split('.');
  if (remainingParts.length < 2) {
    return { valid: false, error: 'Wildcard must include at least a domain and TLD (e.g., *.example.com)' };
  }
}

Medium Priority

3. XSS via Domain Display

Domain strings are displayed in the UI (DomainSelector.svelte:137). While Svelte escapes by default, ensure custom domains can't contain HTML/JavaScript:

// Add to validation
if (/<|>|&|"/.test(trimmed)) {
  return { valid: false, error: 'Domain cannot contain HTML characters' };
}

4. Rate Limiting Custom Domain Addition

No rate limiting on custom domain additions. A malicious user could add thousands of custom domains. Consider:

  • Max custom domains limit (e.g., 50)
  • Server-side validation in API endpoints
// In apiValidator.ts
customDomains: z.array(z.string()).max(50, 'Maximum 50 custom domains allowed').default([])

Low Priority

5. Domain List Injection

The useList field accepts any string. While getDomainListById() returns undefined for invalid IDs, explicitly validate against known list IDs:

const VALID_LIST_IDS = ['ai-engineering'] as const;

export const DomainConfigSchema = z.object({
  useList: z.enum([...VALID_LIST_IDS, null as any]).nullable(),
  customDomains: z.array(z.string()).default([])
});
✅ Test Coverage

Missing Test Coverage

This PR adds substantial functionality but does not include any tests. The following areas critically need test coverage:

1. Domain Validation (domainValidator.ts)

Priority: HIGH

describe('validateDomain', () => {
  it('should accept valid domains', () => {
    expect(validateDomain('example.com').valid).toBe(true);
    expect(validateDomain('sub.example.com').valid).toBe(true);
  });

  it('should accept wildcards', () => {
    expect(validateDomain('*.github.com').valid).toBe(true);
  });

  it('should accept subpaths', () => {
    expect(validateDomain('example.com/blog').valid).toBe(true);
  });

  it('should reject invalid formats', () => {
    expect(validateDomain('example').valid).toBe(false);
    expect(validateDomain('.com').valid).toBe(false);
    expect(validateDomain('example..com').valid).toBe(false);
    expect(validateDomain('*.com').valid).toBe(false); // Too broad
  });

  it('should reject domains with spaces', () => {
    expect(validateDomain('example .com').valid).toBe(false);
  });

  it('should normalize domains', () => {
    expect(validateDomain('EXAMPLE.COM').normalized).toBe('example.com');
  });
});

2. Domain Resolution (domainResolver.ts)

Priority: HIGH

describe('resolveDomainList', () => {
  it('should return AI Engineering list by default', () => {
    const result = resolveDomainList();
    expect(result.domains.length).toBeGreaterThan(0);
    expect(result.hasRestrictions).toBe(true);
  });

  it('should return empty array for no restrictions', () => {
    const result = resolveDomainList({ useList: null, customDomains: [] });
    expect(result.domains).toEqual([]);
    expect(result.hasRestrictions).toBe(false);
  });

  it('should merge predefined list with custom domains', () => {
    const result = resolveDomainList({
      useList: 'ai-engineering',
      customDomains: ['example.com']
    });
    expect(result.domains).toContain('example.com');
    expect(result.domains.length).toBeGreaterThan(1);
  });

  it('should deduplicate domains', () => {
    const result = resolveDomainList({
      useList: 'ai-engineering',
      customDomains: ['anthropic.com'] // Already in AI list
    });
    const uniqueDomains = new Set(result.domains);
    expect(uniqueDomains.size).toBe(result.domains.length);
  });

  it('should handle invalid custom domains', () => {
    const result = resolveDomainList({
      useList: 'ai-engineering',
      customDomains: ['invalid', 'example.com']
    });
    expect(result.errors.length).toBeGreaterThan(0);
    expect(result.domains).toContain('example.com');
  });
});

3. Hierarchical Config Resolution (configResolver.ts)

Priority: HIGH

describe('resolveModuleResearchConfig', () => {
  it('should prioritize module config over arc and course', () => {
    const result = resolveModuleResearchConfig(
      { level: 'none' },
      { level: 'all', domainConfig: { useList: 'ai-engineering', customDomains: [] } },
      { level: 'all', domainConfig: { useList: 'ai-engineering', customDomains: [] } }
    );
    expect(result.enabled).toBe(false);
    expect(result.resolvedFrom).toBe('module');
  });

  it('should cascade to arc config when module is selective', () => {
    const result = resolveModuleResearchConfig(
      { level: 'selective' },
      { level: 'all', domainConfig: { useList: 'ai-engineering', customDomains: [] } },
      { level: 'none' }
    );
    expect(result.enabled).toBe(true);
    expect(result.resolvedFrom).toBe('arc');
  });

  it('should use default when all configs are selective', () => {
    const result = resolveModuleResearchConfig(
      { level: 'selective' },
      { level: 'selective' },
      { level: 'selective' }
    );
    expect(result.enabled).toBe(true);
    expect(result.resolvedFrom).toBe('default');
    expect(result.domainConfig?.useList).toBe('ai-engineering');
  });
});

4. Component Testing (DomainSelector.svelte)

Priority: MEDIUM

Use @testing-library/svelte to test:

  • Domain list selection updates config
  • Custom domain addition with validation
  • Custom domain removal
  • Duplicate domain rejection
  • Validation error display

5. API Integration Tests

Priority: MEDIUM

describe('POST /api/metis/update', () => {
  it('should accept valid domain config', async () => {
    const response = await fetch('/api/metis/update', {
      method: 'POST',
      body: JSON.stringify({
        projectsData: mockProjectsXML,
        skillsData: mockSkillsXML,
        researchData: mockResearchXML,
        enableResearch: true,
        domainConfig: {
          useList: 'ai-engineering',
          customDomains: ['example.com']
        }
      })
    });
    expect(response.ok).toBe(true);
  });

  it('should reject invalid domain config', async () => {
    const response = await fetch('/api/metis/update', {
      method: 'POST',
      body: JSON.stringify({
        projectsData: mockProjectsXML,
        skillsData: mockSkillsXML,
        researchData: mockResearchXML,
        enableResearch: true,
        domainConfig: {
          useList: 'ai-engineering',
          customDomains: ['invalid'] // No TLD
        }
      })
    });
    expect(response.status).toBe(400);
  });
});

Recommended Testing Framework Setup

npm install -D vitest @testing-library/svelte @testing-library/jest-dom

Create vitest.config.ts:

import { defineConfig } from 'vitest/config';
import { svelte } from '@sveltejs/vite-plugin-svelte';

export default defineConfig({
  plugins: [svelte({ hot: !process.env.VITEST })],
  test: {
    globals: true,
    environment: 'jsdom',
    setupFiles: ['./vitest.setup.ts']
  }
});

Recommendations

Before Merging (High Priority)

  1. Add security validations for private IPs and overly broad wildcards
  2. Add rate limiting for custom domains (max 50 suggested)
  3. Fix the duplicate domain deduplication race condition
  4. Extract magic numbers to configuration constants

After Merging (Medium Priority)

  1. Comprehensive test suite covering validation, resolution, and hierarchical config
  2. Performance optimizations (regex compilation, memoization)
  3. Enhanced error handling with clearer fallback behaviors
  4. Accessibility improvements (tooltips, focus management)

Future Enhancements (Low Priority)

  1. Domain validation for international domain names (IDN)
  2. Analytics on which domains are actually used during research
  3. User-created domain lists (beyond just custom domains)

Overall Assessment

This is a well-designed and thoughtfully implemented feature that significantly enhances Rhea's flexibility. The hierarchical configuration system is particularly elegant, and the documentation is exemplary. The main concerns are around security (SSRF risks) and the absence of tests for this substantial new functionality.

Recommended action: Address security concerns and add basic test coverage before merging. The other improvements can be tackled in follow-up PRs.

Great work! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants