-
Notifications
You must be signed in to change notification settings - Fork 0
Domain Whitelist Configuration for Research #31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add type definitions for domain lists (config.ts) - Restructure AI_RESEARCH_DOMAINS to organized categories - Maintain backward compatibility with AI_RESEARCH_DOMAINS_FLAT - Add helper functions for domain list management - Create implementation planning document Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements) Phase 1/8: Foundation for UI and API domain configuration
- Create domain validation utility with wildcard/subpath support - Add domain resolution utility for config to array conversion - Implement Themis hierarchical config resolver (Module→Arc→Course) - Extend Zod schemas with DomainConfig and ResearchConfig - Add domain validation to all Metis and Themis request schemas Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements) Phase 2/8: Validation layer for domain configuration
- Create DomainSelector component with dropdown and custom domain input - Integrate into StructuredInputForm with conditional visibility - Add domain validation utility (wildcards, subpaths, ports) - Update stores with default domain config - Add DomainConfigSchema to API validator Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements) Phase 3/8: First visible UI changes for domain configuratio
- Create domain resolver utility for config to array conversion - Create Themis config resolver for hierarchical resolution - Update API to extract and use domain configuration - Handle unrestricted research (empty array = all domains) - Add ResearchConfig schemas for Themis integration - Apply domain filtering in both streaming and non-streaming paths Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements) Phase 4/8: Fully functional domain configuration in Metis
- Create ResearchConfigSelector component for all three levels - Add research config UI to course, arc, and module forms - Conditional display based on parent level selection - Update Themis types with researchConfig fields - Integrate domain resolution in structure generation API - Add hierarchical config resolver utilities Part of tasks 2.4, 2.5, 1.2.5 (domain whitelist enhancements) Phase 5/8: Hierarchical domain configuration in Themis (95% complete) Note: Manual edit needed in src/routes/api/themis/module/+server.ts See docs/wip/domain-whitelist-implementation.md for details
SummaryThis PR implements a comprehensive domain whitelist system that provides users with fine-grained control over web research during curriculum generation. The implementation is well-structured with excellent separation of concerns, thorough documentation, and a thoughtful hierarchical configuration system for Themis. The code demonstrates strong type safety with Zod validation and maintains backward compatibility. However, there are a few areas that would benefit from additional test coverage, error handling improvements, and minor refinements to the validation logic. 🎯 Code Quality & Best PracticesExcellent Architectural DecisionsType Safety & Validation ⭐ .refine((data) => {
if (data.customDomains.length > 0) {
return data.customDomains.every(domain => validateDomain(domain).valid);
}
return true;
}, { message: 'One or more custom domains are invalid' })Separation of Concerns ⭐
Documentation ⭐ Areas for Improvement1. Domain Validation Edge Cases ( The domain validation regex may not catch all edge cases:
Suggestion: // Line 34: Add port validation
const [domainWithoutPort, port] = actualDomain.split(':');
if (port && (isNaN(+port) || +port < 1 || +port > 65535)) {
return { valid: false, error: 'Invalid port number' };
}
// Consider adding URL path validation
const pathPart = trimmed.split('/').slice(1).join('/');
if (pathPart && !/^[a-zA-Z0-9-._~:/?#\[\]@!$&'()*+,;=%]*$/.test(pathPart)) {
return { valid: false, error: 'Invalid path characters' };
}2. Inconsistent Error Handling ( When the default domain list is not found (line 43), the function returns Suggestion: if (!defaultList) {
// Either throw an error or log a warning, but be explicit about the fallback behavior
console.warn('Default domain list not found, proceeding with unrestricted research');
return {
domains: [],
errors: [],
hasRestrictions: false
};
}3. Component Accessibility ( The remove button implementation is good with <button
type="button"
class="remove-domain"
on:click={() => removeCustomDomain(domain)}
aria-label="Remove {domain}"
title="Remove {domain}" <!-- Add tooltip -->
>×</button>4. Magic Numbers ( The // In a config file
export const WEB_SEARCH_MAX_USES = 5;
// In the API
model = withWebSearch(model, WEB_SEARCH_MAX_USES, domains);5. Duplicate Domain Check Case Sensitivity ( The duplicate check uses if (domainConfig.customDomains.some(d => d.toLowerCase() === normalized.toLowerCase())) {
validationError = "Domain already added";
return;
}🐛 Potential IssuesCritical1. Race Condition in Domain Resolution ( The deduplication logic creates a Set from existing domains then checks each custom domain. If two identical custom domains are added rapidly (unlikely but possible in UI), they could both pass the duplicate check: const existingSet = new Set(domains.map(d => d.toLowerCase()));
normalizedCustom.forEach(domain => {
if (!existingSet.has(domain.toLowerCase())) {
domains.push(domain);
existingSet.add(domain.toLowerCase()); // Add this line
}
});Moderate2. Missing Null Check ( The export function isUnrestricted(config?: DomainConfig): boolean {
if (!config || config.useList === null) {
return (!config?.customDomains || config.customDomains.length === 0);
}
return false;
}This works due to optional chaining, but would be clearer as: return !config?.customDomains?.length;3. Wildcard Domain Security ( The validator accepts wildcards like // After line 33
if (hasWildcard && parts.length < 3) {
return { valid: false, error: 'Wildcard must be more specific (e.g., *.example.com, not *.com)' };
}4. Config Cascade Bug ( The cascade logic checks for if (moduleConfig && moduleConfig.level !== 'selective') {
if (moduleConfig.level === 'all' && !moduleConfig.domainConfig) {
console.warn('Module config level is "all" but domainConfig is missing');
}
effectiveConfig = moduleConfig;
resolvedFrom = 'module';
}Minor5. Empty String in Custom Domains Array While the UI prevents adding empty strings, the API doesn't explicitly validate against them. The Zod schema should include: customDomains: z.array(z.string().min(1, 'Domain cannot be empty')).default([])⚡ Performance ConsiderationsPositive1. Efficient Deduplication ⭐ 2. Early Returns ⭐ Potential Improvements1. Domain Extraction Efficiency ( The import { memoize } from 'lodash'; // or implement simple memoization
export const extractDomainUrls = memoize((domainList: DomainList): Set<string> => {
// existing implementation
});2. Component Re-renders ( The <script lang="ts" context="module">
const AVAILABLE_LISTS = getAllDomainLists();
</script>
<script lang="ts">
// Use AVAILABLE_LISTS instead of availableLists
</script>3. Regex Compilation The domain validation regex ( const DOMAIN_REGEX = /^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?(\.[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?)*$/;
export function validateDomain(domain: string): DomainValidationResult {
// Use DOMAIN_REGEX
}4. Array Spreads in Event Handlers ( Using spread operators in event handlers creates new arrays: domainConfig.customDomains = [...domainConfig.customDomains, normalized];For Svelte's reactivity, this is necessary, but document that this is intentional for reactivity triggers. 🔒 Security ConcernsHigh Priority1. SSRF Risk with Custom Domains Users can add custom domains that could potentially be used for Server-Side Request Forgery (SSRF) attacks if the web search tool makes requests to internal networks. Consider:
function isPrivateIP(domain: string): boolean {
const privateRanges = [
/^localhost$/i,
/^127\./, // 127.0.0.0/8
/^10\./, // 10.0.0.0/8
/^172\.(1[6-9]|2[0-9]|3[0-1])\./, // 172.16.0.0/12
/^192\.168\./, // 192.168.0.0/16
/^169\.254\./ // 169.254.0.0/16
];
return privateRanges.some(regex => regex.test(domain));
}
// In validateDomain():
if (isPrivateIP(domainWithoutPort)) {
return { valid: false, error: 'Private IP addresses are not allowed' };
}
2. Wildcard Domain Exploitation As mentioned, if (hasWildcard) {
const remainingParts = domainWithoutPort.split('.');
if (remainingParts.length < 2) {
return { valid: false, error: 'Wildcard must include at least a domain and TLD (e.g., *.example.com)' };
}
}Medium Priority3. XSS via Domain Display Domain strings are displayed in the UI ( // Add to validation
if (/<|>|&|"/.test(trimmed)) {
return { valid: false, error: 'Domain cannot contain HTML characters' };
}4. Rate Limiting Custom Domain Addition No rate limiting on custom domain additions. A malicious user could add thousands of custom domains. Consider:
// In apiValidator.ts
customDomains: z.array(z.string()).max(50, 'Maximum 50 custom domains allowed').default([])Low Priority5. Domain List Injection The const VALID_LIST_IDS = ['ai-engineering'] as const;
export const DomainConfigSchema = z.object({
useList: z.enum([...VALID_LIST_IDS, null as any]).nullable(),
customDomains: z.array(z.string()).default([])
});✅ Test CoverageMissing Test CoverageThis PR adds substantial functionality but does not include any tests. The following areas critically need test coverage: 1. Domain Validation ( Priority: HIGH describe('validateDomain', () => {
it('should accept valid domains', () => {
expect(validateDomain('example.com').valid).toBe(true);
expect(validateDomain('sub.example.com').valid).toBe(true);
});
it('should accept wildcards', () => {
expect(validateDomain('*.github.com').valid).toBe(true);
});
it('should accept subpaths', () => {
expect(validateDomain('example.com/blog').valid).toBe(true);
});
it('should reject invalid formats', () => {
expect(validateDomain('example').valid).toBe(false);
expect(validateDomain('.com').valid).toBe(false);
expect(validateDomain('example..com').valid).toBe(false);
expect(validateDomain('*.com').valid).toBe(false); // Too broad
});
it('should reject domains with spaces', () => {
expect(validateDomain('example .com').valid).toBe(false);
});
it('should normalize domains', () => {
expect(validateDomain('EXAMPLE.COM').normalized).toBe('example.com');
});
});2. Domain Resolution ( Priority: HIGH describe('resolveDomainList', () => {
it('should return AI Engineering list by default', () => {
const result = resolveDomainList();
expect(result.domains.length).toBeGreaterThan(0);
expect(result.hasRestrictions).toBe(true);
});
it('should return empty array for no restrictions', () => {
const result = resolveDomainList({ useList: null, customDomains: [] });
expect(result.domains).toEqual([]);
expect(result.hasRestrictions).toBe(false);
});
it('should merge predefined list with custom domains', () => {
const result = resolveDomainList({
useList: 'ai-engineering',
customDomains: ['example.com']
});
expect(result.domains).toContain('example.com');
expect(result.domains.length).toBeGreaterThan(1);
});
it('should deduplicate domains', () => {
const result = resolveDomainList({
useList: 'ai-engineering',
customDomains: ['anthropic.com'] // Already in AI list
});
const uniqueDomains = new Set(result.domains);
expect(uniqueDomains.size).toBe(result.domains.length);
});
it('should handle invalid custom domains', () => {
const result = resolveDomainList({
useList: 'ai-engineering',
customDomains: ['invalid', 'example.com']
});
expect(result.errors.length).toBeGreaterThan(0);
expect(result.domains).toContain('example.com');
});
});3. Hierarchical Config Resolution ( Priority: HIGH describe('resolveModuleResearchConfig', () => {
it('should prioritize module config over arc and course', () => {
const result = resolveModuleResearchConfig(
{ level: 'none' },
{ level: 'all', domainConfig: { useList: 'ai-engineering', customDomains: [] } },
{ level: 'all', domainConfig: { useList: 'ai-engineering', customDomains: [] } }
);
expect(result.enabled).toBe(false);
expect(result.resolvedFrom).toBe('module');
});
it('should cascade to arc config when module is selective', () => {
const result = resolveModuleResearchConfig(
{ level: 'selective' },
{ level: 'all', domainConfig: { useList: 'ai-engineering', customDomains: [] } },
{ level: 'none' }
);
expect(result.enabled).toBe(true);
expect(result.resolvedFrom).toBe('arc');
});
it('should use default when all configs are selective', () => {
const result = resolveModuleResearchConfig(
{ level: 'selective' },
{ level: 'selective' },
{ level: 'selective' }
);
expect(result.enabled).toBe(true);
expect(result.resolvedFrom).toBe('default');
expect(result.domainConfig?.useList).toBe('ai-engineering');
});
});4. Component Testing ( Priority: MEDIUM Use
5. API Integration Tests Priority: MEDIUM describe('POST /api/metis/update', () => {
it('should accept valid domain config', async () => {
const response = await fetch('/api/metis/update', {
method: 'POST',
body: JSON.stringify({
projectsData: mockProjectsXML,
skillsData: mockSkillsXML,
researchData: mockResearchXML,
enableResearch: true,
domainConfig: {
useList: 'ai-engineering',
customDomains: ['example.com']
}
})
});
expect(response.ok).toBe(true);
});
it('should reject invalid domain config', async () => {
const response = await fetch('/api/metis/update', {
method: 'POST',
body: JSON.stringify({
projectsData: mockProjectsXML,
skillsData: mockSkillsXML,
researchData: mockResearchXML,
enableResearch: true,
domainConfig: {
useList: 'ai-engineering',
customDomains: ['invalid'] // No TLD
}
})
});
expect(response.status).toBe(400);
});
});Recommended Testing Framework Setupnpm install -D vitest @testing-library/svelte @testing-library/jest-domCreate import { defineConfig } from 'vitest/config';
import { svelte } from '@sveltejs/vite-plugin-svelte';
export default defineConfig({
plugins: [svelte({ hot: !process.env.VITEST })],
test: {
globals: true,
environment: 'jsdom',
setupFiles: ['./vitest.setup.ts']
}
});RecommendationsBefore Merging (High Priority)
After Merging (Medium Priority)
Future Enhancements (Low Priority)
Overall AssessmentThis is a well-designed and thoughtfully implemented feature that significantly enhances Rhea's flexibility. The hierarchical configuration system is particularly elegant, and the documentation is exemplary. The main concerns are around security (SSRF risks) and the absence of tests for this substantial new functionality. Recommended action: Address security concerns and add basic test coverage before merging. The other improvements can be tackled in follow-up PRs. Great work! 🎉 |
Domain Whitelist Configuration for Research
Overview
This PR implements a comprehensive domain whitelist system allowing users to control which websites Claude can access during curriculum generation. The feature works across both Metis (standalone modules) and Themis (course builder) with hierarchical configuration support.
Users can now:
Tip
No action required after pulling - all changes are backward compatible. Existing modules will continue using the default "AI Engineering" domain list.
Changes
Type System & Configuration Foundation
Restructured domain configuration from flat arrays to organized, extensible structures.
src/lib/types/config.tswith type definitions for domain lists, configs, and hierarchical researchsrc/lib/config/researchDomains.tsinto categorized format (7 categories, 29 domains)AI_RESEARCH_DOMAINS_FLATexportextractDomainUrls(),getDomainListById(),getAllDomainLists()Validation & Resolution Utilities
Created utilities for domain validation and configuration resolution.
Domain Validator (
src/lib/utils/validation/domainValidator.ts):*.github.com)example.com/blog)Domain Resolver (
src/lib/utils/research/domainResolver.ts):Config Resolver (
src/lib/utils/research/configResolver.ts):Metis UI Components
Added domain configuration to standalone module workflow.
Created
src/lib/components/metis/DomainSelector.svelte:Updated
src/lib/components/metis/StructuredInputForm.svelte:Updated
src/lib/stores/metisStores.tswith default domain configMetis API Integration
Wired domain configuration through Metis generation API.
Updated
src/routes/api/metis/update/+server.ts:getDomains()utilitywithWebSearch()Updated
src/lib/factories/agents/agentClientFactory.ts:Themis Hierarchical Configuration
Implemented three-level hierarchical research config in course builder.
Created
src/lib/components/themis/ResearchConfigSelector.svelte:Updated Themis components:
Updated
src/lib/types/themis.tswithresearchConfigfieldsUpdated Themis APIs:
src/routes/api/themis/generate/+server.tssrc/routes/api/themis/module/+server.tsSchema Updates
Extended Zod validation schemas for domain configuration.
DomainConfigSchemawith custom validation refinementResearchConfigSchemafor Themis hierarchyStructuredInputSchema,GenerateRequestSchema,ModuleGenerationRequestSchemaDocumentation
Comprehensive user and developer documentation.
Updated
README.md:Created
docs/dev/architecture/domain-whitelist-adr.md:Created
docs/wip/domain-whitelist-implementation.md:Summary
Imagine you're a bouncer at an exclusive nightclub called "Research Plaza." Previously, you had a rigid velvet rope with exactly 40 people's names permanently stitched onto it - no flexibility, no changes, just those 40 folks forever. Now you've been upgraded to a fancy iPad with multiple guest lists! You can pick the "VIP Tech Crowd" list (29 carefully curated guests), add your own friends to the list on the fly ("Oh hey,
*.github.comis cool, let them in"), or go full Burning Man and just remove the velvet rope entirely ("No Restrictions - everyone's invited!"). Plus, if you're managing multiple rooms (that's Themis), you can set different policies for each: the main ballroom follows house rules, but the jazz lounge can configure per-band, and the karaoke room lets each singer decide if they want backing vocals from the internet. It's like going from a dictator's guest list to a choose-your-own-adventure bouncer simulator.