Add Early Input Validation in CLI #16

Description

@GenCEO

The CLI accepts a config file but doesn't validate it until late in the execution pipeline. This means:

  • Users wait through expensive setup (model loading, parsing) only to fail on invalid config
  • Error messages appear after significant processing
  • No quick feedback loop

Current Flow

1. User runs: sdg generate config.yaml
2. Config file loaded
3. Document parsing starts (slow!)
4. Model initialization begins
5. ❌ ERROR: Invalid generation config

The user waits 2-3 minutes before seeing the error.

Proposed Flow

1. User runs: sdg generate config.yaml
2. Config validation runs first (< 1 second)
3. ❌ ERROR: Invalid generation config
4. Exit immediately, before any parsing or model loading

Implementation

Add Validation Command

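# In cli.py
# `app`, `console`, `SDGConfig`, and `estimate_token_usage` are assumed to be
# defined elsewhere in the project; ValidationError is assumed to be pydantic's.
from pathlib import Path

import typer
from pydantic import ValidationError

@app.command()
def validate(config_path: Path):
    """Validate configuration file without running."""
    try:
        config = SDGConfig.from_yaml(config_path)
    except ValidationError as e:
        console.print("[red]✗ Invalid configuration:[/red]")
        for error in e.errors():
            # error['loc'] is a tuple; join it into a readable field path
            field = " → ".join(str(loc) for loc in error['loc'])
            console.print(f"  {field}: {error['msg']}")
        # Use typer.Exit for consistency with the generate command
        raise typer.Exit(1)

    console.print("[green]✓ Configuration is valid[/green]")

    # Show warnings
    if config.generation.num_samples > 10000:
        console.print("[yellow]⚠ Large sample count may take hours[/yellow]")

    # Estimate resources
    estimated_tokens = estimate_token_usage(config)
    estimated_cost = estimated_tokens * 0.00001  # Example pricing
    console.print(f"[blue]Estimated cost: ${estimated_cost:.2f}[/blue]")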
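
Both commands print validation errors from a list of dicts with `loc` and `msg` keys. A minimal sketch of a shared formatter follows, assuming that error shape; `format_errors` is a hypothetical helper name, not something that exists in the codebase.

```python
from typing import Any


def format_errors(errors: list[dict[str, Any]]) -> list[str]:
    """Turn error dicts with 'loc' and 'msg' keys into 'a → b: message' lines."""
    lines = []
    for error in errors:
        # 'loc' is a sequence of keys/indices leading to the offending field.
        field = " → ".join(str(part) for part in error["loc"])
        lines.append(f"{field}: {error['msg']}")
    return lines


if __name__ == "__main__":
    sample = [
        {"loc": ("generation", "num_samples"), "msg": "field required"},
        {"loc": ("task", "method"), "msg": "field required"},
    ]
    for line in format_errors(sample):
        print(line)
```

Keeping the formatting in one pure function would let `validate` and `generate` report errors identically and makes the output easy to unit test.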

Enhance Main Command

@app.command()
def generate(
    config_path: Path,
    validate_only: bool = typer.Option(False, "--validate-only", help="Only validate config")
):
    """Generate synthetic data."""

    # Early validation
    try:
        config = SDGConfig.from_yaml(config_path)
    except ValidationError as e:
        console.print("[red]Configuration errors:[/red]")
        for error in e.errors():
            field = " → ".join(str(loc) for loc in error['loc'])
            console.print(f"  [yellow]{field}[/yellow]: {error['msg']}")
        raise typer.Exit(1)

    if validate_only:
        console.print("[green]✓ Configuration is valid[/green]")
        raise typer.Exit(0)

    # Check prerequisites
    if config.task.method == "local" and not config.task.document_path:
        console.print("[red]Error: document_path required for local method[/red]")
        raise typer.Exit(1)

    if config.task.method == "web" and not config.task.dataset_id:
        console.print("[red]Error: dataset_id required for web method[/red]")
        raise typer.Exit(1)

    # File existence checks
    if config.task.method == "local":
        doc_path = Path(config.task.document_path)
        if not doc_path.exists():
            console.print(f"[red]Error: Document not found: {doc_path}[/red]")
            raise typer.Exit(1)

    # Continue with execution...
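
The prerequisite and file-existence checks above exit on the first failure. One possible variation, sketched below, collects every problem in a pure function so the user can fix them all in one pass. `TaskConfig` here is a hypothetical stand-in for `config.task` (field names taken from the snippet above), and `check_prerequisites` is an assumed helper name.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class TaskConfig:
    """Hypothetical stand-in for config.task; not the project's real class."""
    method: str
    document_path: Optional[str] = None
    dataset_id: Optional[str] = None


def check_prerequisites(task: TaskConfig) -> list[str]:
    """Return a list of error messages; an empty list means all checks passed."""
    errors = []
    if task.method == "local":
        if not task.document_path:
            errors.append("document_path required for local method")
        elif not Path(task.document_path).exists():
            # Only check the file once we know a path was actually given.
            errors.append(f"Document not found: {task.document_path}")
    elif task.method == "web":
        if not task.dataset_id:
            errors.append("dataset_id required for web method")
    return errors
```

The command body could then print each message and `raise typer.Exit(1)` if the list is non-empty.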

Add Pre-flight Checks

import os
import shutil
from pathlib import Path

import torch

def validate_environment(config: SDGConfig) -> list[str]:
    """Check environment prerequisites.

    Returns non-fatal warnings; raises ValueError for a missing API key.
    """
    warnings = []

    # Check GPU availability
    if config.model.device == "cuda" and not torch.cuda.is_available():
        warnings.append("CUDA requested but not available. Falling back to CPU.")

    # Check API keys
    if config.model.provider == "openai" and not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY not set")

    if config.model.provider == "anthropic" and not os.getenv("ANTHROPIC_API_KEY"):
        raise ValueError("ANTHROPIC_API_KEY not set")

    # Check disk space. shutil.disk_usage is portable (os.statvfs is Unix-only),
    # and we fall back to the working directory if .cache does not exist yet.
    cache_dir = Path(".cache")
    usage_path = cache_dir if cache_dir.exists() else Path.cwd()
    free_gb = shutil.disk_usage(usage_path).free / (1024**3)
    if free_gb < 10:
        warnings.append(f"Low disk space: {free_gb:.1f}GB free")

    return warnings
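
The disk-space check needs a path that exists, and the cache directory may not have been created on a first run. A minimal sketch of a standalone helper follows, using only the standard library; `free_disk_gb` and `disk_space_warning` are hypothetical names, and the 10 GB threshold simply mirrors the snippet above.

```python
import shutil
from pathlib import Path
from typing import Optional


def free_disk_gb(path: Path) -> float:
    """Free space in GiB on the filesystem that holds (or would hold) path."""
    probe = path.resolve()
    # Walk up to the nearest existing ancestor; the filesystem root always exists.
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / (1024 ** 3)


def disk_space_warning(path: Path, min_free_gb: float = 10.0) -> Optional[str]:
    """Return a warning string if free space is below the threshold, else None."""
    free_gb = free_disk_gb(path)
    if free_gb < min_free_gb:
        return f"Low disk space: {free_gb:.1f}GB free"
    return None
```

Calling `disk_space_warning(Path(".cache"))` would then work whether or not the directory has been created.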

Example Output

$ sdg generate config.yaml

⚙ Validating configuration...
✓ Config structure valid
✓ Model configuration valid
✓ Task configuration valid

⚠ Warnings:
  • Large sample count (10000) - estimated 2-3 hours
  • GPU not available - using CPU (slower)

💰 Cost Estimate:
  • Tokens: ~500,000
  • Estimated cost: $5.00 (gpt-4)

📊 Resource Estimate:
  • Time: 2-3 hours
  • Disk space: ~2GB

Continue? [y/N]:
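
The cost lines and the confirmation prompt in this output can be produced by small pure functions. The sketch below assumes the flat example rate of $0.00001 per token used earlier in this issue; `format_cost_estimate` and `confirmed` are hypothetical helper names.

```python
def format_cost_estimate(tokens: int, price_per_token: float, model: str) -> list[str]:
    """Build the two 'Cost Estimate' lines shown in the example output."""
    cost = tokens * price_per_token
    return [
        f"Tokens: ~{tokens:,}",
        f"Estimated cost: ${cost:.2f} ({model})",
    ]


def confirmed(answer: str) -> bool:
    """Interpret a reply to 'Continue? [y/N]'; anything but y/yes means no."""
    return answer.strip().lower() in {"y", "yes"}


if __name__ == "__main__":
    for line in format_cost_estimate(500_000, 0.00001, "gpt-4"):
        print(f"  • {line}")
```

Defaulting to "no" on an empty reply matches the capital `N` in the prompt, so pressing Enter does not start a multi-hour run by accident.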
