Name	Name	Last commit message	Last commit date
parent directory ..
src	src
LICENSE	LICENSE
README.md	README.md
package.json	package.json
tsconfig.json	tsconfig.json
vitest.config.ts	vitest.config.ts

@reaatech/classifier-evals-cli

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

CLI tool for classifier evaluation. Run evaluations, compare models, check regression gates, run LLM-as-judge, and export results — all from the command line. Built on Commander.js and the full @reaatech/classifier-evals-* ecosystem.

Installation

npm install -g @reaatech/classifier-evals-cli
# or
pnpm add @reaatech/classifier-evals-cli

Quick Start

# Run a full evaluation
classifier-evals eval --dataset test-set.csv --format json --output results.json

# Compare two models
classifier-evals compare --baseline results/v1.json --candidate results/v2.json

# Check regression gates
classifier-evals gates --results results/latest.json --gates gates.yaml

# Run LLM-as-judge
classifier-evals judge --dataset test-set.csv --model claude-opus --budget 10.00

# Export a report
classifier-evals export --results results/latest.json --format html --output report.html

Commands

`eval`

Run a full classifier evaluation against a dataset.

classifier-evals eval --dataset test-set.csv [options]

Option	Type	Default	Description
`--dataset`	`string`	(required)	Path to dataset file (CSV, JSON, JSONL)
`--format`	`"json" \| "html"`	`"json"`	Output format
`--output`	`string`	—	Output file path (writes to stdout if omitted)
`--name`	`string`	—	Dataset display name (defaults to filename)

Loads the dataset, computes the confusion matrix and all 14 classification metrics, builds an EvalRun, and exports the results.

`compare`

Compare two model evaluation results with statistical significance.

classifier-evals compare --baseline results/v1.json --candidate results/v2.json

Option	Type	Default	Description
`--baseline`	`string`	(required)	Path to baseline evaluation results JSON
`--candidate`	`string`	(required)	Path to candidate evaluation results JSON
`--output`	`string`	—	Output file path (writes to stdout if omitted)

Computes accuracy difference, McNemar's test p-value, Cohen's d effect size, and per-class F1 comparison.

`gates`

Evaluate regression gates against evaluation results.

classifier-evals gates --results results/latest.json --gates gates.yaml [options]

Option	Type	Default	Description
`--results`	`string`	(required)	Path to evaluation results JSON
`--gates`	`string`	(required)	Path to gate configuration YAML
`--baseline`	`string`	—	Path to baseline results for comparison gates
`--output`	`string`	—	Output file path
`--format`	`"text" \| "junit"`	`"text"`	Output format (text or JUnit XML)

Exits with code 1 if any gate fails, 0 if all pass. Suitable for CI pipelines.

`judge`

Run LLM-as-judge on samples with cost tracking.

classifier-evals judge --dataset test-set.csv --model claude-opus [options]

Option	Type	Default	Description
`--dataset`	`string`	(required)	Path to dataset file
`--model`	`string`	(required)	LLM model for judging
`--consensus`	`number`	`1`	Number of judges for consensus voting
`--budget`	`number`	`50.00`	Maximum budget in USD
`--template`	`string`	`"classification-eval"`	Prompt template name
`--output`	`string`	—	Output file path

Evaluates each sample, applies consensus voting if --consensus > 1, tracks costs in real-time, and exports the judged results.

`export`

Generate a report from evaluation results.

classifier-evals export --results results/latest.json --format html --output report.html

Option	Type	Default	Description
`--results`	`string`	(required)	Path to evaluation results JSON
`--format`	`"json" \| "html"`	`"json"`	Report format
`--output`	`string`	—	Output file path
`--phoenix`	`string`	—	Phoenix endpoint URL
`--langfuse`	—	—	Export to Langfuse (uses env vars for auth)

Supports HTML reports with inline SVGs, JSON output, Phoenix traces, and Langfuse ingestion.

CI Integration

GitHub Actions

name: Classifier Evaluation
on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile
      - run: pnpm build

      - name: Run evaluation
        run: |
          mkdir -p results
          classifier-evals eval \
            --dataset datasets/examples/sample.csv \
            --format json \
            --output results/latest.json

      - name: Check gates
        run: |
          classifier-evals gates \
            --results results/latest.json \
            --gates datasets/examples/gates.yaml

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
          retention-days: 30

Exit Codes

Code	Meaning
`0`	Success — all gates passed, or comparison completed
`1`	Gate failure — one or more regression gates did not pass
`2`	Error — invalid arguments, missing files, or runtime error

Usage Patterns

Full Pipeline

# 1. Evaluate
classifier-evals eval --dataset production.csv --format json --output prod.json

# 2. Check gates
classifier-evals gates --results prod.json --gates production-gates.yaml

# 3. Generate HTML report
classifier-evals export --results prod.json --format html --output report.html

# 4. Compare against previous release
classifier-evals compare --baseline prod-v1.json --candidate prod.json

LLM-as-Judge Pipeline

# Judge misclassifications with multiple models
classifier-evals judge \
  --dataset errors.csv \
  --model claude-opus \
  --consensus 3 \
  --budget 25.00 \
  --output judged-results.json

Related Packages

@reaatech/classifier-evals — Core types and schemas
@reaatech/classifier-evals-dataset — Dataset loading
@reaatech/classifier-evals-metrics — Confusion matrix and metrics
@reaatech/classifier-evals-judge — LLM-as-judge
@reaatech/classifier-evals-gates — Regression gates
@reaatech/classifier-evals-exporters — JSON, HTML, Phoenix, Langfuse exporters
@reaatech/classifier-evals-mcp-server — MCP server alternative

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

@reaatech/classifier-evals-cli

Installation

Quick Start

Commands

`eval`

`compare`

`gates`

`judge`

`export`

CI Integration

GitHub Actions

Exit Codes

Usage Patterns

Full Pipeline

LLM-as-Judge Pipeline

Related Packages

License

FilesExpand file tree

cli

Directory actions

More options

Directory actions

More options

Latest commit

History

cli

Folders and files

parent directory

README.md

@reaatech/classifier-evals-cli

Installation

Quick Start

Commands

eval

compare

gates

judge

export

CI Integration

GitHub Actions

Exit Codes

Usage Patterns

Full Pipeline

LLM-as-Judge Pipeline

Related Packages

License

`eval`

`compare`

`gates`

`judge`

`export`