LLM testing for Go. Extends `go test` with eval scorers, no new tools to learn.
```sh
go get github.com/adamwoolhether/llmtest
```

Already using `strings.Contains` in your tests? Replace it with `llmtest.Contains`. Same logic, now with structured results, scorer metrics, and a path to LLM-as-judge when you're ready.
```go
package myapp_test

import (
	"testing"

	"github.com/adamwoolhether/llmtest"
)

func TestGreeting(t *testing.T) {
	llmtest.Run(t, "polite", func(e *llmtest.E) {
		e.Case(llmtest.TestCase{
			Input:        "Say hello",
			ActualOutput: callMyLLM("Say hello"),
		})
		e.Require(llmtest.Contains("hello"))
		e.Check(llmtest.LengthBetween(1, 500))
	})
}
```

Run with:

```sh
LLMTEST=1 go test ./...
```

Tests are skipped unless `LLMTEST=1` is set, so evals never run accidentally during normal development.
No API key needed. Run evals locally with Ollama:

```sh
# Install Ollama and pull a model
ollama pull llama3.2

# Run your eval tests (llmtest auto-detects Ollama)
LLMTEST=1 LLMTEST_PROVIDER=ollama go test -v ./...
```

Free, private, runs on your laptop. Swap to OpenAI or Anthropic later by setting the API key.
```go
Run(t, name, func(e *E) {
	e.Case(tc)        // set the test case
	e.Config(opts...) // optional: set eval-level options
	e.Require(scorer) // hard constraint: stops on failure
	e.Check(scorer)   // soft constraint: records but continues
})
```
Require vs Check: `Require` calls `t.FailNow` on failure, so the test stops immediately. `Check` marks the test as failed but does not stop it; remaining scorers still execute. Use `Require` for hard constraints and `Check` for soft/informational metrics.

`Run` does not call `t.Parallel()`. Call it on `t` before `Run` if you want parallel subtests.
Every scorer returns one of three verdicts:

| Verdict | Score | Meaning |
|---|---|---|
| Pass | 1.0 | Criterion fully met |
| Partial | 0.5 | Criterion partially met |
| Fail | 0.0 | Criterion not met |

Deterministic scorers only return Pass or Fail. The Partial verdict is used by LLM-based scorers like `Rubric`.
| Field | Type | Description |
|---|---|---|
| `Input` | `string` | The prompt sent to the LLM under test |
| `ActualOutput` | `string` | The LLM's response (the text being evaluated) |
| `ExpectedOutput` | `string` | Reference/ideal answer (optional) |
| `Context` | `[]string` | Background info given to the LLM (optional) |
| `RetrievalContext` | `[]string` | Documents retrieved by a RAG pipeline (optional) |
| `Metadata` | `map[string]any` | Arbitrary data for custom scorers |
These run locally with no network calls.

| Scorer | Description |
|---|---|
| `Contains(s)` | Output contains substring `s` |
| `ContainsAll(s...)` | Output contains every substring |
| `ContainsAny(s...)` | Output contains at least one substring |
| `NotContains(s)` | Output does not contain `s` |
| `NotContainsAny(s...)` | Output contains none of the substrings |
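The multi-substring scorers are plain substring checks. A minimal sketch of the `ContainsAll`/`ContainsAny` logic using the standard library (the lowercase function names here are illustrative, not the library's internals):

```go
package main

import (
	"fmt"
	"strings"
)

// containsAll reports whether output contains every substring.
func containsAll(output string, subs ...string) bool {
	for _, s := range subs {
		if !strings.Contains(output, s) {
			return false
		}
	}
	return true
}

// containsAny reports whether output contains at least one substring.
func containsAny(output string, subs ...string) bool {
	for _, s := range subs {
		if strings.Contains(output, s) {
			return true
		}
	}
	return false
}

func main() {
	out := "Hello there, how can I help?"
	fmt.Println(containsAll(out, "Hello", "help"))  // true
	fmt.Println(containsAny(out, "goodbye", "help")) // true
}
```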
| Scorer | Description |
|---|---|
| `MatchesRegex(p)` | Output matches regex pattern `p` |
| `NotMatchesRegex(p)` | Output does not match regex pattern `p` |
| Scorer | Description |
|---|---|
| `IsJSON()` | Output is valid JSON |
| `MatchesJSONSchema(s)` | Output conforms to JSON Schema `s` (reports schema errors at scoring time) |
| `MustMatchJSONSchema(s)` | Same, but panics on an invalid schema (compile-time safety) |
| Scorer | Description |
|---|---|
| `LengthBetween(min, max)` | Output byte length is in `[min, max]` |
| `ContainsTag(tag)` | Output has a `<tag>...</tag>` pair |
| `ExtractTag(tag, inner)` | Extract tag content, then apply `inner` scorer |

`ExtractTag` enables composition: extract content from a tag, then evaluate it:

```go
// Verify the <answer> tag contains "42"
llmtest.ExtractTag("answer", llmtest.Contains("42"))
```

`Rubric` is an LLM-based scorer that sends the test case to a judge model for criterion evaluation.
```go
e.Require(llmtest.Rubric("Response is polite and professional"))
```

The judge returns one of three verdicts:

- Pass (1.0): criterion fully met
- Partial (0.5): criterion partially met
- Fail (0.0): criterion not met
| Option | Description |
|---|---|
| `Model(m)` | Override the judge model for this scorer |
| `Provider(p)` | Override the provider for this scorer |
| `Threshold(t)` | Minimum score to pass (default 1.0; set to 0.5 to accept Partial) |
| `ConsistencyCheck(n)` | Run `n` times, take the majority verdict |
```go
e.Require(llmtest.Rubric("Is polite",
	llmtest.Model("gpt-4o"),
	llmtest.Threshold(0.5),
))
```

LLM calls include automatic rate-limit retry with backoff and JSON repair (re-prompts if the judge returns invalid JSON).
LLM-based scorers resolve model and provider in this order:

1. Scorer-level: `Model()`, `Provider()` options on individual scorers
2. Eval-level: `EvalModel()`, `EvalProvider()` via `e.Config()`
3. Environment: `LLMTEST_MODEL`, `LLMTEST_PROVIDER`
4. Auto-detection: probe API keys / local services
| Provider | Constructor | Default Model | Required Env Var |
|---|---|---|---|
| OpenAI | `OpenAI()` | `gpt-4.1-mini` | `OPENAI_API_KEY` |
| Anthropic | `Anthropic()` | `claude-sonnet-4-5-20250929` | `ANTHROPIC_API_KEY` |
| Ollama | `Ollama()` | `llama3.2` | (none, local service) |
Auto-detection order when `LLMTEST_PROVIDER` is unset:

1. `OPENAI_API_KEY` present → OpenAI
2. `ANTHROPIC_API_KEY` present → Anthropic
3. Ollama reachable → Ollama

Most users don't need to import a provider directly. Just set the environment variable.
| Variable | Description | Default |
|---|---|---|
| `LLMTEST` | Set to `1` to enable eval tests | (unset = skip all) |
| `LLMTEST_PROVIDER` | Provider: `openai`, `anthropic`, or `ollama` | auto-detect |
| `LLMTEST_MODEL` | Override default model | provider default |
| `LLMTEST_CONCURRENCY` | Max parallel LLM calls | `5` |
| `LLMTEST_NO_CACHE` | Set to `1` to bypass response cache | (unset = cache enabled) |
| `LLMTEST_OUTPUT` | Path to write JSON summary | (unset = no file) |
| `LLMTEST_OLLAMA_URL` | Ollama endpoint | `http://localhost:11434` |
```go
cases := []struct {
	name   string
	input  string
	output string
}{
	{"greeting", "Say hi", "Hello there!"},
	{"farewell", "Say bye", "Goodbye!"},
}

for _, tc := range cases {
	llmtest.Run(t, tc.name, func(e *llmtest.E) {
		e.Case(llmtest.TestCase{
			Input:        tc.input,
			ActualOutput: tc.output,
		})
		e.Require(llmtest.LengthBetween(1, 200))
	})
}
```

Call `Run` directly in the loop. Do not wrap it in another `t.Run`, or you'll get double-nested test names.
```go
llmtest.Run(t, "structured_response", func(e *llmtest.E) {
	e.Case(llmtest.TestCase{
		Input:        "Explain Go interfaces",
		ActualOutput: response,
	})
	e.Require(llmtest.IsJSON())
	e.Require(llmtest.Contains("interface"))
	e.Check(llmtest.LengthBetween(100, 2000))
	e.Check(llmtest.Rubric("Explanation is clear and accurate"))
})
```

Collect structured results by setting `LLMTEST_OUTPUT` and calling `Flush` from `TestMain`:
```go
func TestMain(m *testing.M) {
	code := m.Run()
	llmtest.Flush()
	os.Exit(code)
}
```

```sh
LLMTEST=1 LLMTEST_OUTPUT=results.json go test ./...
```

- LLM responses are cached to disk in a `.llmtest-cache/` directory (created in the working directory). Cache keys are derived from scorer name, prompt, and model. Entries expire after 24 hours. Add `.llmtest-cache/` to your `.gitignore`.
- Set `LLMTEST_NO_CACHE=1` to bypass both reads and writes.
- Concurrent LLM calls are limited by `LLMTEST_CONCURRENCY` (default: 5).
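The exact key scheme is internal, but "derived from scorer name, prompt, and model" suggests a content hash: change any input and the key changes, so stale entries are never reused. An illustrative sketch (llmtest's real scheme may differ):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cacheKey hashes the inputs that determine an LLM judge response.
// (Illustrative only; not llmtest's actual key derivation.)
func cacheKey(scorer, prompt, model string) string {
	h := sha256.New()
	for _, part := range []string{scorer, prompt, model} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	k1 := cacheKey(`Rubric("Is polite")`, "Say hello", "gpt-4.1-mini")
	k2 := cacheKey(`Rubric("Is polite")`, "Say hello", "llama3.2")
	fmt.Println(k1 != k2) // different model → different key: true
}
```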
Every scorer call emits structured key-value attributes via `testing.T.Attr` (Go 1.25+):

| Attribute Key | Value |
|---|---|
| `llmtest.scorer` | Scorer name, e.g. `Contains("hello")` |
| `llmtest.verdict` | `PASS`, `PARTIAL`, or `FAIL` |
| `llmtest.score` | Numeric score: `1.00`, `0.50`, `0.00` |
| `llmtest.reason` | Human-readable explanation |
| `llmtest.tokens` | LLM tokens consumed (0 for deterministic scorers) |
| `llmtest.latency_ms` | Scorer wall-clock time in ms |
Attributes are visible in `go test -v` output and machine-readable via `go test -json`. Eval results are structured data inside `go test`, not a separate tool.
Run evals in GitHub Actions with `go test -json` to get machine-readable results:

```yaml
name: LLM Evals
on: [push]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.25'
      - name: Run evals
        env:
          LLMTEST: "1"
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: go test -json ./... | tee eval-results.json
```

The `go test -json` output contains `llmtest.*` attributes for each scorer result. Parse them downstream for dashboards, alerts, or trend tracking.
| Feature | llmtest | promptfoo | deepeval | goeval | maragu.dev/gai/eval |
|---|---|---|---|---|---|
| Language | Go | YAML/Node | Python | Go | Go |
| Runs in `go test` | ✅ | ❌ (separate CLI) | ❌ (pytest) | ❌ | ✅ |
| Structured attrs (`T.Attr`) | ✅ | ❌ | ❌ | ❌ | ❌ |
| LLM judge | ✅ | ✅ | ✅ | ✅ | |
| Deterministic scorers | ✅ | ✅ | ✅ | | |
| Config format | Go code | YAML | Python | Go code | Go code |