llmtest

LLM testing for Go. Extends go test with eval scorers, no new tools to learn.

Install

go get github.com/adamwoolhether/llmtest

Quick Start

Already using strings.Contains in your tests? Replace it with llmtest.Contains. Same logic, now with structured results, scorer metrics, and a path to LLM-as-judge when you're ready.

package myapp_test

import (
    "testing"

    "github.com/adamwoolhether/llmtest"
)

func TestGreeting(t *testing.T) {
    llmtest.Run(t, "polite", func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        "Say hello",
            ActualOutput: callMyLLM("Say hello"),
        })
        e.Require(llmtest.Contains("hello"))
        e.Check(llmtest.LengthBetween(1, 500))
    })
}

Run with:

LLMTEST=1 go test ./...

Tests are skipped unless LLMTEST=1 is set, so evals never run accidentally during normal development.

Zero-Cost Quickstart (Ollama)

No API key needed. Run evals locally with Ollama:

# With Ollama installed (ollama.com), pull a model
ollama pull llama3.2

# Run your eval tests (llmtest auto-detects Ollama)
LLMTEST=1 LLMTEST_PROVIDER=ollama go test -v ./...

Free, private, runs on your laptop. Swap to OpenAI or Anthropic later by setting the API key.

Core Workflow

Run(t, name, func(e *E) {
    e.Case(tc)            // set the test case
    e.Config(opts...)     // optional: set eval-level options
    e.Require(scorer)     // hard constraint: stops on failure
    e.Check(scorer)       // soft constraint: records but continues
})

Require vs Check: Require calls t.FailNow on failure, so the test stops immediately. Check marks the test as failed but does not stop it; remaining scorers still execute. Use Require for hard constraints and Check for soft/informational metrics.

Run does not call t.Parallel(). Call it on t before Run if you want parallel subtests.
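
A minimal sketch of opting in (standard `testing` semantics; `out` stands in for your model's response):

```go
func TestPoliteness(t *testing.T) {
    t.Parallel() // opt in here; Run never calls t.Parallel for you

    llmtest.Run(t, "polite", func(e *llmtest.E) {
        e.Case(llmtest.TestCase{Input: "Say hello", ActualOutput: out})
        e.Require(llmtest.Contains("hello"))
    })
}
```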

Verdicts

Every scorer returns one of three verdicts:

| Verdict | Score | Meaning |
|---|---|---|
| Pass | 1.0 | Criterion fully met |
| Partial | 0.5 | Criterion partially met |
| Fail | 0.0 | Criterion not met |

Deterministic scorers only return Pass or Fail. The Partial verdict is used by LLM-based scorers like Rubric.

TestCase Fields

| Field | Type | Description |
|---|---|---|
| Input | `string` | The prompt sent to the LLM under test |
| ActualOutput | `string` | The LLM's response (the text being evaluated) |
| ExpectedOutput | `string` | Reference/ideal answer (optional) |
| Context | `[]string` | Background info given to the LLM (optional) |
| RetrievalContext | `[]string` | Documents retrieved by a RAG pipeline (optional) |
| Metadata | `map[string]any` | Arbitrary data for custom scorers |
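
For RAG-style evals, the optional fields carry the retrieval side of the story. A sketch (field values are illustrative, and `answer` stands in for your pipeline's response):

```go
e.Case(llmtest.TestCase{
    Input:            "What is our refund window?",
    ActualOutput:     answer,
    ExpectedOutput:   "Refunds are accepted within 30 days.",
    Context:          []string{"You are a support assistant."},
    RetrievalContext: []string{"Policy doc: refunds accepted within 30 days of purchase."},
    Metadata:         map[string]any{"pipeline": "v2"},
})
e.Require(llmtest.Contains("30 days"))
```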

Deterministic Scorers

These run locally with no network calls.

Substring

| Scorer | Description |
|---|---|
| `Contains(s)` | Output contains substring `s` |
| `ContainsAll(s...)` | Output contains every substring |
| `ContainsAny(s...)` | Output contains at least one substring |
| `NotContains(s)` | Output does not contain `s` |
| `NotContainsAny(s...)` | Output contains none of the substrings |
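
These combine naturally in one eval; a sketch mixing a hard and a soft substring constraint:

```go
e.Require(llmtest.ContainsAll("hello", "thanks")) // hard: both must appear
e.Check(llmtest.NotContainsAny("error", "sorry")) // soft: flag apologetic output
```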

Regex

| Scorer | Description |
|---|---|
| `MatchesRegex(p)` | Output matches regex pattern `p` |
| `NotMatchesRegex(p)` | Output does not match regex pattern `p` |

JSON

| Scorer | Description |
|---|---|
| `IsJSON()` | Output is valid JSON |
| `MatchesJSONSchema(s)` | Output conforms to JSON Schema `s` (reports schema errors at scoring time) |
| `MustMatchJSONSchema(s)` | Same, but panics on an invalid schema (fail fast, e.g. during package init) |

Structure

| Scorer | Description |
|---|---|
| `LengthBetween(min, max)` | Output byte length is in `[min, max]` |
| `ContainsTag(tag)` | Output has a `<tag>...</tag>` pair |
| `ExtractTag(tag, inner)` | Extract the tag's content, then apply the `inner` scorer |

Scorer Composition

ExtractTag enables composition: extract content from a tag, then evaluate it:

// Verify the <answer> tag contains "42"
llmtest.ExtractTag("answer", llmtest.Contains("42"))

Rubric Scorer

Rubric is an LLM-based scorer that sends the test case to a judge model for criterion evaluation.

e.Require(llmtest.Rubric("Response is polite and professional"))

The judge returns one of three verdicts:

  • Pass (1.0): criterion fully met
  • Partial (0.5): criterion partially met
  • Fail (0.0): criterion not met

Options

| Option | Description |
|---|---|
| `Model(m)` | Override the judge model for this scorer |
| `Provider(p)` | Override the provider for this scorer |
| `Threshold(t)` | Minimum score to pass (default 1.0; set to 0.5 to accept Partial) |
| `ConsistencyCheck(n)` | Run the judge n times and take the majority verdict |

e.Require(llmtest.Rubric("Is polite",
    llmtest.Model("gpt-4o"),
    llmtest.Threshold(0.5),
))

LLM calls include automatic rate-limit retry with backoff and JSON repair (re-prompts if the judge returns invalid JSON).

Configuration Priority

LLM-based scorers resolve model and provider in this order:

  1. Scorer-level: Model(), Provider() options on individual scorers
  2. Eval-level: EvalModel(), EvalProvider() via e.Config()
  3. Environment: LLMTEST_MODEL, LLMTEST_PROVIDER
  4. Auto-detection: probe API keys / local services
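
A sketch of eval-level configuration (option names from the list above; the criterion and model string are illustrative):

```go
llmtest.Run(t, "judged", func(e *llmtest.E) {
    e.Case(tc)
    // Eval-level defaults for every LLM-based scorer in this eval;
    // a scorer-level Model()/Provider() option still takes precedence.
    e.Config(llmtest.EvalModel("gpt-4.1-mini"), llmtest.EvalProvider(llmtest.OpenAI()))
    e.Require(llmtest.Rubric("Answer is concise"))
})
```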

Providers

| Provider | Constructor | Default Model | Required Env Var |
|---|---|---|---|
| OpenAI | `OpenAI()` | gpt-4.1-mini | `OPENAI_API_KEY` |
| Anthropic | `Anthropic()` | claude-sonnet-4-5-20250929 | `ANTHROPIC_API_KEY` |
| Ollama | `Ollama()` | llama3.2 | (none; local service) |

Auto-detection order when LLMTEST_PROVIDER is unset:

  1. OPENAI_API_KEY present → OpenAI
  2. ANTHROPIC_API_KEY present → Anthropic
  3. Ollama reachable → Ollama

Most users don't need to import a provider constructor directly. Just set the environment variable.

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `LLMTEST` | Set to `1` to enable eval tests | unset (skip all) |
| `LLMTEST_PROVIDER` | Provider: `openai`, `anthropic`, or `ollama` | auto-detect |
| `LLMTEST_MODEL` | Override the default model | provider default |
| `LLMTEST_CONCURRENCY` | Max parallel LLM calls | 5 |
| `LLMTEST_NO_CACHE` | Set to `1` to bypass the response cache | unset (cache enabled) |
| `LLMTEST_OUTPUT` | Path to write a JSON summary | unset (no file) |
| `LLMTEST_OLLAMA_URL` | Ollama endpoint | `http://localhost:11434` |

Patterns

Table-Driven Tests

cases := []struct {
    name   string
    input  string
    output string
}{
    {"greeting", "Say hi", "Hello there!"},
    {"farewell", "Say bye", "Goodbye!"},
}

for _, tc := range cases {
    llmtest.Run(t, tc.name, func(e *llmtest.E) {
        e.Case(llmtest.TestCase{
            Input:        tc.input,
            ActualOutput: tc.output,
        })
        e.Require(llmtest.LengthBetween(1, 200))
    })
}

Call Run directly in the loop. Do not wrap it in another t.Run, or you'll get double-nested test names.

Combining Scorers

llmtest.Run(t, "structured_response", func(e *llmtest.E) {
    e.Case(llmtest.TestCase{
        Input:        "Explain Go interfaces",
        ActualOutput: response,
    })
    e.Require(llmtest.IsJSON())
    e.Require(llmtest.Contains("interface"))
    e.Check(llmtest.LengthBetween(100, 2000))
    e.Check(llmtest.Rubric("Explanation is clear and accurate"))
})

JSON Output

Collect structured results by setting LLMTEST_OUTPUT and calling Flush from TestMain:

func TestMain(m *testing.M) {
    code := m.Run()
    llmtest.Flush()
    os.Exit(code)
}
LLMTEST=1 LLMTEST_OUTPUT=results.json go test ./...

Caching & Concurrency

  • LLM responses are cached to disk in a .llmtest-cache/ directory (created in the working directory). Cache keys are derived from scorer name, prompt, and model. Entries expire after 24 hours. Add .llmtest-cache/ to your .gitignore.
  • Set LLMTEST_NO_CACHE=1 to bypass both reads and writes.
  • Concurrent LLM calls are limited by LLMTEST_CONCURRENCY (default: 5).

Structured Test Attributes (T.Attr)

Every scorer call emits structured key-value attributes via testing.T.Attr (Go 1.25+):

| Attribute Key | Value |
|---|---|
| `llmtest.scorer` | Scorer name, e.g. `Contains("hello")` |
| `llmtest.verdict` | `PASS`, `PARTIAL`, or `FAIL` |
| `llmtest.score` | Numeric score: 1.00, 0.50, or 0.00 |
| `llmtest.reason` | Human-readable explanation |
| `llmtest.tokens` | LLM tokens consumed (0 for deterministic scorers) |
| `llmtest.latency_ms` | Scorer wall-clock time in milliseconds |

Attributes are visible in go test -v output and machine-readable via go test -json. Eval results are structured data inside go test, not a separate tool.

CI Integration

Run evals in GitHub Actions with go test -json to get machine-readable results:

name: LLM Evals
on: [push]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.25'
      - name: Run evals
        env:
          LLMTEST: "1"
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: go test -json ./... | tee eval-results.json

The go test -json output contains llmtest.* attributes for each scorer result. Parse them downstream for dashboards, alerts, or trend tracking.

Comparison

| Feature | llmtest | promptfoo | deepeval | goeval | maragu.dev/gai/eval |
|---|---|---|---|---|---|
| Language | Go | YAML/Node | Python | Go | Go |
| Runs in `go test` | ✅ | ❌ (separate CLI) | ❌ (pytest) | ✅ | ✅ |
| Structured attrs (`T.Attr`) | ✅ | ❌ | ❌ | ❌ | ❌ |
| LLM judge | ✅ | ✅ | ✅ | ❌ | ⚠️ minimal |
| Deterministic scorers | ✅ | ✅ | ✅ | ⚠️ limited | ⚠️ minimal |
| Config format | Go code | YAML | Python | Go code | Go code |
