LLM testing for Go. Extends `go test` with eval scorers, no new tools to learn.
```sh
go get github.com/adamwoolhether/llmtest
```

Already using `strings.Contains` in your tests? Replace it with `llmtest.Contains`. Same logic, now with structured results, scorer metrics, and a path to LLM-as-judge when you're ready.
```go
package myapp_test

import (
	"testing"

	"github.com/adamwoolhether/llmtest"
)

func TestGreeting(t *testing.T) {
	llmtest.Run(t, "polite", func(e *llmtest.E) {
		e.Case(llmtest.TestCase{
			Input:        "Say hello",
			ActualOutput: callMyLLM("Say hello"),
		})
		e.Require(llmtest.Contains("hello"))
		e.Check(llmtest.LengthBetween(1, 500))
	})
}
```

Run with:

```sh
LLMTEST=1 go test ./...
```

Tests are skipped unless `LLMTEST=1` is set, so evals never run accidentally during normal development.
No API key needed. Run evals locally with Ollama:

```sh
# Install Ollama and pull a model
ollama pull llama3.2

# Run your eval tests (llmtest auto-detects Ollama)
LLMTEST=1 LLMTEST_PROVIDER=ollama go test -v ./...
```

Free, private, runs on your laptop. Swap to OpenAI or Anthropic later by setting the API key.
```go
Run(t, name, func(e *E) {
	e.Case(tc)        // set the test case
	e.Config(opts...) // optional: set eval-level options
	e.Require(scorer) // hard constraint: stops on failure
	e.Check(scorer)   // soft constraint: records but continues
})
```
Require vs Check: `Require` calls `t.FailNow` on failure, so the test stops immediately. `Check` marks the test as failed but does not stop it; remaining scorers still execute. Use `Require` for hard constraints and `Check` for soft/informational metrics.

`Run` does not call `t.Parallel()`. Call it on `t` before `Run` if you want parallel subtests.
Every scorer returns one of three verdicts:

| Verdict | Score | Meaning |
|---|---|---|
| Pass | 1.0 | Criterion fully met |
| Partial | 0.5 | Criterion partially met |
| Fail | 0.0 | Criterion not met |

Deterministic scorers only return Pass or Fail. The Partial verdict is used by LLM-based scorers like `Rubric`.
| Field | Type | Description |
|---|---|---|
| `Input` | `string` | The prompt sent to the LLM under test |
| `ActualOutput` | `string` | The LLM's response (the text being evaluated) |
| `ExpectedOutput` | `string` | Reference/ideal answer (optional) |
| `Context` | `[]string` | Background info given to the LLM (optional) |
| `RetrievalContext` | `[]string` | Documents retrieved by a RAG pipeline (optional) |
| `Metadata` | `map[string]any` | Arbitrary data for custom scorers |
These run locally with no network calls.

| Scorer | Description |
|---|---|
| `Contains(s)` | Output contains substring `s` |
| `ContainsAll(s...)` | Output contains every substring |
| `ContainsAny(s...)` | Output contains at least one substring |
| `NotContains(s)` | Output does not contain `s` |
| `NotContainsAny(s...)` | Output contains none of the substrings |
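The multi-substring scorers are plain substring checks. A minimal sketch of the `ContainsAll`/`ContainsAny` logic using the standard library (the lowercase function names here are illustrative, not the library's internals):

```go
package main

import (
	"fmt"
	"strings"
)

// containsAll reports whether output contains every substring.
func containsAll(output string, subs ...string) bool {
	for _, s := range subs {
		if !strings.Contains(output, s) {
			return false
		}
	}
	return true
}

// containsAny reports whether output contains at least one substring.
func containsAny(output string, subs ...string) bool {
	for _, s := range subs {
		if strings.Contains(output, s) {
			return true
		}
	}
	return false
}

func main() {
	out := "Hello there, how can I help?"
	fmt.Println(containsAll(out, "Hello", "help"))  // true
	fmt.Println(containsAny(out, "goodbye", "help")) // true
}
```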
| Scorer | Description |
|---|---|
| `MatchesRegex(p)` | Output matches regex pattern `p` |
| `NotMatchesRegex(p)` | Output does not match regex pattern `p` |
| Scorer | Description |
|---|---|
| `IsJSON()` | Output is valid JSON |
| `MatchesJSONSchema(s)` | Output conforms to JSON Schema `s` (reports schema errors at scoring time) |
| `MustMatchJSONSchema(s)` | Same, but panics on an invalid schema (compile-time safety) |
| Scorer | Description |
|---|---|
| `LengthBetween(min, max)` | Output byte length is in `[min, max]` |
| `ContainsTag(tag)` | Output has a `<tag>...</tag>` pair |
| `ExtractTag(tag, inner)` | Extract tag content, then apply `inner` scorer |

`ExtractTag` enables composition: extract content from a tag, then evaluate it:

```go
// Verify the <answer> tag contains "42"
llmtest.ExtractTag("answer", llmtest.Contains("42"))
```

`Rubric` is an LLM-based scorer that sends the test case to a judge model for criterion evaluation.
```go
e.Require(llmtest.Rubric("Response is polite and professional"))
```

The judge returns one of three verdicts:

- Pass (1.0): criterion fully met
- Partial (0.5): criterion partially met
- Fail (0.0): criterion not met
| Option | Description |
|---|---|
| `Model(m)` | Override the judge model for this scorer |
| `Provider(p)` | Override the provider for this scorer |
| `Threshold(t)` | Minimum score to pass (default 1.0; set to 0.5 to accept Partial) |
| `ConsistencyCheck(n)` | Run `n` times, take the majority verdict |
```go
e.Require(llmtest.Rubric("Is polite",
	llmtest.Model("gpt-4o"),
	llmtest.Threshold(0.5),
))
```

LLM calls include automatic rate-limit retry with backoff and JSON repair (re-prompts if the judge returns invalid JSON).
LLM-based scorers resolve model and provider in this order:

1. Scorer-level: `Model()`, `Provider()` options on individual scorers
2. Eval-level: `EvalModel()`, `EvalProvider()` via `e.Config()`
3. Environment: `LLMTEST_MODEL`, `LLMTEST_PROVIDER`
4. Auto-detection: probe API keys / local services
| Provider | Constructor | Default Model | Required Env Var |
|---|---|---|---|
| OpenAI | `OpenAI()` | `gpt-4.1-mini` | `OPENAI_API_KEY` |
| Anthropic | `Anthropic()` | `claude-sonnet-4-5-20250929` | `ANTHROPIC_API_KEY` |
| Ollama | `Ollama()` | `llama3.2` | (none, local service) |
Auto-detection order when `LLMTEST_PROVIDER` is unset:

1. `OPENAI_API_KEY` present → OpenAI
2. `ANTHROPIC_API_KEY` present → Anthropic
3. Ollama reachable → Ollama

Most users don't need to import a provider directly. Just set the environment variable.
| Variable | Description | Default |
|---|---|---|
| `LLMTEST` | Set to `1` to enable eval tests | (unset = skip all) |
| `LLMTEST_PROVIDER` | Provider: `openai`, `anthropic`, or `ollama` | auto-detect |
| `LLMTEST_MODEL` | Override default model | provider default |
| `LLMTEST_CONCURRENCY` | Max parallel LLM calls | `5` |
| `LLMTEST_NO_CACHE` | Set to `1` to bypass response cache | (unset = cache enabled) |
| `LLMTEST_OUTPUT` | Path to write JSON summary | (unset = no file) |
| `LLMTEST_OLLAMA_URL` | Ollama endpoint | `http://localhost:11434` |
```go
cases := []struct {
	name   string
	input  string
	output string
}{
	{"greeting", "Say hi", "Hello there!"},
	{"farewell", "Say bye", "Goodbye!"},
}

for _, tc := range cases {
	llmtest.Run(t, tc.name, func(e *llmtest.E) {
		e.Case(llmtest.TestCase{
			Input:        tc.input,
			ActualOutput: tc.output,
		})
		e.Require(llmtest.LengthBetween(1, 200))
	})
}
```

Call `Run` directly in the loop. Do not wrap it in another `t.Run`, or you'll get double-nested test names.
```go
llmtest.Run(t, "structured_response", func(e *llmtest.E) {
	e.Case(llmtest.TestCase{
		Input:        "Explain Go interfaces",
		ActualOutput: response,
	})
	e.Require(llmtest.IsJSON())
	e.Require(llmtest.Contains("interface"))
	e.Check(llmtest.LengthBetween(100, 2000))
	e.Check(llmtest.Rubric("Explanation is clear and accurate"))
})
```

Collect structured results by setting `LLMTEST_OUTPUT` and calling `Flush` from `TestMain`:
```go
func TestMain(m *testing.M) {
	code := m.Run()
	llmtest.Flush()
	os.Exit(code)
}
```

```sh
LLMTEST=1 LLMTEST_OUTPUT=results.json go test ./...
```

- LLM responses are cached to disk in a `.llmtest-cache/` directory (created in the working directory). Cache keys are derived from scorer name, prompt, and model. Entries expire after 24 hours. Add `.llmtest-cache/` to your `.gitignore`.
- Set `LLMTEST_NO_CACHE=1` to bypass both reads and writes.
- Concurrent LLM calls are limited by `LLMTEST_CONCURRENCY` (default: 5).
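The exact key scheme is internal, but "derived from scorer name, prompt, and model" suggests a content hash: change any input and the key changes, so stale entries are never reused. An illustrative sketch (llmtest's real scheme may differ):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// cacheKey hashes the inputs that determine an LLM judge response.
// (Illustrative only; not llmtest's actual key derivation.)
func cacheKey(scorer, prompt, model string) string {
	h := sha256.New()
	for _, part := range []string{scorer, prompt, model} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	k1 := cacheKey(`Rubric("Is polite")`, "Say hello", "gpt-4.1-mini")
	k2 := cacheKey(`Rubric("Is polite")`, "Say hello", "llama3.2")
	fmt.Println(k1 != k2) // different model → different key: true
}
```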
Every scorer call emits structured key-value attributes via `testing.T.Attr` (Go 1.25+):

| Attribute Key | Value |
|---|---|
| `llmtest.scorer` | Scorer name, e.g. `Contains("hello")` |
| `llmtest.verdict` | `PASS`, `PARTIAL`, or `FAIL` |
| `llmtest.score` | Numeric score: `1.00`, `0.50`, `0.00` |
| `llmtest.reason` | Human-readable explanation |
| `llmtest.tokens` | LLM tokens consumed (0 for deterministic scorers) |
| `llmtest.latency_ms` | Scorer wall-clock time in ms |
Attributes are visible in `go test -v` output and machine-readable via `go test -json`. Eval results are structured data inside `go test`, not a separate tool.
Run evals in GitHub Actions with `go test -json` to get machine-readable results:

```yaml
name: LLM Evals
on: [push]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.25'
      - name: Run evals
        env:
          LLMTEST: "1"
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: go test -json ./... | tee eval-results.json
```

The `go test -json` output contains `llmtest.*` attributes for each scorer result. Parse them downstream for dashboards, alerts, or trend tracking.
| Feature | llmtest | promptfoo | deepeval | goeval | maragu.dev/gai/eval |
|---|---|---|---|---|---|
| Language | Go | YAML/Node | Python | Go | Go |
| Runs in `go test` | ✅ | ❌ (separate CLI) | ❌ (pytest) | ❌ | ✅ |
| Structured attrs (`T.Attr`) | ✅ | ❌ | ❌ | ❌ | ❌ |
| LLM judge | ✅ | ✅ | ✅ | ✅ | |
| Deterministic scorers | ✅ | ✅ | ✅ | | |
| Config format | Go code | YAML | Python | Go code | Go code |