Rubrics

Rubrics are defined with assert entries and support binary checklist grading and score-range analytic grading.

In the simplest form, list plain strings under assert; each one becomes a required criterion:

tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assert:
      - Mentions divide-and-conquer approach
      - Explains partition step
      - States time complexity

All strings are collected into a single llm-rubric grader automatically.

Use type: llm-rubric explicitly when you need weights, required flags, or score ranges. Put structured rubric items in value so the assertion stays compatible with promptfoo’s llm-rubric.value object/array field:

tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assert:
      - type: llm-rubric
        value:
          - Mentions divide-and-conquer approach
          - Explains partition step
          - States time complexity

For fine-grained control, use rubric objects with weights and requirements:

assert:
  - type: llm-rubric
    value:
      - id: core-concept
        outcome: Explains divide-and-conquer
        weight: 2.0
        required: true
      - id: partition
        outcome: Describes partition step
        weight: 1.5
      - id: complexity
        outcome: States O(n log n) average time
        weight: 1.0

| Field | Default | Description |
| --- | --- | --- |
| id | Auto-generated | Unique identifier for the criterion |
| outcome | | Description of what to check |
| operator | | Optional intent hint: correctness or contradiction |
| weight | 1.0 | Relative importance for scoring |
| required | true | If true, failing this criterion fails the entire eval |
| min_score | | Minimum score (0–1) for this criterion to pass |
| score_ranges | | Score range definitions (analytic mode) |

weight controls score contribution. required controls the verdict: a required checklist criterion can fail the eval even when the weighted score would otherwise pass.
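
The interaction between weight and required can be sketched in a few lines of Python. This is an illustrative model only, not AgentV's implementation: the names ChecklistResult and checklist_verdict are hypothetical, the criterion ids are made up, and the 0.8 cutoff is the pass threshold from the verdict table on this page.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 0.8  # pass/fail cutoff taken from the verdict table on this page


@dataclass
class ChecklistResult:
    """One graded checklist criterion (hypothetical shape, for illustration)."""
    id: str
    satisfied: bool
    weight: float = 1.0
    required: bool = True  # default listed in the field table


def checklist_verdict(results):
    """Return (score, verdict) for a list of binary checklist results."""
    total = sum(r.weight for r in results)
    score = sum(r.weight for r in results if r.satisfied) / total
    # Assumption: any failed required criterion forces a "fail" verdict,
    # regardless of the weighted score.
    required_failed = any(r.required and not r.satisfied for r in results)
    verdict = "pass" if score >= PASS_THRESHOLD and not required_failed else "fail"
    return score, verdict


# Hypothetical criteria: the weighted score is 8/9, above the threshold,
# but the failed required criterion still fails the eval.
gated = [
    ChecklistResult("style", satisfied=True, weight=4.0, required=False),
    ChecklistResult("tests", satisfied=True, weight=4.0, required=False),
    ChecklistResult("safety", satisfied=False, weight=1.0, required=True),
]
gated_score, gated_verdict = checklist_verdict(gated)

# Same results with the failed criterion marked optional: only the score matters.
ungated = [
    ChecklistResult("style", satisfied=True, weight=4.0, required=False),
    ChecklistResult("tests", satisfied=True, weight=4.0, required=False),
    ChecklistResult("safety", satisfied=False, weight=1.0, required=False),
]
ungated_score, ungated_verdict = checklist_verdict(ungated)
```

In this sketch both runs have the same score, and only the required flag changes the verdict.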

llm-rubric.value can also be free-form text or an arbitrary JSON/YAML object. AgentV uses structured scoring when value is an array of rubric strings or rubric objects with fields such as outcome and score_ranges; other objects are passed to the rubric prompt as rubric data.

To replace AgentV’s built-in rubric judge prompt for a suite, set a default prompt override:

default_test:
  options:
    rubric_prompt: file://graders/judge-prompt.txt

For shared defaults, put the partial default test in a separate file:

# .agentv/default-test.yaml
options:
  rubric_prompt: file://graders/judge-prompt.txt

Then reference it from each eval:

default_test: file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml

AgentV makes AGENTV_REPO_ROOT available while loading eval and config files. If a project wants a shorter name, add a reference in .agentv/config.yaml. Reference names are project-defined; global-default is just an example:

refs:
  global-default: file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml

Then use:

default_test: ref://global-default

Most evals do not need this. Plain strings and structured value arrays use the built-in prompt automatically.

Use operator when the criterion outcome should be interpreted with a specific grading intent instead of relying on the wording in outcome.

assert:
  - type: llm-rubric
    value:
      - id: supported-revenue
        operator: correctness
        outcome: States revenue increased to $10M
        required: true
      - id: no-revenue-conflict
        operator: contradiction
        outcome: Revenue increased to $10M
        required: true

correctness requires the answer to positively satisfy the outcome. contradiction is a guard: the answer passes when it does not make an incompatible claim, even if it omits the outcome entirely.
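
The difference between the two operators can be written as a small truth table. The Python below is a hedged sketch of the pass rule described above, not AgentV code: in practice the LLM grader judges how the answer relates to the outcome, and the Stance enum and criterion_passes function are hypothetical names used only to model that judgment.

```python
from enum import Enum


class Stance(Enum):
    """How an answer relates to a criterion's outcome (hypothetical model)."""
    SUPPORTS = "supports"    # the answer positively states the outcome
    OMITS = "omits"          # the answer never addresses the outcome
    CONFLICTS = "conflicts"  # the answer makes an incompatible claim


def criterion_passes(operator: str, stance: Stance) -> bool:
    """Apply the pass rule for each operator as described in the text."""
    if operator == "correctness":
        # Must positively satisfy the outcome; omission is a failure.
        return stance is Stance.SUPPORTS
    if operator == "contradiction":
        # Guard: passes unless the answer conflicts, even if it omits the outcome.
        return stance is not Stance.CONFLICTS
    raise ValueError(f"unknown operator: {operator}")


# Build the full table for both operators.
table = {
    (op, stance.value): criterion_passes(op, stance)
    for op in ("correctness", "contradiction")
    for stance in Stance
}
```

Under this model, an answer that never mentions revenue fails supported-revenue but passes no-revenue-conflict.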

For quality gradients instead of binary pass/fail, use score ranges:

assert:
  - type: llm-rubric
    value:
      - id: accuracy
        outcome: Provides correct answer
        weight: 2.0
        score_ranges:
          0: Completely wrong
          3: Partially correct with major errors
          5: Mostly correct with minor issues
          7: Correct with minor omissions
          10: Perfectly accurate and complete

Each criterion is scored 0–10 by the LLM grader with granular feedback.

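In checklist mode, the score is the weighted fraction of satisfied criteria:

score = sum(satisfied_weights) / sum(total_weights)

In score-range (analytic) mode, each criterion's 0–10 score is normalized to 0–1 before weighting:

score = sum(criterion_score / 10 * weight) / sum(total_weights)

The verdict is derived from the score:

| Verdict | Score |
| --- | --- |
| pass | ≥ 0.8 |
| fail | < 0.8 |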
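
As a worked example of the score-range formula and the 0.8 threshold, here is a minimal Python sketch. It is not AgentV's implementation: the function names and the (criterion_score, weight) tuple layout are assumptions made for illustration, and the sample scores are invented.

```python
PASS_THRESHOLD = 0.8  # scores at or above this value pass


def analytic_score(criteria):
    """Weighted score for score-range criteria.

    `criteria` is a list of (criterion_score, weight) pairs, where
    criterion_score is the grader's 0-10 rating (hypothetical layout).
    """
    total_weight = sum(weight for _, weight in criteria)
    return sum(score / 10 * weight for score, weight in criteria) / total_weight


def verdict(score):
    """Map a 0-1 score to the pass/fail verdict."""
    return "pass" if score >= PASS_THRESHOLD else "fail"


# Invented ratings: 8/10 on a weight-2.0 criterion, 5/10 on a weight-1.0 one.
# (0.8 * 2.0 + 0.5 * 1.0) / 3.0 is about 0.70, which is below the threshold.
weak = analytic_score([(8, 2.0), (5, 1.0)])

# Invented ratings: 9/10 and 8/10 with the same weights.
# (0.9 * 2.0 + 0.8 * 1.0) / 3.0 is about 0.87, which passes.
strong = analytic_score([(9, 2.0), (8, 1.0)])

weak_verdict = verdict(weak)
strong_verdict = verdict(strong)
```

Because the heavier criterion dominates the sum, raising its rating moves the final score more than the same change on the lighter one.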

Write rubric criteria directly in assert. If you want help choosing between plain rubric strings, deterministic graders, and LLM-based grading, use the agentv-eval-writer skill. Keep the grader choice driven by the criteria rather than one fixed recipe.

Rubric checks automatically receive the full evaluation context, not just the agent’s text answer. When present, the following are appended to the grader prompt:

  • file_changes — unified diff of workspace file changes (when workspace is configured)
  • tool_calls — formatted summary of tool calls from agent execution (tool name + key inputs)

This means rubric criteria can reason about what the agent did, not only what it said. For example, you can check whether an agent invoked a specific skill:

assert:
  - The agent invoked the acme-deploy skill
  - The agent used Read to inspect the config file before editing

This is a lightweight alternative to the skill-trigger evaluator when you want to check tool usage with natural-language criteria.

Rubrics work alongside script and LLM graders:

tests:
  - id: code-quality
    criteria: Generates correct, clean Python code
    input: Write a fibonacci function
    assert:
      - type: llm-rubric
        value:
          - Returns correct values for n=0,1,2,10
          - Uses meaningful variable names
          - Includes docstring
      - name: syntax_check
        type: script
        command: [./validators/check_python.py]