Rubrics

Rubrics are defined with assert entries and support two grading modes: binary checklist grading and score-range (analytic) grading.

Basic Usage

In the simplest form, list plain strings under assert; each one becomes a required criterion:

tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assert:
      - Mentions divide-and-conquer approach
      - Explains partition step
      - States time complexity

All strings are collected into a single llm-rubric grader automatically.
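The grouping described above can be pictured with a small Python sketch. It is only an illustration of the equivalence between the short and full forms, not AgentV's implementation; the function name collect_string_asserts and the choice to place the grouped entry first are assumptions.

```python
def collect_string_asserts(asserts):
    """Group plain-string assert entries into one llm-rubric entry.

    Hypothetical helper: it mirrors the documented behavior, but the
    name and the ordering of the result are assumptions.
    """
    strings = [a for a in asserts if isinstance(a, str)]
    others = [a for a in asserts if not isinstance(a, str)]
    if not strings:
        return others
    # All plain strings end up in a single llm-rubric grader.
    return [{"type": "llm-rubric", "value": strings}] + others


asserts = [
    "Mentions divide-and-conquer approach",
    "Explains partition step",
    "States time complexity",
]
expanded = collect_string_asserts(asserts)
print(expanded)
```

The result has the same shape as the explicit type: llm-rubric form with the three strings listed under value.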

Full Form for Advanced Options

Use type: llm-rubric explicitly when you need weights, required flags, or score ranges. Put structured rubric items under value; this keeps the assertion compatible with promptfoo’s llm-rubric.value field, which takes an object or an array:

tests:
  - id: quicksort-explain
    criteria: Explain how quicksort works
    input: Explain quicksort algorithm
    assert:
      - type: llm-rubric
        value:
          - Mentions divide-and-conquer approach
          - Explains partition step
          - States time complexity

Checklist Mode

For fine-grained control, use rubric objects with weights and required flags:

assert:
  - type: llm-rubric
    value:
      - id: core-concept
        outcome: Explains divide-and-conquer
        weight: 2.0
        required: true
      - id: partition
        outcome: Describes partition step
        weight: 1.5
      - id: complexity
        outcome: States O(n log n) average time
        weight: 1.0

Rubric Object Fields

Field	Default	Description
`id`	Auto-generated	Unique identifier for the criterion
`outcome`	—	Description of what to check
`operator`	—	Optional intent hint: `correctness` or `contradiction`
`weight`	`1.0`	Relative importance for scoring
`required`	`true`	If true, failing this criterion fails the entire eval
`min_score`	—	Minimum score (0–1) for this criterion to pass
`score_ranges`	—	Score range definitions (analytic mode)

weight controls score contribution. required controls the verdict: a required checklist criterion can fail the eval even when the weighted score would otherwise pass.

llm-rubric.value can also be free-form text or an arbitrary JSON/YAML object. AgentV uses structured scoring when value is an array of rubric strings or rubric objects with fields such as outcome and score_ranges; other objects are passed to the rubric prompt as rubric data.
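As a rough mental model of how the shape of value selects the handling, consider the Python sketch below. The function name, the returned labels, the test for an outcome key, and the treatment of an empty array are all assumptions made for illustration; AgentV's actual detection logic may differ.

```python
def classify_rubric_value(value):
    """Return how a llm-rubric value would be handled (hypothetical labels)."""
    # Free-form text is used as the rubric as written.
    if isinstance(value, str):
        return "free-form text"
    # A non-empty array of rubric strings or rubric objects is scored
    # per criterion. Checking for an "outcome" key is an assumption.
    if isinstance(value, list) and value and all(
        isinstance(item, str) or (isinstance(item, dict) and "outcome" in item)
        for item in value
    ):
        return "structured scoring"
    # Anything else is treated here as rubric data for the prompt.
    return "rubric data"


print(classify_rubric_value("Answer must be polite"))
print(classify_rubric_value([{"id": "accuracy", "outcome": "Provides correct answer"}]))
print(classify_rubric_value({"tone": "formal", "audience": "engineers"}))
```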

To replace AgentV’s built-in rubric judge prompt for a suite, set a default prompt override:

default_test:
  options:
    rubric_prompt: file://graders/judge-prompt.txt

For shared defaults, put the partial default test in a separate file:

options:
  rubric_prompt: file://graders/judge-prompt.txt

Then reference it from each eval:

default_test: file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml

AgentV makes AGENTV_REPO_ROOT available while loading eval and config files. If a project wants a shorter name, add a reference in .agentv/config.yaml. Reference names are project-defined; global-default is just an example:

refs:
  global-default: file://{{ env.AGENTV_REPO_ROOT }}/.agentv/default-test.yaml

Then use:

default_test: ref://global-default

Most evals do not need this. Plain strings and structured value arrays use the built-in prompt automatically.

Criterion Operators

Use operator when the criterion outcome should be interpreted with a specific grading intent instead of relying on the wording in outcome.

assert:
  - type: llm-rubric
    value:
      - id: supported-revenue
        operator: correctness
        outcome: States revenue increased to $10M
        required: true
      - id: no-revenue-conflict
        operator: contradiction
        outcome: Revenue increased to $10M
        required: true

correctness requires the answer to positively satisfy the outcome. contradiction is a guard: the answer passes when it does not make an incompatible claim, even if it omits the outcome entirely.
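The difference between the two operators can be summarized as a truth table. The Python sketch below assumes the judge reaches one of three hypothetical findings about the answer ("supported", "omitted", "contradicted"); those labels and the function name are illustrative and not part of AgentV.

```python
def criterion_passes(operator, finding):
    """Decide pass/fail for one criterion.

    finding is a hypothetical label for what the judge observed:
    "supported", "omitted", or "contradicted".
    """
    if operator == "correctness":
        # The answer must positively satisfy the outcome.
        return finding == "supported"
    if operator == "contradiction":
        # Guard: passes unless the answer makes an incompatible claim.
        return finding != "contradicted"
    raise ValueError(f"unknown operator: {operator}")


for finding in ("supported", "omitted", "contradicted"):
    print(
        finding,
        criterion_passes("correctness", finding),
        criterion_passes("contradiction", finding),
    )
```

The only case where the operators disagree is an answer that omits the outcome: it fails a correctness criterion and passes a contradiction criterion.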

Score-Range Mode (Analytic)

For quality gradients instead of binary pass/fail, use score ranges:

assert:
  - type: llm-rubric
    value:
      - id: accuracy
        outcome: Provides correct answer
        weight: 2.0
        score_ranges:
          0: Completely wrong
          3: Partially correct with major errors
          5: Mostly correct with minor issues
          7: Correct with minor omissions
          10: Perfectly accurate and complete

Each criterion is scored 0–10 by the LLM grader with granular feedback.

Scoring

Checklist Mode

score = sum(satisfied_weights) / sum(total_weights)
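As a worked example of this formula, the Python sketch below reuses the weights from the checklist example (2.0, 1.5, 1.0) and assumes the first two criteria are satisfied. The satisfied flags and the function name are illustrative inputs, not grader output.

```python
def checklist_score(criteria):
    """Weighted fraction of satisfied criteria."""
    total = sum(c["weight"] for c in criteria)
    satisfied = sum(c["weight"] for c in criteria if c["satisfied"])
    return satisfied / total


criteria = [
    {"id": "core-concept", "weight": 2.0, "satisfied": True},
    {"id": "partition", "weight": 1.5, "satisfied": True},
    {"id": "complexity", "weight": 1.0, "satisfied": False},  # assumed miss
]
score = checklist_score(criteria)
print(round(score, 3))  # 0.778, i.e. (2.0 + 1.5) / 4.5
```

Missing only the lowest-weight criterion already drops the score below 0.8 in this example.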

Score-Range Mode

score = sum(criterion_score / 10 * weight) / sum(total_weights)
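A worked example of the score-range formula, as a Python sketch. The accuracy criterion and its weight come from the example above; the clarity criterion and both 0–10 scores are hypothetical values chosen for illustration.

```python
def analytic_score(criteria):
    """Weighted mean of per-criterion scores, each normalized from 0-10 to 0-1."""
    total = sum(c["weight"] for c in criteria)
    weighted = sum(c["score"] / 10 * c["weight"] for c in criteria)
    return weighted / total


criteria = [
    {"id": "accuracy", "weight": 2.0, "score": 7},
    {"id": "clarity", "weight": 1.0, "score": 5},  # hypothetical criterion
]
score = analytic_score(criteria)
print(round(score, 3))  # 0.633, i.e. (0.7 * 2.0 + 0.5 * 1.0) / 3.0
```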

Verdicts

Verdict	Score
`pass`	≥ 0.8
`fail`	< 0.8
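The threshold combines with the required flag described under Rubric Object Fields. The Python sketch below shows one way to read the two rules together; the function name and the required_failed parameter are assumptions, and it is not AgentV's code.

```python
PASS_THRESHOLD = 0.8  # from the verdict table


def verdict(score, required_failed=False):
    """Map a 0-1 score to a verdict.

    required_failed is a hypothetical flag meaning that at least one
    required criterion was not satisfied.
    """
    if required_failed:
        # A failed required criterion fails the eval regardless of score.
        return "fail"
    return "pass" if score >= PASS_THRESHOLD else "fail"


print(verdict(0.8))                        # pass
print(verdict(0.79))                       # fail
print(verdict(0.9, required_failed=True))  # fail
```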

Authoring Rubrics

Write rubric criteria directly in assert. If you want help choosing between plain rubric strings, deterministic graders, and LLM-based grading, use the agentv-eval-writer skill. Let the criteria determine which grader you use rather than following one fixed recipe.

Context Available to Rubric Graders

Rubric checks automatically receive the full evaluation context, not just the agent’s text answer. When present, the following are appended to the grader prompt:

file_changes — unified diff of workspace file changes (when workspace is configured)
tool_calls — formatted summary of tool calls from agent execution (tool name + key inputs)

This means rubric criteria can reason about what the agent did, not only what it said. For example, you can check whether an agent invoked a specific skill:

assert:
  - The agent invoked the acme-deploy skill
  - The agent used Read to inspect the config file before editing

This is a lightweight alternative to the skill-trigger evaluator when you want to check tool usage with natural-language criteria.
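To make the "appended when present" behavior concrete, here is a minimal Python sketch of how such a grader context could be assembled. The function name, the section labels, and the layout are assumptions; the real prompt format is defined by AgentV and is not shown in this document.

```python
def build_grader_context(answer, file_changes=None, tool_calls=None):
    """Assemble the text a rubric grader sees (hypothetical layout)."""
    sections = [f"answer:\n{answer}"]
    # Each extra section is appended only when it is present.
    if file_changes:
        sections.append(f"file_changes:\n{file_changes}")
    if tool_calls:
        sections.append(f"tool_calls:\n{tool_calls}")
    return "\n\n".join(sections)


context = build_grader_context(
    answer="Deployed the service.",
    tool_calls="Read(path=config.yaml)\nSkill(name=acme-deploy)",
)
print(context)
```

Because the tool call summary is part of the context, a criterion such as "The agent invoked the acme-deploy skill" has something concrete to be judged against.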

Combining with Other Graders

Rubrics work alongside script and LLM graders:

tests:
  - id: code-quality
    criteria: Generates correct, clean Python code
    input: Write a fibonacci function
    assert:
      - type: llm-rubric
        value:
          - Returns correct values for n=0,1,2,10
          - Uses meaningful variable names
          - Includes docstring
      - name: syntax_check
        type: script
        command: [./validators/check_python.py]