Releases: willccbb/verifiers

v0.1.3

26 Aug 11:56

Verifiers v0.1.3 Release Notes

Date: August 26, 2025

Verifiers v0.1.3 adds several features that expand functionality and ease of use, along with additional library integrations and bug fixes.

Highlights

  • We now have a TUI! 🎉 Run vf-tui to interactively browse all locally-saved evaluation results in your terminal.
  • Overhauled logging for vf-eval evaluation results with tagged JSON artifact folders.
    • Results default to your environment's project directory under outputs/ when developing locally, or to ./outputs when using an environment installed from elsewhere.
    • The short-lived Markdown report outputs are now deprecated.
  • Multimodal-input tasks are now supported for evaluation (see environments/mmmu for an example)! Official trainer support in verifiers is pending; in the meantime, multimodal training is available via HUD's hud-vf-gym project.
  • Optional async for reward functions, tools, and Environment class methods
    • maybe_await pattern for safe accommodation of both sync and async functions (see the sketch after this list)
    • Sync overrides of env_response and is_completed in MultiTurnEnv still work, but produce a type warning; users are encouraged to migrate these methods to async going forward.
  • Full JSON sampling args in vf-eval via -S (#240).
  • Official community examples library under very active development: prime-environments
  • Native init/push/pull/install support in prime-cli (and more...)
    • Run uv tool install prime for a preview 🙂
  • Feature-complete support for training and online evaluations in prime-rl.
  • Improved caching and parallelization for JudgeRubric.
  • Rubric.class_objects values are available to all reward functions by key name (illustrated after this list).
  • Bug fixes for tool-call sanitization and for saving datasets to Hugging Face.
  • Improvements to documentation.
  • From the recent 0.1.2.post1 pre-release version:
    • StatefulToolEnv for intercepting function calls for routing and state management (#224)
    • Improved lazy imports for efficient evaluation.
    • Overhauled MathRubric to use math-verify as the default reward.
    • Full support restored for completions generation (#201, #196).
  • New required dependencies since 0.1.2: rich, textual, jinja.
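
To make the async accommodation above concrete, here is a minimal sketch of the maybe_await pattern; the helper name comes from the release notes, but the surrounding reward-function signatures are simplified assumptions rather than the exact verifiers API:

```python
import asyncio
import inspect


async def maybe_await(value):
    """Await the value if it is awaitable; otherwise return it unchanged."""
    if inspect.isawaitable(value):
        return await value
    return value


# Hypothetical reward functions: one sync, one async.
def exact_match(completion: str, answer: str) -> float:
    return 1.0 if completion.strip() == answer else 0.0


async def judged_score(completion: str, answer: str) -> float:
    await asyncio.sleep(0)  # stand-in for an async judge/API call
    return 0.5


async def score(fn, completion: str, answer: str) -> float:
    # The same call site safely handles both sync and async functions.
    return await maybe_await(fn(completion, answer))


print(asyncio.run(score(exact_match, "42", "42")))   # 1.0
print(asyncio.run(score(judged_score, "42", "42")))  # 0.5
```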
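
And here is a generic illustration of the class_objects mechanism: objects registered under a key are passed to any reward function that declares a parameter of the same name. The call_with_injection helper and judge_client key are hypothetical stand-ins, shown only to demonstrate the key-name matching, not the verifiers internals:

```python
import inspect


class DummyJudge:
    def grade(self, completion: str) -> float:
        return 1.0 if "42" in completion else 0.0


# Stand-in for Rubric.class_objects: a mapping from key names to shared objects.
class_objects = {"judge_client": DummyJudge()}


def judged_reward(completion, answer, judge_client, **kwargs):
    # judge_client is supplied by key name from class_objects.
    return judge_client.grade(completion)


def call_with_injection(fn, **kwargs):
    # Pass along any class_objects entries whose keys match the reward
    # function's declared parameter names.
    params = inspect.signature(fn).parameters
    extras = {k: v for k, v in class_objects.items() if k in params and k not in kwargs}
    return fn(**kwargs, **extras)


print(call_with_injection(judged_reward, completion="The answer is 42.", answer="42"))  # 1.0
```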

Thanks to everyone who contributed to this release!

Stay tuned for some big announcements in the coming days 😊

Full Changelog: v0.1.2...v0.1.3

v0.1.2.post1

23 Aug 08:17

Verifiers v0.1.2.post1 – Release Notes

Incremental update focused on a new stateful tool environment, environment folder cleanup/renaming, math verification robustness, reporting improvements, and bug fixes.

Highlights

  • Stateful tools: add a stateful tool environment and move tool JSON loading into environment responses (PR #224; see the sketch after this list).
  • Environments: consolidation/renames for clarity and new environment tags (PR #222 and related changes).
  • Lazy imports: training-related libraries are only imported when accessed.
  • Verification: more robust default math verification (PR #213).
  • RL support: enable base-model RL with message_type="completions" (PR #201), plus Prime-RL integration and docs (PR #204) and GRPO trainer updates (PR #217, #218).
  • Reporting & endpoints: template/report tweaks and endpoint path loading improvements (PR #206, PR #203, plus follow-ups).
  • CLI/UX: make rich a default dependency for the eval script (PR #200); eval output refinements.
  • Fixes: hotfix for sampling args for gpt-5.
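
As a rough sketch of what intercepting tool calls can look like with the new stateful tool environment, the subclass below routes a per-rollout counter into every call. The update_tool_args hook and its signature are assumptions based on the release description; consult the verifiers source for the exact extension point:

```python
import verifiers as vf


class BudgetedToolEnv(vf.StatefulToolEnv):
    """Hypothetical subclass that tracks tool usage in rollout state."""

    def update_tool_args(self, tool_args: dict, messages, state: dict, **kwargs) -> dict:
        # Count tool calls in the shared state dict and route the counter
        # into each call, e.g. to enforce a budget inside the tool.
        state["tool_calls"] = state.get("tool_calls", 0) + 1
        tool_args["call_index"] = state["tool_calls"]
        return tool_args
```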

Changes by Area

CLI and Scripts

  • vf-eval
    • Add rich as a default dependency to improve output readability (PR #200).
    • Refine eval outputs and result handling (PR #223 and related commits).
  • Hotfixes
    • Update sampling args for gpt-5 (hotfix commit).

Environments and Examples

  • Add a stateful tool environment; load tool information via environment responses (PR #224).
  • Rename and consolidate environments, introduce tag metadata for discoverability (PR #222; additional env tag updates).
  • Math environment updates and prompt tweaks.
  • Remove dead processing code in environment.py; general cleanup and type hint improvements.

Parsers, Rubrics, and Utils

  • Caching improvements for JudgeRubric to reduce redundant work (PR #216).
  • More robust rule-based math verification and heuristics (PR #213).
  • General type-hint and internal cleanup passes.

Training

  • Document Prime-RL training (PR #204).
  • Minor updates to GRPO trainer (PR #217, #218).
  • Add support for base-model RL flows via message_type="completions" (PR #201).

Reporting and Tooling

  • Report generation and template tweaks (PR #206, PR #203).
  • Improve endpoint path loading and related tooling.

Documentation

  • Minor README and docs updates across environments and training utilities, plus additional guidance for reporting.

Upgrade Notes

  • Environment renames/tags: if you reference environment names or use tags in tooling or scripts, review the updated names and tag metadata (PR #222).

Reference Commits (since v0.1.2.post0)

  • adding stateful toolenv, moving tool json loading to env_response (PR #224)
  • Will/eval outputs (PR #223)
  • Update grpo_trainer.py (PR #217, PR #218)
  • hotfix for gpt-5 sampling args
  • Will/rename envs (PR #222)
  • Will/judgerubric caching (PR #216)
  • More robust rule-based math verification (PR #213)
  • Report tweaks and endpoints path loading (PR #206 and follow-ups)
  • Integrate and document prime-rl training (PR #204)
  • Update report generation and vf-init template (PR #203)
  • Add support for base model RL / message_type="completions" (PR #201)
  • Add rich as default dependency for eval script (PR #200)
  • Math env updates, prompt tweaks, type hints, and cleanup in environment.py

Full Changelog

v0.1.2.post0

09 Aug 00:27

Verifiers v0.1.2.post0 – Release Notes

Minor post-release update focusing on polish: CLI script bug fixes and enhancements, environment example cleanup, better reporting, and improved test coverage.

Highlights

  • vf-eval: fixed rollout indexing bugs and improved reliability when sampling multiple rollouts.
  • vf-init: streamlined project initialization and naming (removed automatic vf- prefix) and refreshed templates.
  • Environments: documentation and prompt cleanups; added/updated AIME examples; improved report embedding.
  • Tests: expanded coverage across rubric behavior, XML parser, and environment edge cases.

Changes by Area

CLI and Scripts

  • vf-eval
    • Fix index handling when using multiple rollouts (PR #197).
    • Ensure metrics columns are included in generated datasets via supporting utilities (PR #194).
  • vf-init
    • Remove automatic vf- prefix during init to honor provided names (PR #190).
    • Update README template/content for new environments (multiple small tweaks).

Environments and Examples

  • AIME 2024 / AIME 2025 updates (PR #199).
  • Math Python example: prompt/readme/report cleanups.
  • General environment cleanup and README refreshes across multiple examples.
  • HotpotQA example: troubleshooting notes and minor fixes.

Parsers, Rubrics, and Utils

  • XMLParser: fix handling of string completions during parse_answer (PR #196).
  • Rubric: ensure error-handling behavior is well-covered by tests (PR #195).
  • Reporting: improvements to report generation/embedding (report_utils).
  • Dataset helpers: include metrics columns in outputs where expected (PR #194).

Tests

  • Increase test coverage for:
    • Rubric error handling (PR #195).
    • XML parser behavior (new tests).
    • Environment edge cases and extra scenarios.

Acknowledgements

Thank you to everyone who contributed to this minor release. If we missed anyone, thank you as well; your contributions are appreciated.

Upgrade Notes

  • No breaking API changes.
  • When initializing a new environment with vf-init, note the name is now used verbatim (no automatic vf- prefix, PR #190).

Reference Commits (since v0.1.2)

  • Fix XMLParser string completion parsing (PR #196)
  • Improve test coverage for Rubric error handling (PR #195)
  • Include metrics columns in dataset outputs (PR #194)
  • Fix vf-eval rollout index handling (PR #197)
  • Remove automatic vf- prefix from init (PR #190)
  • AIME 2024 / 2025 environments updates (PR #199)
  • Environment README/reporting cleanups and misc improvements

Full Changelog

v0.1.2

31 Jul 02:34

What's changed

With the v0.1.2 release, verifiers is significantly more production-ready and stable to build and train with. We appreciate everyone's patience with the changes and bug fixes so far as we've addressed a number of long-standing requests, and we're excited to see what you all build with it!

Highlights:

  • Proper encapsulation of Environments as standalone modules (see environments/), which can declare their own dependencies in a pyproject.toml and need only expose a load_environment(...) -> vf.Environment function in order to be trainable (a minimal sketch follows this list).
  • Script flows for initializing (vf-init), installing (vf-install), and evaluating (vf-eval) Environments before training.
  • Reorganization of examples and training scripts, removing lots of duplicated logic and creating a cleaner separation between library code and example code.
  • Deprecation of the manual dynamically-batched LLM inference worker in favor of proper AsyncLLM support, allowing full control of native vLLM sampling parameters.
  • Support for native tool call parsing + parallel tool calls in ToolEnv (replacing the manual XMLParser approach).
  • Another trainer! Environments built with verifiers are now trainable with prime-rl (as of 58ac91f for v0.1.2). prime-rl supports multi-node FSDP async training, is the primary RL framework used by the Prime Intellect research team, and is under ongoing development and stress-testing ahead of large-scale multi-environment training runs.
  • Pydantic types for core data classes used by Environments.
  • Improvements to GRPOTrainer, including a single max_seq_len option (instead of separate prompt and completion lengths) and configurable turn-length limits via max_tokens.
  • Many more Environment examples.
  • Improved logging and evaluation options.
  • Overhauled README.md and docs.
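
To illustrate the module contract, here is a minimal environment sketch. The dataset contents, column names, and reward logic are illustrative assumptions; only the load_environment entry point is the documented requirement:

```python
# environments/my_env/my_env.py
import verifiers as vf
from datasets import Dataset


def load_environment(**kwargs) -> vf.Environment:
    # Tiny inline dataset; real environments typically load or generate one.
    dataset = Dataset.from_list([
        {"question": "What is 2 + 2?", "answer": "4"},
    ])

    def correct_answer(completion, answer, **_) -> float:
        # Hypothetical reward: 1.0 when the reference answer appears verbatim.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        return 1.0 if answer in text else 0.0

    rubric = vf.Rubric(funcs=[correct_answer])
    return vf.SingleTurnEnv(dataset=dataset, rubric=rubric, **kwargs)
```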