
willccbb (Owner)

## Description
This PR introduces the τ²-bench (tau2-bench) environment to the Verifiers framework, aiming for a verbatim translation of its logic, including dual-control (agent and user) tool execution and its strict evaluation criteria.

The primary goal is to accurately mirror the original benchmark's behavior for scoring, prompts, and data flow, ensuring correctness rather than mere similarity.

## Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

## Testing

  • All existing tests pass (implied by vf-eval runs without framework errors)
  • New tests have been added to cover the changes (implicit through vf-eval runs)
  • Tests have been run locally with python -m pytest tests/ (implicit through vf-eval runs)

### Test Coverage

  • Current coverage: N/A
  • Coverage after changes: N/A

## Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

## Additional Notes
This implementation strives for exact parity with the original τ²-bench logic. Key aspects addressed include:

  • Dual-Control Environment: Both agent and user can execute tools, with state properly managed per-rollout.
  • Strict Evaluation Logic: The environment now uses τ²-bench's native evaluation functions, including action matching, environment state comparison, and communication checks, returning binary pass/fail scores.
  • Message Format Compatibility: Resolved subtle issues with tool_calls (which must be None, not an empty list, when no calls are made) and with precise error message formatting to match τ²-bench's expectations (see the sketch after this list).
  • Tool Description Provisioning: Tools are provided to the agent in OpenAI function calling format, mirroring the official benchmark's approach.
  • Robust Data Handling: A global change in verifiers/envs/environment.py now automatically deserializes oai_tools if stored as a JSON string in the dataset's info column, preventing HuggingFace Dataset schema inference issues for nested structures and avoiding environment-specific workarounds (see the sketch after this list).
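
A minimal sketch of the two fixes above, with a hypothetical helper name (the actual change lives in verifiers/envs/environment.py):

```python
import json

def ensure_oai_tools_deserialized(info: dict) -> dict:
    """Hypothetical helper: if the dataset stored `oai_tools` as a JSON
    string (to sidestep HuggingFace Dataset schema inference on nested
    structures), decode it back into a list of tool dicts."""
    oai_tools = info.get("oai_tools")
    if isinstance(oai_tools, str):
        info["oai_tools"] = json.loads(oai_tools)
    return info

# τ²-bench expects tool_calls to be None (not an empty list) on
# assistant messages that make no tool calls:
assistant_msg = {"role": "assistant", "content": "Done.", "tool_calls": None}
```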

@zachariejmartin left a comment

Have you thought about how to integrate training? I can put a PR up; there may be some useful ideas in it. It's based on τ¹ though.

```python
try:
    # Create a SimulationRun object from our state and messages
    termination_reason = state.get("termination_reason", "")
    if termination_reason == "too_many_errors":
```


Is this supposed to be max_errors?

@willccbb (Owner, Author)

> Have you thought about how to integrate training? I can put a PR up; there may be some useful ideas in it. It's based on τ¹ though.

I've only tested for inference, but anything that follows the vf.Environment spec should automatically be trainable with GRPOTrainer (success will vary depending on base model + config parameters of course).
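
A rough sketch of that path, following the usual verifiers training recipe (model name and run name are placeholders):

```python
import verifiers as vf

# Load the environment; anything implementing vf.Environment works here.
env = vf.load_environment("tau2-bench")

# Placeholder model; success will vary with base model + config, as noted above.
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-7B-Instruct")

args = vf.grpo_defaults(run_name="tau2-bench-grpo")
trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
)
trainer.train()
```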

@zachariejmartin

> Have you thought about how to integrate training? I can put a PR up; there may be some useful ideas in it. It's based on τ¹ though.

> I've only tested for inference, but anything that follows the vf.Environment spec should automatically be trainable with GRPOTrainer (success will vary depending on base model + config parameters of course).

Since we are passing tools= to the chat completion endpoint, don't we rely on vLLM to do parsing now in a way that differs from the other envs?

@willccbb (Owner, Author)

@zachariejmartin sorry for delay -- ToolEnv was updated to use built-in tool parsing by default; the manual XML parsing was always a bit of a hack, and it didn't support some desirable features like system prompt auto-formatting via chat templates and parallel tool calling.

this has been on the backburner a bit, but revisiting this week
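
For reference, the built-in flow just passes an OpenAI-format tool list via tools= and lets the inference server (e.g. vLLM with tool parsing enabled) return structured tool_calls. A minimal illustration with placeholder endpoint, model, and tool schema:

```python
from openai import OpenAI

# Placeholder endpoint for a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool in OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    messages=[{"role": "user", "content": "Where is order 12345?"}],
    tools=tools,
)
# The server's parser returns structured calls (or None if no tool is used).
print(response.choices[0].message.tool_calls)
```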

@zachariejmartin

@willccbb makes sense. anything I can help with?

@willccbb marked this pull request as draft on August 22, 2025 06:18