
willccbb (Owner)

## Description
This PR introduces the τ²-bench (tau2-bench) environment to the Verifiers framework, aiming for a verbatim translation of its logic, including dual-control (agent and user) tool execution and its strict evaluation criteria.

The primary goal is to accurately mirror the original benchmark's behavior for scoring, prompts, and data flow, ensuring correctness rather than mere similarity.

## Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

## Testing

  • All existing tests pass (implied by vf-eval runs without framework errors)
  • New tests have been added to cover the changes (implicit through vf-eval runs)
  • Tests have been run locally with python -m pytest tests/ (implicit through vf-eval runs)

### Test Coverage

  • Current coverage: N/A
  • Coverage after changes: N/A

## Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

## Additional Notes
This implementation strives for exact parity with the original τ²-bench logic. Key aspects addressed include:

  • Dual-Control Environment: Both agent and user can execute tools, with state properly managed per-rollout.
  • Strict Evaluation Logic: The environment now uses τ²-bench's native evaluation functions, including action matching, environment state comparison, and communication checks, returning binary pass/fail scores.
  • Message Format Compatibility: Resolved subtle issues with tool_calls (which must be None, not an empty list, when no calls are made) and with precise error message formatting to match τ²-bench's expectations (see the sketch after this list).
  • Tool Description Provisioning: Tools are provided to the agent in OpenAI function calling format, mirroring the official benchmark's approach.
  • Robust Data Handling: A global change in verifiers/envs/environment.py now automatically deserializes oai_tools if stored as a JSON string in the dataset's info column, preventing HuggingFace Dataset schema inference issues for nested structures and avoiding environment-specific workarounds (see the sketch after this list).
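
A minimal sketch of the two fixes above, with a hypothetical helper name (the actual change lives in verifiers/envs/environment.py):

```python
import json

def ensure_oai_tools_deserialized(info: dict) -> dict:
    """Hypothetical helper: if the dataset stored `oai_tools` as a JSON
    string (to sidestep HuggingFace Dataset schema inference on nested
    structures), decode it back into a list of tool dicts."""
    oai_tools = info.get("oai_tools")
    if isinstance(oai_tools, str):
        info["oai_tools"] = json.loads(oai_tools)
    return info

# τ²-bench expects tool_calls to be None (not an empty list) on
# assistant messages that make no tool calls:
assistant_msg = {"role": "assistant", "content": "Done.", "tool_calls": None}
```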

@zachariejmartin left a comment

Have you thought about how to integrate training? I can put a PR up; there may be some useful ideas in it. It's based on τ¹ though.

```python
try:
    # Create a SimulationRun object from our state and messages
    termination_reason = state.get("termination_reason", "")
    if termination_reason == "too_many_errors":
```


Is this supposed to be max_errors?

@willccbb (Owner, Author)

> Have you thought about how to integrate training? I can put a PR up; there may be some useful ideas in it. It's based on τ¹ though.

I've only tested for inference, but anything that follows the vf.Environment spec should automatically be trainable with GRPOTrainer (success will vary depending on base model + config parameters of course).
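
A rough sketch of that path, following the usual verifiers training recipe (model name and run name are placeholders):

```python
import verifiers as vf

# Load the environment; anything implementing vf.Environment works here.
env = vf.load_environment("tau2-bench")

# Placeholder model; success will vary with base model + config, as noted above.
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-7B-Instruct")

args = vf.grpo_defaults(run_name="tau2-bench-grpo")
trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
)
trainer.train()
```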

@zachariejmartin

> Have you thought about how to integrate training? I can put a PR up; there may be some useful ideas in it. It's based on τ¹ though.

> I've only tested for inference, but anything that follows the vf.Environment spec should automatically be trainable with GRPOTrainer (success will vary depending on base model + config parameters of course).

Since we are passing tools= to the chat completion endpoint, don't we rely on vLLM to do parsing now in a way that differs from the other envs?

@willccbb (Owner, Author)

@zachariejmartin sorry for delay -- ToolEnv was updated to use built-in tool parsing by default; the manual XML parsing was always a bit of a hack, and it didn't support some desirable features like system prompt auto-formatting via chat templates and parallel tool calling.

this has been on the backburner a bit, but revisiting this week
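
For reference, the built-in flow just passes an OpenAI-format tool list via tools= and lets the inference server (e.g. vLLM with tool parsing enabled) return structured tool_calls. A minimal illustration with placeholder endpoint, model, and tool schema:

```python
from openai import OpenAI

# Placeholder endpoint for a local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool in OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    messages=[{"role": "user", "content": "Where is order 12345?"}],
    tools=tools,
)
# The server's parser returns structured calls (or None if no tool is used).
print(response.choices[0].message.tool_calls)
```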

@zachariejmartin

@willccbb makes sense. anything I can help with?

@willccbb marked this pull request as draft on August 22, 2025 06:18