Draft
91 commits
e1ab736
Add τ-bench research and implementation docs for agent benchmarks
cursoragent Jul 27, 2025
aa45814
Add documentation for τ²-bench migration challenges and practical guide
cursoragent Jul 27, 2025
a871eef
Add full τ²-bench implementation for verifiers framework
cursoragent Jul 27, 2025
fa4f4d1
Refactor tau2-bench to centralize logic in env_response
cursoragent Jul 27, 2025
8405d23
Checkpoint before follow-up message
cursoragent Jul 27, 2025
35dcc1b
Checkpoint before follow-up message
cursoragent Jul 27, 2025
c3944fa
Checkpoint before follow-up message
cursoragent Jul 27, 2025
73d3d10
Simplify Tau2BenchEnv initialization and remove unnecessary kwargs
cursoragent Jul 27, 2025
7cced23
Enhance tau2_bench env with error tracking and improved task evaluation
cursoragent Jul 27, 2025
03cb3ae
Checkpoint before follow-up message
cursoragent Jul 27, 2025
5c6dd0d
Refactor Tau2BenchEnv to improve user simulation and message handling
cursoragent Jul 27, 2025
ab2ec0b
Fix τ²-bench message format and add implementation documentation
cursoragent Jul 27, 2025
b130419
Checkpoint before follow-up message
cursoragent Jul 27, 2025
eb3433f
Checkpoint before follow-up message
cursoragent Jul 27, 2025
7b30c2f
Checkpoint before follow-up message
cursoragent Jul 27, 2025
d9a76a7
Checkpoint before follow-up message
cursoragent Jul 27, 2025
bdc7608
Update tau2_bench installation command in README
cursoragent Jul 27, 2025
e80a0be
Remove tau-bench subprojects
cursoragent Jul 27, 2025
3b5f505
Add automatic tau2-bench data setup when environment is loaded
cursoragent Jul 27, 2025
c2a94a7
Refactor tau2 benchmark env with exact evaluation logic and task conv…
cursoragent Jul 27, 2025
fb8ec65
Checkpoint before follow-up message
cursoragent Jul 27, 2025
840b8b6
Improve tool handling and JSON serialization in Tau2BenchEnv
cursoragent Jul 27, 2025
4458ea3
Fix oai_tools serialization in Tau2BenchEnv dataset handling
cursoragent Jul 27, 2025
decbd6d
Fix HuggingFace Dataset schema issues with oai_tools serialization
cursoragent Jul 27, 2025
f015b9d
Fix tool call handling to prevent premature user response generation
cursoragent Jul 27, 2025
aff1fbb
Add PROJECT_STATUS.md for tau2-bench implementation details
cursoragent Jul 27, 2025
f763e60
Checkpoint before follow-up message
cursoragent Jul 27, 2025
013e978
Checkpoint before follow-up message
cursoragent Jul 27, 2025
c7fb5df
Fix tau2-bench tool execution and error handling
cursoragent Jul 27, 2025
215a00e
Checkpoint before follow-up message
cursoragent Jul 27, 2025
a2770d4
Refactor tau2 environment handling in Tau2BenchEnv to improve state m…
cursoragent Jul 27, 2025
d1eddae
Checkpoint before follow-up message
cursoragent Jul 27, 2025
0b65f88
Checkpoint before follow-up message
cursoragent Jul 27, 2025
c575b2a
Fix import handling and debug tau2 import errors
cursoragent Jul 27, 2025
b23f48a
Improve error handling and tool message formatting in Tau2BenchEnv
cursoragent Jul 27, 2025
edfee2c
Fix tool execution flow by removing redundant else blocks and adding …
cursoragent Jul 27, 2025
3b52a08
Checkpoint before follow-up message
cursoragent Jul 27, 2025
854ef36
Add tool call/message tracking and validation in tau2_bench
cursoragent Jul 27, 2025
ebe317b
Checkpoint before follow-up message
cursoragent Jul 27, 2025
359cd4a
Refactor tau2-bench environment to use native tau2 functions
cursoragent Jul 27, 2025
a4bbc89
Remove PROJECT_STATUS.md and update README with current implementatio…
cursoragent Jul 27, 2025
ed45440
remove log
willccbb Jul 27, 2025
5d8efd8
Checkpoint before follow-up message
cursoragent Jul 28, 2025
857e838
Refactor tau2_bench environment with improved tool execution and solo…
cursoragent Jul 28, 2025
c92e708
Checkpoint before follow-up message
cursoragent Jul 28, 2025
3e9a217
Update project metadata and dependencies in pyproject.toml
cursoragent Jul 28, 2025
f6c022f
Remove tau2-bench subproject
cursoragent Jul 28, 2025
8dce47e
Remove tau2_bench environment initialization module
cursoragent Jul 28, 2025
5052d16
Checkpoint before follow-up message
cursoragent Jul 28, 2025
e5e7254
Checkpoint before follow-up message
cursoragent Jul 28, 2025
20366e2
Refactor tau2 tool execution to use get_response method directly
cursoragent Jul 28, 2025
070d25b
Refactor Tau2BenchEnv state initialization with improved environment …
cursoragent Jul 28, 2025
edf5f99
Refactor database tracking with hash-based state monitoring
cursoragent Jul 28, 2025
fcdaddb
Checkpoint before follow-up message
cursoragent Jul 28, 2025
58550ad
Preserve tool call details when converting tau2 messages to verifiers…
cursoragent Jul 28, 2025
6b50427
Refactor Tau2BenchEnv to dynamically load environment config by domain
cursoragent Jul 28, 2025
71079e6
Checkpoint before follow-up message
cursoragent Jul 28, 2025
83e7eba
Checkpoint before follow-up message
cursoragent Jul 28, 2025
ad4a0df
Checkpoint before follow-up message
cursoragent Jul 28, 2025
da9300e
Refactor conversation termination logic using tau2's methods
cursoragent Jul 28, 2025
cb44b57
Simplify tau2_bench special response handling and rubric evaluation
cursoragent Jul 28, 2025
8efe255
Remove unused imports from tau2_bench.py
cursoragent Jul 28, 2025
f4c6351
Refactor tau2 message conversion for clearer, more robust simulation …
cursoragent Jul 28, 2025
95d7b14
Checkpoint before follow-up message
cursoragent Jul 28, 2025
d13d848
Simplify tool call parsing in tau2 message conversion
cursoragent Jul 28, 2025
c55d11a
Improve tool call parsing for multiple input formats in Tau2BenchEnv
cursoragent Jul 28, 2025
fd4764e
Refactor tool call handling and simplify comments in Tau2BenchEnv
cursoragent Jul 28, 2025
eed452a
Checkpoint before follow-up message
cursoragent Jul 28, 2025
6f02d48
Checkpoint before follow-up message
cursoragent Jul 28, 2025
bf65fe2
Update Tau2BenchEnv with tau2 default max_steps and max_errors
cursoragent Jul 28, 2025
39263bf
Remove unused max_turns parameter from Tau2BenchEnv
cursoragent Jul 28, 2025
00023e3
Add debug logging for tool calls, errors, and task processing
cursoragent Jul 28, 2025
b65b7b7
Add debug logging and improve tool message handling in Tau2BenchEnv
cursoragent Jul 28, 2025
b45885e
Improve scenario handling and add debug logging for tool calls
cursoragent Jul 28, 2025
17087b4
Add debug logging and detailed action matching for tau2 evaluation
cursoragent Jul 28, 2025
4b8a956
Checkpoint before follow-up message
cursoragent Jul 28, 2025
7b8b77a
Add debug logging for task initialization and tool calls
cursoragent Jul 28, 2025
efe42a7
Add debug logging for specific order during return/exchange operations
cursoragent Jul 28, 2025
65f3599
Add detailed debug logging for task initialization and order state
cursoragent Jul 28, 2025
40abfb1
Prevent premature termination when assistant has pending tool calls
cursoragent Jul 28, 2025
7e40a9f
Checkpoint before follow-up message
cursoragent Jul 28, 2025
2e77d97
Checkpoint before follow-up message
cursoragent Jul 28, 2025
aeed82d
Checkpoint before follow-up message
cursoragent Jul 28, 2025
e3a9b5e
Checkpoint before follow-up message
cursoragent Jul 28, 2025
a5a0a74
Checkpoint before follow-up message
cursoragent Jul 28, 2025
751d58a
Prevent premature termination during tool call execution in Tau2BenchEnv
cursoragent Jul 28, 2025
e2b146b
Update Tau2Bench README with comprehensive documentation and status
cursoragent Jul 28, 2025
986bdef
Checkpoint before follow-up message
cursoragent Jul 28, 2025
105051b
Refactor tau2_bench imports and remove unused environment variables
cursoragent Jul 28, 2025
fa6f8ae
Merge pull request #166 from willccbb/cursor/benchmark-review-and-imp…
willccbb Jul 28, 2025
09abecd
Merge branch 'main' into cursor/research-and-select-agent-llm-benchma…
willccbb Aug 11, 2025
87 changes: 87 additions & 0 deletions environments/tau2_bench/README.md
@@ -0,0 +1,87 @@
# Tau2Bench Environment

Implementation of [Tau2-Bench](https://github.com/sierra-research/tau2-bench) for the verifiers framework.

## Status

⚠️ **Work in Progress**: This implementation is actively being developed to achieve full parity with the original Tau2-Bench logic.

**Current Performance (Retail Domain)**:
- Model: gpt-4.1-mini
- Pass@1: 42.1%
- User model: gpt-4.1
- Note: The original benchmark reports >60% for gpt-4.1-mini, so this implementation has not yet reached parity


## Overview

Tau2Bench evaluates language models on complex, multi-turn conversations requiring tool use across three domains:
- **Retail**: E-commerce customer service scenarios
- **Airline**: Flight booking and management tasks
- **Telecom**: Telecommunications service support (includes user-side tool calls)

## Installation

```bash
vf-install tau2_bench
```

This will automatically install the tau2-bench package and its dependencies.

## Usage

```bash
# Evaluate on retail domain (default)
vf-eval tau2_bench -n 20

# Evaluate on specific domain
vf-eval tau2_bench -n 20 --env-args '{"domain": "airline"}'

# Evaluate with different user simulator model
vf-eval tau2_bench -n 20 --env-args '{"user_llm": "gpt-4"}'
```
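The `--env-args` value is a JSON object, so quoting it by hand in a shell is easy to get wrong. The sketch below builds the same command lines from Python. The helper name `build_eval_command` and the `KNOWN_DOMAINS` set are hypothetical and not part of this environment or of verifiers; only the `vf-eval tau2_bench -n ... --env-args ...` shape comes from the examples above.

```python
import json
import shlex

# Domains listed in this README; anything else is rejected early.
KNOWN_DOMAINS = {"retail", "airline", "telecom"}


def build_eval_command(num_examples: int, **env_args) -> str:
    """Return a vf-eval command line with env_args encoded as JSON.

    Hypothetical helper for illustration only.
    """
    domain = env_args.get("domain", "retail")
    if domain not in KNOWN_DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    parts = ["vf-eval", "tau2_bench", "-n", str(num_examples)]
    if env_args:
        # sort_keys keeps the generated command stable across runs.
        parts += ["--env-args", json.dumps(env_args, sort_keys=True)]
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return " ".join(shlex.quote(p) for p in parts)


print(build_eval_command(20, domain="airline"))
```

Validating the domain before launching avoids spending a run on a typo; whether the environment itself performs a similar check is not stated here.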

## Environment Arguments

- `domain`: Domain to evaluate on ("retail", "airline", "telecom"). Default: "retail"
- `user_llm`: Model to use for user simulation. Default: "gpt-4.1"
- `solo_mode`: Enable solo mode for telecom domain. Default: False
- `max_steps`: Maximum conversation steps. Default: 200
- `max_errors`: Maximum allowed errors before termination. Default: 10
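To illustrate how `max_steps` and `max_errors` could bound a rollout, here is a minimal sketch. The `RolloutLimits` class and `should_terminate` function are hypothetical names, and the inclusive comparison (`>=`) is an assumption rather than a statement about how tau2 or this environment counts steps.

```python
from dataclasses import dataclass


@dataclass
class RolloutLimits:
    """Hypothetical container for the two budgets documented above."""
    max_steps: int = 200   # default from the argument list
    max_errors: int = 10   # default from the argument list


def should_terminate(steps: int, errors: int, limits: RolloutLimits) -> bool:
    """Return True once either budget is exhausted (assumed inclusive)."""
    return steps >= limits.max_steps or errors >= limits.max_errors


limits = RolloutLimits()
print(should_terminate(steps=12, errors=0, limits=limits))
print(should_terminate(steps=12, errors=10, limits=limits))
```

Keeping the two counters separate means a rollout that keeps making failing tool calls can stop early without consuming the full step budget.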

## Implementation Details

This environment wraps the original tau2-bench implementation, providing:
- Native tau2 tool execution through `tau2_env.get_response()`
- Proper state management and database isolation between rollouts
- User simulation using tau2's UserSimulator
- Original tau2 evaluation metrics via `evaluate_simulation()`

The implementation aims to exactly mirror tau2's behavior while conforming to the verifiers framework patterns.
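Database isolation between rollouts can be pictured as giving every rollout its own deep copy of the initial data. The sketch below uses a plain dictionary as a stand-in for a domain database; `INITIAL_DB` and `new_rollout_state` are hypothetical and do not reflect the actual structure of tau2's environment objects.

```python
import copy

# Hypothetical stand-in for a domain's initial database.
INITIAL_DB = {"orders": {"#W001": {"status": "pending"}}}


def new_rollout_state(initial_db: dict) -> dict:
    """Give each rollout a private deep copy of the database."""
    return {"db": copy.deepcopy(initial_db), "steps": 0, "errors": 0}


state_a = new_rollout_state(INITIAL_DB)
state_b = new_rollout_state(INITIAL_DB)

# A write in rollout A must not be visible in rollout B or in the template.
state_a["db"]["orders"]["#W001"]["status"] = "cancelled"
print(state_b["db"]["orders"]["#W001"]["status"])
```

A shallow copy would share the nested order records between rollouts, which is why the sketch uses `copy.deepcopy`.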

## Evaluation

The environment uses tau2's native evaluation, which checks:
1. **Task Completion**: Whether the agent successfully completed the requested task
2. **Communication Score**: How well the agent's actions matched expected behavior
3. **Database State**: Whether the final database state matches expectations
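The database-state check can be thought of as comparing a fingerprint of the final database with a fingerprint of the expected one. The following is a sketch under the assumption that the database is JSON-serializable; `db_hash` and `db_state_matches` are hypothetical helpers, and tau2's own `evaluate_simulation()` may compare state differently.

```python
import hashlib
import json


def db_hash(db: dict) -> str:
    """Hash a JSON-serializable database independently of key order."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def db_state_matches(final_db: dict, expected_db: dict) -> bool:
    """Return True when both databases serialize to the same canonical form."""
    return db_hash(final_db) == db_hash(expected_db)


expected = {"orders": {"#W001": {"status": "cancelled"}}}
final = {"orders": {"#W001": {"status": "cancelled"}}}
print(db_state_matches(final, expected))
```

Sorting keys before hashing prevents two databases with identical contents but different insertion order from being reported as different.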

## Troubleshooting

If you encounter import errors, ensure tau2-bench is properly installed:
```bash
vf-install tau2_bench
```

For evaluation discrepancies, check that:
- The model has access to all required tools
- The system prompt includes domain-specific policies
- Tool responses are properly formatted

## Contributing

When improving this implementation:
1. Ensure all changes maintain compatibility with tau2's evaluation logic
2. Test against multiple domains and edge cases
3. Verify database state isolation between parallel rollouts
4. Keep tool execution logic aligned with tau2's original implementation
19 changes: 19 additions & 0 deletions environments/tau2_bench/pyproject.toml
@@ -0,0 +1,19 @@
[project]
name = "tau2-bench"
version = "0.1.0"
dependencies = [
"pydantic>=2.0.0",
"datasets>=2.0.0",
"verifiers>=0.1.2",
"tau2 @ git+https://github.com/sierra-research/tau2-bench.git",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build]
include = ["tau2_bench.py"]

[tool.hatch.metadata]
allow-direct-references = true