-
Notifications
You must be signed in to change notification settings - Fork 295
[DRAFT] Port Tau2-Bench eval to verifiers #163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[DRAFT] Port Tau2-Bench eval to verifiers #163
Conversation
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
…ersion Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
Co-authored-by: williambrown97 <[email protected]>
…rovement-with-gpt-4-1-mini-34c4 Benchmark review and improvement with gpt-4.1-mini
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought about how to integrate training? I can put a PR up, there may be some useful ideas in it. It's based on \tau^1 though.
try: | ||
# Create a SimulationRun object from our state and messages | ||
termination_reason = state.get("termination_reason", "") | ||
if termination_reason == "too_many_errors": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be max_errors?
I've only tested for inference, but anything that follows the |
Since we are passing tools= to the chat completion endpoint, don't we rely on vllm to do parsing now in a way that is different than the other envs? |
…rks-for-migration-e4c4
@zachariejmartin sorry for delay -- ToolEnv was updated to use built-in tool parsing by default, the manual XML parsing was always a bit of a hack, and didn't support some desirable features like system prompt auto-formatting via chat templates and parallel tool calling. this has been on backburner a bit, but revisiting this week |
@willccbb makes sense. anything I can help with? |
%23%23 Description
This PR introduces the
τ²-bench
(tau2-bench) environment to the Verifiers framework, aiming for a verbatim translation of its logic, including dual-control agent and user tool execution, and its strict evaluation criteria.The primary goal is to accurately mirror the original benchmark's behavior for scoring, prompts, and data flow, ensuring correctness rather than mere similarity.
%23%23 Type of Change
%23%23 Testing
vf-eval
runs without framework errors)vf-eval
runs)python -m pytest tests/
(implicit throughvf-eval
runs)%23%23%23 Test Coverage
%23%23 Checklist
%23%23 Additional Notes
This implementation strives for exact parity with the original τ²-bench logic. Key aspects addressed include:
tool_calls
(requiringNone
for no calls) and precise error message formatting to match τ²-bench's expectations.verifiers/envs/environment.py
now automatically deserializesoai_tools
if stored as a JSON string in the dataset'sinfo
column, preventing HuggingFace Dataset schema inference issues for nested structures. This avoids environment-specific