73 commits
bb00d91
add README
NicolasAG Jun 16, 2025
dc81770
increase env session inactivity timeout
NicolasAG Jun 17, 2025
e60d4c1
update readme
NicolasAG Jun 17, 2025
f9e45c2
move miniwob to domains/
NicolasAG Jun 18, 2025
8cdbd06
fix
NicolasAG Jul 7, 2025
5510982
fix path
NicolasAG Jul 7, 2025
07e858c
return RuntimeError instead of HTTPException because not pickable
NicolasAG Jul 7, 2025
5e56896
add env_call_timeout
NicolasAG Jul 8, 2025
c06b768
update gpu fractions
NicolasAG Jul 8, 2025
b1ad285
set kl coef to 0
NicolasAG Jul 8, 2025
6bbe977
Merge remote-tracking branch 'origin/main' into debug_miniwob
NicolasAG Jul 8, 2025
c8ac64d
update max seq len
NicolasAG Jul 8, 2025
b87a6d1
revert to json instead of tool use agent
NicolasAG Jul 9, 2025
824d841
update README
NicolasAG Jul 9, 2025
8d170ec
debug overflow counter
NicolasAG Jul 10, 2025
21a1b2a
fix prompts
NicolasAG Jul 10, 2025
05b6794
update readme
NicolasAG Jul 11, 2025
ef6b2b0
flag tape as invalid instead of raising http errors
NicolasAG Jul 21, 2025
0abc2b0
use redis
NicolasAG Jul 21, 2025
d3f6889
track task names instead of data splits
NicolasAG Jul 21, 2025
9c319e3
fix
NicolasAG Jul 21, 2025
92c8a93
remove unused var in new tapeagent remote_env
NicolasAG Jul 22, 2025
edf4d00
use BaseMetrics
NicolasAG Jul 23, 2025
28749e0
fix
NicolasAG Jul 23, 2025
a4f9f79
keep track of time taken
NicolasAG Jul 23, 2025
8a6120f
send per step times to wandb
ollmer Jul 24, 2025
d1d1836
Merge remote-tracking branch 'origin/main' into debug_miniwob
NicolasAG Jul 25, 2025
5eb3a4e
use all miniwob tasks
NicolasAG Jul 25, 2025
75d3c9c
default save checkpoints
NicolasAG Jul 28, 2025
6b97c7b
update vllm max tokens
NicolasAG Jul 28, 2025
d3cf30b
assert group size is as expected
NicolasAG Jul 28, 2025
4c50f1f
assert finetuning length is as much as vllm max length
NicolasAG Jul 28, 2025
ff61d73
update finetuning & vllm max lengths
NicolasAG Jul 28, 2025
a00e6e6
debug agent
NicolasAG Jul 28, 2025
6f149c8
use ppo & upd config
NicolasAG Aug 8, 2025
2ae2dd8
update readme
NicolasAG Aug 8, 2025
913c8e2
stop training after 1k steps
NicolasAG Aug 11, 2025
402eeb2
scale up env servers by llm_servers
NicolasAG Aug 20, 2025
58f31cc
reweight actor/trainer
NicolasAG Aug 20, 2025
4101d77
add massimo miniwob split
NicolasAG Aug 20, 2025
b00e476
cleanup
NicolasAG Aug 20, 2025
0b56125
update agent reflection node
NicolasAG Aug 21, 2025
9b0a74c
towards massimo setup
NicolasAG Aug 22, 2025
e6e735d
Merge remote-tracking branch 'origin/main' into debug_miniwob
NicolasAG Aug 22, 2025
ef46f39
upd configs
NicolasAG Aug 28, 2025
1274748
upd
NicolasAG Aug 28, 2025
b16d45c
revert reward calculation
NicolasAG Aug 28, 2025
9e61c35
update massimo cfg to grpo
NicolasAG Aug 28, 2025
ef884f2
test with ppo
NicolasAG Aug 28, 2025
537ec7a
update configs
NicolasAG Sep 2, 2025
7a4e73f
add retry mechanism for agent loop
NicolasAG Sep 2, 2025
42e811e
add 30min timeout to rollout function
NicolasAG Sep 3, 2025
a4e8f5f
upd configs
NicolasAG Sep 5, 2025
95b735b
upd
NicolasAG Sep 5, 2025
8616303
upd configs
NicolasAG Sep 5, 2025
923cf6a
reduce n_env
NicolasAG Sep 6, 2025
44a033f
boost preprocess power
NicolasAG Sep 6, 2025
2918d1f
pop old data
NicolasAG Sep 6, 2025
dacaa1f
do not save playwright traces & screenshots
NicolasAG Sep 7, 2025
fcee5ee
return empty aggregate stats if empty stats
NicolasAG Sep 7, 2025
631389f
increase preprocessor power
NicolasAG Sep 7, 2025
f791211
better error handling
NicolasAG Sep 8, 2025
c54d900
fix
NicolasAG Sep 8, 2025
ea4918a
reduce timeouts
NicolasAG Sep 9, 2025
e5fca10
log number of groups done so far
NicolasAG Sep 12, 2025
df66a88
log everything if populate_rl_data fails
NicolasAG Sep 12, 2025
c8d0171
monitor env servers and reset if needed
NicolasAG Sep 12, 2025
981cd85
better health message
NicolasAG Sep 12, 2025
9c755ed
small fix
NicolasAG Sep 13, 2025
0b8a24d
better logs
NicolasAG Sep 26, 2025
cd27e30
always check the worker before launching the agent on it + more detai…
NicolasAG Sep 26, 2025
f9ce99e
log stack trace
NicolasAG Sep 29, 2025
60fb042
small cleanup
NicolasAG Sep 29, 2025
6 changes: 4 additions & 2 deletions conf/base.yaml
@@ -47,7 +47,7 @@ llm:
temperature: 1.0
test_llm:
parameters:
max_tokens: 16000
max_tokens: 8192
temperature: 1.0
top_p: 0.95
top_k: 50
@@ -67,6 +67,7 @@ vllm_config:
tensor-parallel-size: 1
pipeline-parallel-size: 1
generation-config: vllm
max_model_len: 10000

world:
replicas: 1
Expand All @@ -75,7 +76,8 @@ world:
preprocessor_fraction: 0
finetune_fraction: 4

env_replicas: 2
# Number of environment servers per actor VLLM server
env_replicas_per_actor: 1

actor_group_port: 9000
environment_start_port: 7777
115 changes: 70 additions & 45 deletions conf/miniwob.yaml
@@ -1,34 +1,32 @@
defaults:
- base
- override streams: redis
- override finetune: ppo
- _self_

world:
actor_fraction: 4
preprocessor_fraction: 1
finetune_fraction: 3
actor_fraction: 2
preprocessor_fraction: 0
finetune_fraction: 6

# debug:
# mode: actor
save_tapes: False

output_dir: results/miniwob_debug/${now:%Y-%m-%d}/${now:%H-%M-%S}
output_dir: results/miniwob/${now:%Y-%m-%d}/${now:%H-%M-%S}
model_path: meta-llama/Llama-3.1-8B-Instruct

finetune:
save_checkpoint_steps: 10
seq_length: 4096
seq_length: 16384 # input + output tokens
max_train_steps: 1000 # 1000 optim steps = 1000 * bs samples
train_batch_size: 1
gradient_accumulation_passes: 1024
learning_rate: 1e-6
optim: adamw_torch
rl:
kl_coef: 0.01 # GRPO beta coefficient
reward_minus_kl_coef: 0.0 # RLOO beta coefficient
use_advantages: true
algo: grpo

eval_every_n_versions: 10240 # 1024 effective bs * 10 "optim steps"

llm:
parameters:
max_tokens: 3072
max_tokens: 4096 # output tokens
temperature: 1.0
test_llm:
parameters:
@@ -39,24 +37,37 @@ test_llm:

vllm_config:
vllm_kwargs:
enable-auto-tool-choice: ""
tool-call-parser: llama3_json # use hermes for qwen
chat_template: pipelinerl/miniwob/tool_chat_template_llama3.1_json.jinja # copy pasted from https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.1_json.jinja
enforce-eager: "" # speed the actor llm startup a bit
max_model_len: 16384 # input + output tokens

actor:
rollout_policy: pipelinerl.miniwob.rollouts.generate_miniwob_rollout
rollout_policy: pipelinerl.domains.miniwob.rollouts.generate_miniwob_rollout
shared_memory_entry_size: 100000000
llm_max_rollouts: 32

preprocess:
shared_memory_entry_size: 1000000000
n_workers: 32 # Increase from 8
chunk_n_groups: 8 # Increase from 2 for better throughput
# queue for loaded raw groups
raw_queue_size: 32 # Increase from 8
# queue for processed chunks of multiple groups
input_queue_size: 64 # Increase from 32
# queue for ready chunks for multiple groups
output_queue_size: 64 # Increase from 32
# ring buffer to replace old samples with new ones when training is slow
ring_buffer_size: 1024 # Increase from 128
# "virtual" sample queue per lead trainer
max_ready_samples_per_lead: 256 # Increase from 64
shared_memory_entry_size: 1000000000 # Increase from 100M

# AGENT CONFIGURATION
agent_max_loops: 10 # max number of agent-environment interactions for each task
agent_attempts: 3 # number of attempts to run the agent (retry on errors)
rollout_timeout: 600 # overall timeout for entire rollout in seconds (10 minutes)
reward_computation: nico
agent:
_target_: tapeagents.agent.Agent
name: web_agent
max_iterations: 4 # max number of iterations (make_prompt + llm? + generate_steps) for each loop
max_iterations: 4 # max number of iterations (make_prompt + llm + generate_steps) for each loop
store_llm_calls: true
templates:
system_prompt: |
@@ -65,50 +76,64 @@ agent:
Keep your replies concise and direct. Prioritize clarity and avoid over-elaboration.
You will be provided with the content of the current page and a task from the user.
Do not express your emotions or opinions about the user question.
allowed_tools: |
You have access to the following tools:
{tools_description}
thought_format: |
Important! Respond with the plain text, do not include any JSON or code.
Do not output anything besides what I asked in this message.
allowed_steps: |
You are allowed to produce ONLY steps with the following json schemas:
{allowed_steps}
Do not reproduce schema when producing the steps, use it as a reference.
json_format: |
Important! Respond with very simple parsable JSON!
Do not use any special characters or code. Do not use new lines, tabs, or any other formatting inside the JSON.
Do not output anything besides one simple JSON object.
nodes:
- _target_: examples.rl_webagent.agent.WebNode
name: set_goal
system_prompt: ${agent.templates.system_prompt}
guidance: |
Produce the thought that describes the intended solution to the task. In the reasoning lines:
Produce the reasoning_thought step that describes the intended solution to the task. In the reasoning lines:
- review the instructions from the user and the content of the page.
- outline the main task to be accomplished and the steps to be taken to achieve it.
- produce a definition of done that will be checked later to verify that the task was completed.
${agent.templates.thought_format}
steps_prompt: ${agent.templates.allowed_tools}
Produce only one reasoning_thought step!
${agent.templates.json_format}
steps_prompt: ${agent.templates.allowed_steps}
steps:
- tapeagents.steps.ReasoningThought
trim_obs_except_last_n: 3 # keep the last 3 observations from the tape in prompt messages
max_chars_page_observation: 3000 # keep up to 3000 chars in PageObservation steps
- _target_: examples.rl_webagent.agent.WebNode
name: reflect
system_prompt: ${agent.templates.system_prompt}
guidance: |
Review the current state of the page and previous steps to find the best possible next action to accomplish the task.
Produce the reflection_thought to describe the current page state, reflect on your last action, describe what is left to do, and what will be the immediate next action.
Produce only one reflection_thought step!
${agent.templates.thought_format}
steps_prompt: ${agent.templates.allowed_tools}
Produce the reasoning_thought step that describes the current state of the page, the previous actions, and what should be the next best action to accomplish the task. In the reasoning lines:
- think about which information could be relevant to the given task, note relevant BIDs and coordinates.
- describe the last action taken and its expected effects on the page versus the actual effects you observe. Are they the same? If not, what could have gone wrong?
- check whether you are stuck repeating the same action over and over; if so, try something else and change the action.
- check whether you think the task is done; if not, give a detailed list of the next actions to accomplish the task.
- finally, if the task is not done, describe the immediate next action to be performed and its expected effect on the page.
Produce only one reasoning_thought step! Be brief and to the point. You can skip some details if they are not relevant for this step.
${agent.templates.json_format}
steps_prompt: ${agent.templates.allowed_steps}
steps:
- tapeagents.steps.ReasoningThought
trim_obs_except_last_n: 3 # keep the last 3 observations from the tape in prompt messages
max_chars_page_observation: 3000 # keep up to 3000 chars in PageObservation steps
- _target_: examples.rl_webagent.agent.WebNode
name: act
system_prompt: ${agent.templates.system_prompt}
guidance: |
Produce the single next tool call to be performed with the current page.
If you think that the task is solved, call the FinalAnswer.
Produce the next action to be performed with the current page.
If you think that the task is solved, produce the final_answer_action.
You can interact with the page elements using their BIDs or coordinates as arguments for actions.
HINTS:
- You can use the BIDs of the elements or the mouse position in x, y coordinates to interact with them.
- To select value in a dropdown or combobox, ALWAYS use SelectOption tool.
- To select value in a dropdown or combobox, ALWAYS use select_action.
- To click on a checkbox or radio button, ALWAYS use BID (or coordinates) of the corresponding Text and not the BID (or coordinates) of the element itself.
- Press the Enter key to submit the search query.
- Always produce only one step at a time.
- Step kind is always lowercase and underscore separated.
${agent.templates.json_format}
steps_prompt: ${agent.templates.allowed_steps}
use_known_actions: true
use_function_calls: true
steps:
- examples.rl_webagent.steps.FinalAnswerAction
trim_obs_except_last_n: 3 # keep the last 3 observations from the tape in prompt messages
@@ -119,18 +144,18 @@ agent:
# ENVIRONMENT CONFIGURATION
start_attempts: 3 # number of attempts to start each task
environment:
_target_: pipelinerl.miniwob.environment_server.WebEnvironmentServer
miniwob_url: file:///home/toolkit/miniwob-plusplus/miniwob/html/miniwob/
n_envs: 64
_target_: pipelinerl.domains.miniwob.environment_server.WebEnvironmentServer
miniwob_url: ???
n_envs: 32
host: "0.0.0.0"
max_session_inactivity_secs: 300
env_call_timeout: 60 # timeout for each environment call (e.g. start_task, act, etc.)
web_env_target: examples.rl_webagent.environment.WebEnvironment
exp_path: ${output_dir}/env_server
exp_path: null
headless: true
observation_format: html

# DATASET CONFIGURATION
dataset_loader: pipelinerl.miniwob.load_tasks.load_tasks
dataset_loader: pipelinerl.domains.miniwob.load_tasks.load_tasks
dataset_loader_params:
train_split: 0.6 # 0.6 of tasks for training, 0.4 for testing
seeds: [0, 42, 1337, 900, 103]
10 changes: 10 additions & 0 deletions conf/miniwob_grpo.yaml
@@ -0,0 +1,10 @@
defaults:
- miniwob
- override finetune: grpo
- _self_

finetune:
seq_length: 16384 # input + output tokens
max_train_steps: 1000 # 1000 optim steps = 1000 * bs samples
train_batch_size: 1
gradient_accumulation_passes: 1024
15 changes: 15 additions & 0 deletions conf/miniwob_massimo_grpo.yaml
@@ -0,0 +1,15 @@
defaults:
- miniwob_grpo
- _self_

train_dataset_names:
- massimo_train
test_dataset_names:
- massimo_test

reward_computation: massimo

finetune:
gradient_accumulation_passes: 512

eval_every_n_versions: 5120 # 512 effective bs * 10 "optim steps"
15 changes: 15 additions & 0 deletions conf/miniwob_massimo_ppo.yaml
@@ -0,0 +1,15 @@
defaults:
- miniwob
- _self_

train_dataset_names:
- massimo_train
test_dataset_names:
- massimo_test

reward_computation: massimo

finetune:
gradient_accumulation_passes: 512

eval_every_n_versions: 5120 # 512 effective bs * 10 "optim steps"
5 changes: 4 additions & 1 deletion pipelinerl/actor.py
@@ -196,6 +196,7 @@ async def rollout_and_maybe_produce_result(
f"groups in progress: {len(group_rollouts)}, "
f"rollouts started so far: {started_rollouts}, "
f"rollouts finished so far: {finished_rollouts}, "
f"groups started so far: {group_id}, "
f"max group size in bytes: {result_queue.max_actual_entry_size()}, "
)
last_logged = time.time()
@@ -463,6 +464,9 @@ def run(self, dataset: list[tuple[str, dict]]):

assert isinstance(rollout_results, list)
assert isinstance(rollout_results[0], RolloutResult)
assert len(rollout_results) == attempts, (
f"Expected {attempts} rollouts, got {len(rollout_results)}"
)
group_samples = sum(len(r.training_texts) for r in rollout_results)

published_samples += group_samples
Expand All @@ -479,7 +483,6 @@ def run(self, dataset: list[tuple[str, dict]]):
f" {in_progress} groups in progress"
)


self.update_stats(rollout_results=rollout_results)

finished_groups += 1
34 changes: 34 additions & 0 deletions pipelinerl/domains/miniwob/README.md
@@ -0,0 +1,34 @@
# Miniwob example

## Prerequisites

### TapeAgents

Clone [TapeAgents](https://github.com/ServiceNow/TapeAgents/) into your parent folder and install it:
```bash
cd ..
git clone [email protected]:ServiceNow/TapeAgents.git
cd TapeAgents
pip install -e .
pip install 'tapeagents[finetune,converters]==0.1.12'
cd ../PipelineRL
```

Make sure to add the TapeAgents folder to your Python path:
```bash
export PYTHONPATH="/path/to/TapeAgents:$PYTHONPATH"
```

### Miniwob

See the setup instructions here: https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md

### Playwright

The environment server needs Playwright installed:

`playwright install`
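
If browser binaries or their system-level dependencies are missing on a fresh machine, it may help to install them together. This is a hedged suggestion, not part of the PR: `--with-deps` is a standard Playwright CLI flag, and Chromium is assumed here to be the browser in use.

```bash
# Install the Chromium binary along with its OS-level dependencies.
playwright install --with-deps chromium
```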

## Launch Command

`python -m pipelinerl.launch --config-name miniwob environment.miniwob_url=file:///PATH/TO/miniwob-plusplus/miniwob/html/miniwob/`
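
Note that `environment.miniwob_url` must be supplied on the command line, since it defaults to `???` in `conf/miniwob.yaml`. Other keys from the configs in this PR can be overridden the same way; the sketch below assumes the Hydra-style override syntax of the command above, and the override values are illustrative.

```bash
# Use the GRPO variant config added in this PR, with illustrative overrides.
python -m pipelinerl.launch --config-name miniwob_grpo \
    environment.miniwob_url=file:///PATH/TO/miniwob-plusplus/miniwob/html/miniwob/ \
    output_dir=results/my_miniwob_run \
    model_path=meta-llama/Llama-3.1-8B-Instruct
```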
pipelinerl/domains/miniwob/environment_server.py
@@ -13,12 +13,14 @@ def __init__(self,
exp_path: str,
headless: bool = True,
observation_format: str = "html",
max_session_inactivity_secs: int = 600,
env_call_timeout: int = 60,
):
os.environ["MINIWOB_URL"] = miniwob_url
# Remote environment server configuration
self.n_envs = n_envs
self.host = host
self.max_session_inactivity_secs = max_session_inactivity_secs
self.env_call_timeout = env_call_timeout
# Individual web environment configuration
self.web_env_target = web_env_target
self.exp_path = exp_path
self.headless = headless
@@ -29,7 +31,7 @@ def launch(self, port: int):
"""
Serve the web environment in TapeAgent.
"""
env_server = EnvironmentServer(n_envs=self.n_envs, host=self.host, port=port, max_session_inactivity_secs=self.max_session_inactivity_secs)
env_server = EnvironmentServer(n_envs=self.n_envs, host=self.host, port=port, env_call_timeout=self.env_call_timeout)
env_server.launch(OmegaConf.create({
"_target_": self.web_env_target,
"exp_path": self.exp_path,