Fixes issues when running Goose on Windows. #16
base: main
Conversation
…es by using consistent printing methods.
…I calls fail, and to DummyMetric if Claude fails.
… logging in runner.py. Added test configuration to support log capture for assertions that downgrade was successful.
… logging in runner.py. Added test configuration to support log capture for assertions that downgrade was successful. Addressed ruff warnings.
…ric after the downgrade.
… verbosity temporarily to debug Claude judge unit test on build server. Adjusted logic to work when multiple coders are specified. Improved log messages.
…ic to DummyMetric.
… for the quota exhaustion fallback logic.
…ic downgrade to DummyMetric on quota check failure. Added notes on potential improvements to unit tests.
justaddcoffee left a comment
One test is failing, seemingly because it can't call goose correctly, but this might be a problem with my goose install.
@@ -0,0 +1,139 @@
import logging
This test is failing for me, but maybe that's a me problem
(metacoder) ~/PythonProject/metacoder literature-eval-issues $ uv run pytest -v tests/test_evals/test_claude_judge.py
====================================================================================== test session starts ======================================================================================
platform darwin -- Python 3.10.17, pytest-8.4.1, pluggy-1.6.0 -- /Users/jtr4v/PythonProject/metacoder/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/jtr4v/PythonProject/metacoder
configfile: pytest.ini
plugins: deepeval-3.3.4, repeat-0.9.4, anyio-4.9.0, rerunfailures-12.0, asyncio-1.1.0, xdist-3.8.0, langsmith-0.4.10
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success FAILED [ 50%]
tests/test_evals/test_claude_judge.py::test_correctnessmetric_downgrade_success PASSED [100%]
Running teardown with pytest sessionfinish...
=========================================================================================== FAILURES ============================================================================================
______________________________________________________________________________ test_claude_judge_downgrade_success ______________________________________________________________________________
tmp_path = PosixPath('/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judge_downgrade_su0')
caplog = <_pytest.logging.LogCaptureFixture object at 0x10b835210>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x10b835450>
def test_claude_judge_downgrade_success(tmp_path, caplog, monkeypatch):
"""Test that ClaudeJudge is used when OpenAI is disabled."""
# TODO: This test should avoid running the coder and only perform the eval step.
# Otherwise, it is impossible to get to the eval step if no valid API key is present or no quota is available (testing the wrong part of the process).
runner = EvalRunner()
try:
dataset = runner.load_dataset(
Path("tests/input/goose_eval_claude_downgrade_test.yaml")
)
# Unfortunately, there is nothing available in the eval results that indicate which model DeepEval used.
# One enhancement might be to introduce metric_model=claude-3-5-sonnet-20240620 to each result at eval time.
# Instead, resort to capturing the WARNING logs for assertions related to the downgrade.
with caplog.at_level(logging.WARNING):
# Temporarily set an invalid OPENAI_API_KEY in order to force OpenAI calls to fail.
# (no need to reset, `monkeypatch` automatically reverts after the test)
monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")
> results = runner.run_all_evals(
dataset, workdir=tmp_path, coders=["goose", "dummy"]
)
tests/test_evals/test_claude_judge.py:30:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/metacoder/evals/runner.py:502: in run_all_evals
results = self.run_single_eval(
src/metacoder/evals/runner.py:279: in run_single_eval
output: CoderOutput = coder.run(case.input)
src/metacoder/coders/goose.py:156: in run
result = self.run_process(command, env)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = GooseCoder(workdir='/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judg...'mcp-simple-pubmed'], 'envs': {'PUBMED_EMAIL': '[email protected]'}, 'env_keys': ['PUBMED_EMAIL'], 'bundled': None}}})])
command = ['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']
env = {'BUN_INSTALL': '/Users/jtr4v/.bun', 'CLAUDE_CODE_SSE_PORT': '49371', 'COMMAND_MODE': 'unix2003', 'CONFIDENT_AI_RUN_TEST_NAME': 'test_claude_judge_downgrade_success', ...}
def run_process(
self, command: list[str], env: dict[str, str] | None = None
) -> CoderOutput:
"""Run a process and return the output.
Args:
command: Command to run
env: Environment variables to use
Returns:
Tuple of stdout and stderr
Example:
>>> from metacoder.coders.dummy import DummyCoder
>>> coder = DummyCoder(workdir="tmp")
>>> output = coder.run("hello")
>>> output.stdout
'you said: hello'
"""
if env is None:
env = self.expand_env(self.env)
process = subprocess.Popen(
command,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
env=env,
bufsize=1,
universal_newlines=True,
)
stdout_lines: list[str] = []
stderr_lines: list[str] = []
# check verbosity level
quiet_mode = logger.getEffectiveLevel() <= logging.INFO
def stream_output(pipe, output_lines, stream):
for line in iter(pipe.readline, ""):
if not quiet_mode:
print(line.rstrip(), file=stream)
output_lines.append(line)
pipe.close()
# Start threads for both stdout and stderr
stdout_thread = threading.Thread(
target=stream_output, args=(process.stdout, stdout_lines, sys.stdout)
)
stderr_thread = threading.Thread(
target=stream_output, args=(process.stderr, stderr_lines, sys.stderr)
)
stdout_thread.start()
stderr_thread.start()
# Wait for process and threads to complete
return_code = process.wait()
stdout_thread.join()
stderr_thread.join()
stdout_text = "\n".join(stdout_lines)
stderr_text = "\n".join(stderr_lines)
if return_code != 0:
error = subprocess.CalledProcessError(return_code, command)
error.stdout = stdout_text
error.stderr = stderr_text
> raise error
E subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
src/metacoder/coders/base_coder.py:222: CalledProcessError
During handling of the above exception, another exception occurred:
tmp_path = PosixPath('/private/var/folders/vc/lfqgrrhn56d9yj5fbxbw6qr00000gt/T/pytest-of-jtr4v/pytest-2/test_claude_judge_downgrade_su0')
caplog = <_pytest.logging.LogCaptureFixture object at 0x10b835210>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x10b835450>
def test_claude_judge_downgrade_success(tmp_path, caplog, monkeypatch):
"""Test that ClaudeJudge is used when OpenAI is disabled."""
# TODO: This test should avoid running the coder and only perform the eval step.
# Otherwise, it is impossible to get to the eval step if no valid API key is present or no quota is available (testing the wrong part of the process).
runner = EvalRunner()
try:
dataset = runner.load_dataset(
Path("tests/input/goose_eval_claude_downgrade_test.yaml")
)
# Unfortunately, there is nothing available in the eval results that indicate which model DeepEval used.
# One enhancement might be to introduce metric_model=claude-3-5-sonnet-20240620 to each result at eval time.
# Instead, resort to capturing the WARNING logs for assertions related to the downgrade.
with caplog.at_level(logging.WARNING):
# Temporarily set an invalid OPENAI_API_KEY in order to force OpenAI calls to fail.
# (no need to reset, `monkeypatch` automatically reverts after the test)
monkeypatch.setenv("OPENAI_API_KEY", "fake-api-key-for-testing")
results = runner.run_all_evals(
dataset, workdir=tmp_path, coders=["goose", "dummy"]
)
# Test that the quota exhaustion fallback logic worked as expected.
assert (
"OpenAI API quota exhausted or server unavailable; disabling OpenAI for DeepEval."
in caplog.text
)
# Test that the new evaluation judge was correctly selected for the metric model downgrade.
assert (
"Downgrading CorrectnessMetric model from gpt-4.1 to claude-3-5-sonnet-20240620."
in caplog.text
)
# Test that the eval completed by checking for a non-zero score.
assert results[0].score > 0, (
f"Expected a {results[0].metric_name} score for {results[0].case_name}."
)
except Exception as e:
# Test that fallback logic does not result in an Exception.
logger.error(f"An error occurred: {e}")
logging.error(traceback.format_exc())
> assert False # This assertion will fail if an Exception is caught here.
E assert False
tests/test_evals/test_claude_judge.py:55: AssertionError
------------------------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------------------------
2025-09-02 14:01:41,552 [ERROR] test_claude_judge: An error occurred: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
2025-09-02 14:01:41,553 [ERROR] root: Traceback (most recent call last):
File "/Users/jtr4v/PythonProject/metacoder/tests/test_evals/test_claude_judge.py", line 30, in test_claude_judge_downgrade_success
results = runner.run_all_evals(
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 502, in run_all_evals
results = self.run_single_eval(
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 279, in run_single_eval
output: CoderOutput = coder.run(case.input)
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/goose.py", line 156, in run
result = self.run_process(command, env)
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/base_coder.py", line 222, in run_process
raise error
subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
------------------------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------------------------
thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--------------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------------
ERROR test_claude_judge:test_claude_judge.py:53 An error occurred: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
ERROR root:test_claude_judge.py:54 Traceback (most recent call last):
File "/Users/jtr4v/PythonProject/metacoder/tests/test_evals/test_claude_judge.py", line 30, in test_claude_judge_downgrade_success
results = runner.run_all_evals(
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 502, in run_all_evals
results = self.run_single_eval(
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/evals/runner.py", line 279, in run_single_eval
output: CoderOutput = coder.run(case.input)
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/goose.py", line 156, in run
result = self.run_process(command, env)
File "/Users/jtr4v/PythonProject/metacoder/src/metacoder/coders/base_coder.py", line 222, in run_process
raise error
subprocess.CalledProcessError: Command '['goose', 'run', '-t', 'What is the first sentence of section 2 in PMID: 28027860?']' returned non-zero exit status 101.
==================================================================================== short test summary info ====================================================================================
FAILED tests/test_evals/test_claude_judge.py::test_claude_judge_downgrade_success - assert False
================================================================================== 1 failed, 1 passed in 0.70s ==================================================================================
(metacoder) ~/PythonProject/metacoder literature-eval-issues $ goose run -t 'What is the first sentence of section 2 in PMID: 28027860?'
starting session | provider: openai model: gpt-4o
logging to /Users/jtr4v/.local/share/goose/sessions/20250902_140206.jsonl
working directory: /Users/jtr4v/PythonProject/metacoder
To find the first sentence of section 2 in the document with PubMed ID (PMID) 28027860, you would need access to the journal article in question. PubMed itself often provides abstracts and metadata, but accessing the full text is typically subject to journal access, which may require a subscription or institutional access.
For full content, you might try:
- Accessing through a library or educational institution that provides access to the journal.
- Checking if the journal offers open access articles.
- Searching for the article on the publisher's website or databases like ResearchGate where authors sometimes share full texts.
If you have access to the document through any of these means or have a PDF version, you could locate section 2 and find the first sentence directly. If not, accessing the specific content might require subscription access.
This is the source of your error:
thread 'main' panicked at /Users/runner/work/goose/goose/crates/goose-cli/src/session/builder.rs:53:61:
called `Result::unwrap()` on an `Err` value: Failed to access keyring: Platform secure storage failure: A default keychain could not be found.
It is definitely related to your goose configuration. I had this same keyring issue, but only when running under WSL2 Ubuntu. The fix was to install and start gnome-keyring in the background (see blog post). On Mac, this probably has a similar cause but a different solution. I suspect your goose is configured to store your API keys in a way that requires some part of your login shell environment to be available.
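A quick way to check this hypothesis before running goose is to test whether a Secret Service keyring daemon is even available. This is a hedged sketch for Linux/WSL2; `gnome-keyring-daemon` is the standard binary name there, while macOS relies on the login keychain, where the same panic indicates a missing or locked default keychain:

```shell
# Sketch: detect whether a keyring daemon is on the PATH before running
# goose, since goose reads stored API keys from the platform keyring.
if command -v gnome-keyring-daemon >/dev/null 2>&1; then
  echo "keyring daemon found"
else
  echo "keyring daemon missing; goose may fail to read stored API keys"
fi
```

On WSL2 the fix amounts to installing gnome-keyring and starting the daemon before invoking goose; on macOS, verifying that a default keychain exists and is unlocked (for example via Keychain Access) is the analogous step.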
…to avoid default encoding errors.
…herwise, create an empty file as UTF-8. Partially addresses Issue #24.
…verwrite. Enforced UTF-8 encoding, switched to safe_dump and added document delimiter between records. Also simplified document generation. Fixes issue #24. Added second test case to literature_mcp_encoding_test.yaml for testing.
… on Windows (where the default codepage is cp1252).
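The encoding commits above all guard against one pitfall: on Windows, `open()` without an explicit `encoding=` falls back to the locale codepage (often cp1252), which cannot represent many characters that fetched literature text contains. A minimal sketch of the failure mode and the fix; the file name is illustrative, not from the PR:

```python
from pathlib import Path

path = Path("utf8_demo.txt")  # hypothetical file name for illustration
text = "naïve café ✓"         # "✓" has no cp1252 representation

# An implicit locale encoding (cp1252 on many Windows installs) cannot
# even encode this string:
try:
    text.encode("cp1252")
except UnicodeEncodeError:
    pass  # this is the failure the commits above guard against

# Passing encoding="utf-8" explicitly round-trips on every platform.
path.write_text(text, encoding="utf-8")
assert path.read_text(encoding="utf-8") == text
path.unlink()
```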
Claude finished @ct-parker's task.
Overall Assessment: Strong Approval - Excellent PR ⭐
This PR successfully addresses multiple Windows compatibility issues and introduces valuable evaluation infrastructure improvements. The changes demonstrate excellent adherence to best practices and defensive programming principles.
Major Changes Reviewed:
1. UTF-8 Encoding Fix (Issue #14) 🎯
File:
- with open(session_file, "r") as f:
+ with open(session_file, "r", encoding="utf-8") as f:
✅ Excellent Fix:
2. Consistent Logging Precision 📊
All coder files: Standardized timing logs to use
✅ Code Quality Improvements:
3. New Evaluation Infrastructure 🧪
New file:
Enhanced:
4. CLI Output Improvements 💻
File:
Technical Analysis by Category:
✅ Code Quality & Best Practices
Strengths:
Follows repository standards:
✅ Bug Prevention & Fixes
Issues Fixed:
Defensive Features Added:
✅ Performance Considerations
Performance Improvements:
No performance regressions: All changes maintain or improve existing performance characteristics.
✅ Security Assessment
Security Strengths:
No security concerns identified.
✅ Test Coverage Assessment
Excellent Test Coverage:
Test Quality:
🔍 Addressing Test Failure Concern
Regarding @justaddcoffee's test failure: Analysis:
Recommendation:
Specific Code Review Comments:
Commit 1:
Commit 2:
Commit 3:
Commit 4:
Commit 5:
Commit 6:
Commit 7:
Commit 8:
Commit 9:
Commit 10:
Commit 11:
Commit 12:
CorrectnessMetric to DummyMetric. RuntimeError to Exception when Anthropic API key is missing.
Commit 13:
Commit 14:
Commit 15:
Commit 16:
Commit 17:
Commit 18: (reverted by Commit 20)
Fixes "Eval results are repeatedly overwritten when running a test suite" #24. Modified save_results to append to existing output file rather than overwrite. Enforced UTF-8 encoding. Switched to safe_dump and added document delimiter between records. Simplified document generation. Added second case to literature_mcp_encoding_test.yaml for testing.
Commit 19:
Commit 20:
open() for Windows support.
Commit 21:
Test Results (successfully downgrades from OpenAI to Claude, in contrast to the same example from Issue #18)
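Commit 18's save_results change (append rather than overwrite, a document delimiter between records, forced UTF-8) can be sketched without PyYAML. The real change serializes with yaml.safe_dump; `append_result` and the file name below are hypothetical stand-ins that write pre-serialized strings:

```python
from pathlib import Path

def append_result(path: Path, record: str) -> None:
    """Append one serialized record instead of overwriting the file.

    "---" is the YAML document delimiter, so successive eval runs
    accumulate as separate documents; encoding="utf-8" avoids the
    cp1252 default on Windows.
    """
    with open(path, "a", encoding="utf-8") as f:
        f.write("---\n")
        f.write(record.rstrip("\n") + "\n")

out = Path("results_demo.yaml")  # hypothetical output file
out.unlink(missing_ok=True)
append_result(out, "case: first\nscore: 0.9")
append_result(out, "case: second\nscore: 1.0")
# Both records survive the second save instead of being overwritten.
assert out.read_text(encoding="utf-8").count("---") == 2
out.unlink()
```

Appending in "a" mode is what keeps earlier results intact across a test suite; the delimiter is what lets a YAML loader later read the file back as multiple documents.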