Hi!
I find that some test functions failed against a generated executable because the test expected to have some string in stderr, while the executable gave that string in stdout.
I also noticed that ProgramBench’s container execution helper appears to merge stdout/stderr streams here.
Is stdout/stderr separation intentionally unavailable to agents during exploration? (I think it is still possible for an agent to manually test stream placement with shell redirection, e.g. cmd >/tmp/out 2>/tmp/err, but the default observation format seems to make the distinction easy to miss.)
Thank you!
The commands I ran
uv run --with mini-swe-agent mini-extra programbench \
--filter "abishekvashok__cmatrix.5c082c6" \
--output output \
--model openai/gpt-5.4
uv run programbench eval output
Score log
Evaluation Summary
Instance Score Comment
abishekvashok__cmatrix.5c082c6 94 507 tests
Average 94 1 instances
A failed test in eval.json
- task:
abishekvashok/cmatrix
- test function name:
eval.tests.test_cmatrix.TestColorOptions.test_color_invalid_shows_error_message
- branch:
1b991a57d4e9
self = <test_cmatrix.TestColorOptions object at 0x7732071f2bf0>
def test_color_invalid_shows_error_message(self):
"""Test invalid color name produces error message."""
result = run("-C", "purple")
assert result.returncode == 0 # Exits with 0 but shows error
> assert b"Invalid color" in result.stderr
E AssertionError: assert b'Invalid color' in b''
E + where b'' = CompletedProcess(args=['./executable', '-C', 'purple'], returncode=0, stdout=b' Invalid color selection\n Valid colors are green, red, blue, white, yellow, cyan, magenta and black.\n', stderr=b'').stderr
eval/tests/test_cmatrix.py:120: AssertionError
Hi!
I find that some test functions failed against a generated executable because the test expected to have some string in stderr, while the executable gave that string in stdout.
I also noticed that ProgramBench’s container execution helper appears to merge stdout/stderr streams here.
Is stdout/stderr separation intentionally unavailable to agents during exploration? (I think it is still possible for an agent to manually test stream placement with shell redirection, e.g.
cmd >/tmp/out 2>/tmp/err, but the default observation format seems to make the distinction easy to miss.)Thank you!
The commands I ran
uv run --with mini-swe-agent mini-extra programbench \ --filter "abishekvashok__cmatrix.5c082c6" \ --output output \ --model openai/gpt-5.4 uv run programbench eval outputScore log
A failed test in eval.json
abishekvashok/cmatrixeval.tests.test_cmatrix.TestColorOptions.test_color_invalid_shows_error_message1b991a57d4e9