Imbue develops tools that help people think, create, and build. We believe technology should be loyal to the user and aligned with human goals.
Hey Alexander, that line about agents silently swapping in fake data instead of telling you they hit a wall is terrifying. Was there a specific moment where you discovered your agent told you tests passed but never actually ran them?
@vouchy You touched on two separate cases: 1) fake data and 2) tests that never ran. Fake-data issues are fairly easy to reproduce: just tell an agent to do something that requires an environment variable it doesn't have access to, or a piece of software that isn't installed. I ran into this myself when I added support for Vet to call out to Claude Code. Claude Code was not installed on my computer (it's proprietary, so installing it is a non-starter), and the agent wrote code and tests that mocked out the CC invocations.
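One way to make that failure loud instead of silent is a preflight check that refuses to continue when a prerequisite is missing. This is only a minimal sketch, not how Vet works: `check_prerequisites`, `MissingPrerequisite`, and the names passed to them are all hypothetical, chosen for illustration.

```python
import os
import shutil


class MissingPrerequisite(RuntimeError):
    """Raised when a required tool or environment variable is unavailable."""


def check_prerequisites(env_vars, executables, environ=None, which=shutil.which):
    """Raise instead of letting the caller fall back to mocks or fake data.

    `environ` and `which` are injectable so the check itself can be tested
    without touching the real environment (an assumption of this sketch).
    """
    environ = os.environ if environ is None else environ
    # An unset or empty variable counts as missing.
    missing = [f"env var {name}" for name in env_vars if not environ.get(name)]
    # shutil.which returns None when the executable is not on PATH.
    missing += [f"executable {name}" for name in executables if which(name) is None]
    if missing:
        raise MissingPrerequisite("cannot proceed: missing " + ", ".join(missing))
```

Calling something like `check_prerequisites(["HYPOTHETICAL_API_KEY"], ["hypothetical-cli"])` at the start of a task or test run turns "the agent quietly mocked it" into an explicit error that a human gets to see.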
As for not running tests: in my experience, agents are not good at running all relevant tests. Test suites are often slow, so there's a tension between being thorough and being quick. Agents will often run only the tests they think they could have impacted, even when their changes broke other tests through second-order effects.
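To illustrate the second-order problem, here is a rough sketch of selecting tests by walking the import graph backwards from the changed modules, so that indirect dependents are included too. The `imports` mapping and module names are made up for the example, and a real project would need to build that graph from its own code.

```python
from collections import deque


def impacted_modules(imports, changed):
    """Return the changed modules plus everything that depends on them,
    directly or transitively."""
    # Invert the graph: for each module, record who imports it.
    importers = {}
    for module, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(module)

    # Breadth-first walk from the changed modules up through their importers.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for importer in importers.get(current, ()):
            if importer not in seen:
                seen.add(importer)
                queue.append(importer)
    return seen


# Hypothetical project: module -> modules it imports.
imports = {
    "test_parser": {"parser"},
    "test_report": {"report"},
    "report": {"parser"},
    "parser": set(),
}

selected = sorted(m for m in impacted_modules(imports, {"parser"}) if m.startswith("test_"))
print(selected)  # ['test_parser', 'test_report']
```

A change to `parser` obviously calls for `test_parser`, but `test_report` is only reachable through `report`, which is exactly the kind of test that gets skipped when selection is done by guesswork.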