Steven Willmott

Spec27 - Spec-driven testing for AI agents and AI apps

Spec27 is a validation platform for AI agents. It helps teams move beyond manual, vibes-based testing by using machine-readable specifications to generate broader test coverage, catch regressions earlier, and validate both in-house and third-party systems without needing SDK integration or code-level access.

Replies

Steven Willmott
Hello Product Hunt, excited to be here!

For the past three years, we’ve been working on formal verification of machine learning models and looking for ways to get deep, relevant test coverage. With language-model-based applications this is particularly hard: the models and input spaces are massive, and you often don’t have access to the underlying model. So the techniques from formal verification developed for vision and tabular data don’t translate well, even for tightly constrained use cases.

Thinking this through from first principles gave us the epiphany that what you really need for effective testing is a good way to specify the behavior of the agents you’re testing. So Spec27 was born to do this. The platform allows you to create specs that specify behavior in various ways, capture what you want an agent to be robust to, and then automatically generate sets of test cases around this. The approach we’ve taken is also platform- and LLM-agnostic, so integrations can connect to pretty much any agent, no matter where it is hosted. There’s no SDK or AI gateway integration needed.

The product is free and in early access, and we’d love to get people on board and trying things out. The direct link to sign up is here: https://dashboard.spec27.ai/signup/

If you have a specific use case in mind, please also reach out and schedule a chat. Details are here: https://www.spec27.ai/

Made with ❤️ in London. Looking forward to your thoughts!
Priya K

@steven_willmott getting deep, relevant test coverage for LLMs is basically Mission Impossible right now because the input space is infinite. Specifying what the agent should be robust to, rather than just hoping it doesn't break, is a huge mental shift. Does Spec27 handle adversarial prompt generation as part of the automatic test sets?

Steven Willmott

Thanks @priya_kushwaha1, yes, that's the biggest part of the challenge: infinite input space + it's not necessarily continuous (so a tiny shift in input might lead to a massive shift in the output). In response to your question, yes, in the platform we have a growing list of adversarial methods that perturb the inputs in different ways. You can select which to use, and it'll effectively do a search in that adjacent input space. We use semantic similarity to keep the tests similar despite the variation.
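To make the idea concrete, here's a toy sketch of that perturb-then-filter loop (not Spec27's actual implementation). The perturbations below are trivial stand-ins for real adversarial methods, and `difflib`'s character-level ratio stands in for an embedding-based semantic similarity score:

```python
import difflib
import random

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one toy perturbation: case flip, filler insertion, or word swap."""
    words = prompt.split()
    choice = rng.randrange(3)
    if choice == 0 and words:
        i = rng.randrange(len(words))
        words[i] = words[i].upper()
    elif choice == 1:
        words.insert(rng.randrange(len(words) + 1), "please")
    elif choice == 2 and len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def similarity(a: str, b: str) -> float:
    # Stand-in for a real semantic similarity metric (e.g. embedding cosine).
    return difflib.SequenceMatcher(None, a, b).ratio()

def generate_variants(prompt: str, n: int = 20,
                      threshold: float = 0.7, seed: int = 0) -> list[str]:
    """Search the adjacent input space, keeping only variants that stay
    close to the seed prompt under the similarity metric."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n * 50):  # cap attempts so the loop always terminates
        if len(variants) >= n:
            break
        candidate = perturb(prompt, rng)
        if candidate != prompt and similarity(candidate, prompt) >= threshold:
            variants.append(candidate)
    return variants

tests = generate_variants("Summarize the refund policy for premium users")
for t in tests[:3]:
    print(t)
```

In a real setup you'd swap in stronger perturbation strategies (paraphrasing, injection attempts, encoding tricks) and an actual semantic-similarity model, but the generate/filter structure is the same.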

Priya K

@steven_willmott really helpful context, especially the bit about non-continuous input spaces. I'll check out the platform and see how it handles our specific edge cases. Congrats on the ship!

Nicolas Grenié
As the agent space is getting more structured we need better tooling. Can’t wait to try Spec27! Does it matter which framework I’m using?
Steven Willmott

Thanks so much @picsoung, means a lot! It doesn't matter which framework you're using to build the agents. We have a JavaScript WASM engine that connects to pretty much anything (and we'll help if it's custom). The testing makes no assumptions about how the agent is built or about your access to the backend.

What are you building agents in?

Jovanca Garnadi

Excited to be part of the team launching Spec27 today! I care a lot about making AI safer in practice, so it’s really nice to share something we’ve been building around that. Happy to chat with anyone working on agents, evals, or validation :)

Mark Cheshire

Looking forward to trying Spec27. The non-determinism of agent results is OK if the result has an equivalent meaning, but when it goes off and hallucinates a different meaning, the outcome can range from oops to total disaster. The trust of end users and company employees is difficult to repair once the agent has broken it. This looks like it can take us down the path to better-quality outcomes and better trust in agents. Very nice!