Hey, I'm Sacha, co-founder at @Edgee
Over the last few months, we've been working on a problem we kept seeing in production AI systems:
LLM costs don't scale linearly with usage; they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge and token spend quickly becomes the main bottleneck.
So we built a token compression layer designed to run before inference.
Hey Product Hunt 👋
I’m Sacha, co-founder of Edgee. Thanks for checking us out!
We built Edgee because we kept seeing the same thing everywhere:
AI costs are going crazy!
LLMs are easy to try, but once you ship them in production, costs explode and reliability becomes a mess.
Most teams start with direct calls to OpenAI or Anthropic, or simply use a coding assistant… then quickly end up dealing with:
Unpredictable token spend
Multiple provider APIs
Outages / rate limits
Security & privacy constraints
And no real observability across teams
Edgee is an AI Gateway built to reduce LLM costs and simplify production inference.
It gives you a single OpenAI-compatible API across providers, plus a layer of intelligence around inference:
✅ Token compression to remove redundant tokens and cut costs, with no semantic loss
✅ Routing & fallbacks across providers
✅ Observability + cost tracking you can trust
✅ Privacy & security controls (ZDR, BYOK...)
✅ Support for public + private models
✅ Edge Tools 🚀
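Since the gateway speaks the OpenAI API shape, adopting it is mostly a matter of pointing your client at a different base URL. A minimal sketch of what such a request looks like; the base URL, key, and `metadata` field name here are illustrative assumptions, not Edgee's documented values:

```python
import json

# Hypothetical gateway endpoint -- substitute the real base URL and API key
# from your own gateway dashboard.
GATEWAY_BASE_URL = "https://gateway.example.com/v1"

def build_chat_request(model, messages, tags=None):
    """Build an OpenAI-style chat completion request routed through a gateway.

    `tags` illustrates per-request metadata for cost attribution; the exact
    field name is an assumption for this sketch, not a documented schema.
    """
    payload = {"model": model, "messages": messages}
    if tags:
        payload["metadata"] = tags  # hypothetical field for usage/cost tracking
    return {
        "url": f"{GATEWAY_BASE_URL}/chat/completions",
        "headers": {
            "Authorization": "Bearer YOUR_GATEWAY_KEY",
            "Content-Type": "application/json",
        },
        "body": json.dumps(payload),
    }

req = build_chat_request(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this support ticket..."}],
    tags={"team": "support", "feature": "summarizer"},
)
```

The point of the sketch: because the request body is unchanged from a direct provider call, the compression, routing, and tracking all happen behind the same endpoint, with no application-level rewrite.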
We're launching early and working closely with a small group of design partners, so feedback (even brutal feedback 😅) would mean a lot.
Happy to answer any questions here, and I’d love to hear how you’re handling LLM infra in production today!
Sacha
We're experimenting with cheaper models to control costs, but quality suffers.
Can Edgee help us stay on expensive models but reduce token usage instead?
@pierregodret Yes, that’s exactly what Edgee does.
Edgee optimizes your prompts at the edge using intelligent token compression, removing redundancy while preserving meaning, then forwards the compressed request to your LLM provider of choice. You can also tag requests with metadata to track usage/costs and get alerts when spend spikes.
Happy to discuss this further if you’d like.
Absolutely @pierregodret. With our token-compression model, the LLM bill drops automatically, so it's actually a good opportunity to afford a slightly more expensive model... for the same price ;)
@sachamorard But how do you ensure that critical context is not lost after compression?
How do you evaluate your model?
This would be a huge gain, but I'm skeptical about quality: two pieces of text can be semantically similar yet not mean the same thing.
@sachamorard @somangshu Edgee falling back to the original prompt when BERT similarity drops below a threshold is the right production default. You don't silently lose meaning; you just skip the savings on that request. The harder problem is using one threshold across all request types. RAG context with repeated chunks compresses well, but structured outputs and few-shot examples are dense and break easily. You end up either too conservative on easy wins or too aggressive on fragile stuff. Per-request overrides fix it, but now you're maintaining compression config alongside prompt config.
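The fallback behavior described above can be sketched in a few lines. Note that `compress` and `similarity` below are deliberately crude stand-ins (whitespace collapsing and Jaccard token overlap) so the sketch runs end to end; the real system would use a learned compression model and a BERT-style semantic scorer:

```python
def compress(prompt: str) -> str:
    """Stand-in for a real compression model: here we just collapse
    redundant whitespace so the sketch is runnable."""
    return " ".join(prompt.split())

def similarity(a: str, b: str) -> float:
    """Stand-in for a BERT-style semantic similarity score in [0, 1].
    Here: crude Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def safe_compress(prompt: str, threshold: float = 0.9) -> str:
    """Use the compressed prompt only if it stays semantically close to
    the original; otherwise skip the savings and send it unchanged."""
    candidate = compress(prompt)
    if similarity(prompt, candidate) >= threshold:
        return candidate
    return prompt
```

The per-request-override tension is visible in the `threshold` parameter: a single default value is either too strict for repetitive RAG context or too loose for dense few-shot examples, which is exactly the configuration burden discussed above.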
As a product guy in the agentic platform space, I’m definitely going to keep a close eye on this one. Good luck with the launch!
@yannick_mthy The agentic space is exactly where we’re seeing things get interesting (and complex) fast, especially with growing context sizes, tool calls, and multi-model orchestration.
Would love to hear how you're currently handling cost + routing on the agent side. Always keen to learn from teams building in this space. Thx
The idea is very interesting. But how does it work?
For example, I have a travel AI — essentially a wrapper around ChatGPT and Gemini. Some of the prompts are huge. How would you reduce the number of tokens? Would you compress my prompts? But that could affect quality.
Could you suggest where parts could be replaced with free or cheaper tools? But then you would need to know our product as well as we do… How do you do that?
Congrats on the launch! Will follow closely, as the topic is complex and moves fast!
@olivier_lemarie1 Thank you! Indeed, it's an exciting and challenging topic, with so many things to explore and improve :D We'll soon publish a series of blog posts going through all the details and the research around compression, so stay tuned!
I've been waiting to see companies start tackling this issue. Cost and efficiency will only grow in importance as AI platforms come under more pressure to generate revenue.
Congrats on the launch! Will definitely be following this project closely. I've always thought there should be a way to provide prompts to LLMs more efficiently, especially when the latest models consume a lot of tokens for complex work. Hopefully this will eventually mean lower usage rates and higher limits.