Hey, I'm Sacha, co-founder at @Edgee
Over the last few months, we've been working on a problem we kept seeing in production AI systems:
LLM costs don't scale linearly with usage; they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge and token spend quickly becomes the main bottleneck.
So we built a token compression layer designed to run before inference.
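To make the idea concrete, here's a minimal sketch of what "a compression layer before inference" means in practice. The function names (`compress_prompt`, `call_llm`) and the whitespace-collapsing logic are purely illustrative stand-ins, not Edgee's actual API or algorithm:

```python
# Hypothetical sketch: a compression step sits between your app and the model.
# A real compressor would prune low-signal tokens with a learned model; this
# stand-in just collapses repeated whitespace and drops empty lines.

def compress_prompt(prompt: str) -> str:
    """Illustrative compressor: normalize whitespace, drop blank lines."""
    lines = [" ".join(line.split()) for line in prompt.splitlines()]
    return "\n".join(line for line in lines if line)

def call_llm(prompt: str) -> str:
    # Placeholder for the actual inference call to your provider.
    return f"<model response to {len(prompt)} chars of context>"

raw = "You are a helpful assistant.\n\n\n   Please   summarize:   \n" + "context " * 50
compressed = compress_prompt(raw)
print(len(raw), "->", len(compressed))
response = call_llm(compressed)  # inference sees the smaller prompt
```

The point is simply that compression happens before tokens ever reach the provider, so the savings apply to whatever model sits behind the call.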
Love this! Congrats @sachamorard. Great onboarding XP; managed to get going in under 5 minutes ❤️. Curious whether and how we can control the compression level and adjust it per endpoint or use case, as I imagine there's a quality trade-off?
@sachamorard Super clear. Thanks!
Impressed by the edge-native architecture with 100+ PoPs and the token compression approach.
I noticed Edgee is built with Claude Code. For developers using AI coding agents (Claude Code, Cursor, etc.) that make heavy API calls during development, does Edgee support integration at the agent workflow level? Specifically, can we route AI agent requests through Edgee to compress tool call contexts and reduce token consumption during iterative coding sessions?
Thanks for sharing! Exciting to hear about the Claude Code-specific token compressor. Looking forward to seeing the gains in iterative coding sessions.
Would like to see benchmarks across different model providers and prompt types. If the compression holds under real production loads, this could become default infra in most LLM stacks.
This looks amazing, @gilles_raymond! Reducing token costs by 50% is a game changer for anyone building agents for a big audience 🤯 Question: how does the compression impact latency for real-time applications? Congrats on the launch!
@sgiraudie Since our architecture runs at the edge, there is no noticeable effect on latency.
Love the focus on production problems vs demo features. Does the cost tracking integrate with existing observability tools (DataDog, etc.)?
@nielsrolland You raise a very interesting point! For now, data can be exported as CSV/JSON, but we're already working on integrations with partner solutions. If you know our history (which seems to be the case), you know how easy it is for us to send data to any destination... so we won't hold back from offering this feature to our users ;)
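Until native integrations land, the exported data can be aggregated by hand before forwarding it to an observability tool. A hedged sketch, assuming a JSON export shaped like the records below (the field names `endpoint`, `tokens_in`, `tokens_out` are hypothetical, not Edgee's documented schema):

```python
# Hypothetical sketch: summarize a JSON cost export per endpoint before
# shipping it to DataDog or similar. Field names are illustrative.
import json

export = json.loads("""
[
  {"endpoint": "/chat", "tokens_in": 1200, "tokens_out": 300},
  {"endpoint": "/chat", "tokens_in": 800, "tokens_out": 250},
  {"endpoint": "/search", "tokens_in": 400, "tokens_out": 100}
]
""")

totals: dict[str, int] = {}
for row in export:
    totals[row["endpoint"]] = totals.get(row["endpoint"], 0) + row["tokens_in"] + row["tokens_out"]

print(totals)  # {'/chat': 2550, '/search': 500}
```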
This would be game-changing for our margins. Does the compression work for both prompts and completions?
@hajar_lamjadab2 Yes, it does! And it's even more efficient as the context window grows larger.
Cool idea! Do you get transparency into how the prompt was trimmed/manipulated, so you can ensure nothing was missed?
@daniele_packard We have information that allows us to understand how our model performs, yes. However, we do not keep the original prompt, for obvious privacy reasons. To validate the compressed prompt, we perform a similarity analysis by computing several metrics (ROUGE, BERTScore, cosine similarity...), and we allow our users to define a threshold that guarantees semantic similarity.
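The threshold idea above can be sketched in a few lines. A real system would compare embeddings (e.g. BERT-based) of the original and compressed prompts; to keep this example self-contained, a simple bag-of-words cosine similarity stands in, and the `accept_compression` gate is a hypothetical illustration of the user-defined threshold, not Edgee's implementation:

```python
# Illustrative semantic-similarity gate for compressed prompts.
# Bag-of-words cosine similarity is a simplified stand-in for embedding-based
# metrics like BERTScore.
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def accept_compression(original: str, compressed: str, threshold: float = 0.85) -> bool:
    """Reject the compressed prompt if similarity falls below the user's threshold."""
    return cosine_similarity(original, compressed) >= threshold

original = "Please summarize the quarterly revenue report for the board meeting"
compressed = "summarize quarterly revenue report board meeting"
print(accept_compression(original, compressed, threshold=0.7))  # True (similarity ~0.707)
```

With a stricter threshold (say 0.85), this particular compression would be rejected and the system could fall back to the original prompt, which is the trade-off the user-defined threshold controls.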