TwoTrim - Cut LLM API costs by 65%. No GPU. No code changes.
TwoTrim — The Mathematical Prompt Compression Fabric for LLM APIs. Cut up to 65% of your AI token costs without losing accuracy.
TwoTrim is an open-source prompt compression middleware for LLM applications.
It sits between your app and any LLM API — OpenAI, Anthropic, or any OpenAI-compatible endpoint — and removes the tokens your model doesn't need
before the request is sent. Your code doesn't change. Your costs do.
What it does:
→ Strips filler words, redundant sentences, and formatting noise (lossless)
→ Semantic sentence scoring + Lost-in-the-Middle reordering (balanced)
→ BART summarization for long documents (aggressive)
→ FAISS semantic cache — works on similar queries, not just identical ones
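To make the semantic cache idea concrete, here is a minimal, dependency-free sketch of the concept: a new query hits the cache when its embedding is close enough to a previously seen one, not only when the strings match exactly. TwoTrim uses FAISS and a real sentence embedder for this at scale; the `embed()` function below is a hypothetical bag-of-characters stand-in, and the 0.95 threshold is illustrative, not TwoTrim's actual setting.

```python
from typing import Optional

import numpy as np


def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real sentence embedder: a normalized
    # bag-of-letters vector, used here only to make the sketch runnable.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


class SemanticCache:
    """Return a cached answer for queries similar to ones seen before."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.keys: list = []    # embeddings of cached queries
        self.values: list = []  # cached answers

    def get(self, query: str) -> Optional[str]:
        if not self.keys:
            return None
        q = embed(query)
        sims = np.array([k @ q for k in self.keys])  # cosine similarity
        best = int(sims.argmax())
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.keys.append(embed(query))
        self.values.append(answer)


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link.")
# A near-identical rewording hits the cache; an unrelated query misses.
hit = cache.get("How do I reset my password")
miss = cache.get("What is the refund policy?")
```

Swapping `embed()` for a real embedding model and the linear scan for a FAISS index gives the scalable version of the same behavior: high-volume support traffic with reworded repeat questions skips the LLM call entirely.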
What makes it different:
→ CPU-only. No GPU infrastructure required.
→ Zero refactoring — drop-in base_url swap for any OpenAI-compatible client
→ Works across providers via LiteLLM, vLLM, and more
→ Honest benchmarks: the cases where it fails are published too.
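The "zero refactoring" claim can be sketched as follows: with any OpenAI-compatible SDK, the integration is a single `base_url` change pointing at the middleware, e.g. `OpenAI(base_url=..., api_key=...)`. The proxy address below is hypothetical; check the repo for TwoTrim's actual default. The request configs are shown as plain dicts so the diff is explicit.

```python
# Before: the client talks straight to the provider.
before = {
    "base_url": "https://api.openai.com/v1",  # direct to OpenAI
    "api_key": "sk-...",
    "model": "gpt-4o-mini",
}

# After: point base_url at the local TwoTrim proxy (address hypothetical).
# TwoTrim compresses the prompt and forwards the request upstream, so the
# model, key, and request payload stay exactly as they were.
after = {**before, "base_url": "http://localhost:8000/v1"}

# Only base_url differs between the two configurations.
changed = {k for k in before if before[k] != after[k]}
print(changed)
```

Because the swap happens at the transport layer, the same change works for Anthropic or any other backend that the middleware can route to via LiteLLM or vLLM.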
Works best on: document summarization, long-context tasks, and high-volume chatbot/support systems with repeated queries.
Does not work well on: extreme multi-hop RAG at aggressive compression.
Full benchmark data is public in the repo.
Open source. Apache 2.0. Free forever.
github.com/overseek944/twotrim