Boost your OpenClaw with accurate data extraction for free
We rushed our open source solution for reliable document processing onto Product Hunt today, a few minutes ahead of the scheduled time, accepting that we would sacrifice getting featured. It felt essential to share it as soon as possible, so that builders can use it, free and locally, while it hurts the most.
Anthropic changed its pricing structure on April 4th. Overnight, the cost of running Claude in carefully built agent pipelines became untenable. The practical response, for most, was to downgrade to cheaper models. Output quality dropped noticeably, partly because those models weren't built for parsing documents: they try to read whatever raw strings they can find in the file.
Garbage in, garbage out.
(Claude handled PDFs differently, with full multimodal processing: each page was rasterised to a PNG image.)
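Rasterisation itself is a standard step. Here's a rough sketch in Python with PyMuPDF, picked purely for illustration; it is not Anthropic's internal tooling:

```python
# Rough sketch of page rasterisation, using PyMuPDF for illustration
# (not Anthropic's internal tooling).
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)  # render the page as a raster image
    pix.save(f"page-{i}.png")       # one PNG per page, fed to the model
doc.close()
```

Every page becomes an image, so the model's vision stack carries the parsing load. Cheaper text-only models never get that help.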
We'd already solved the problem of reliably processing complex data for Health Data Avatar, where a parsing error can be fatal. Our pipeline processes health records across 60+ language pairs, 30+ formats, handwritten notes, portal exports, and photos of paper.
So we knew we could build a smaller, local solution for those who need it now. Canonizr is the missing data processing and normalisation layer: it cleans, structures, and prepares inputs before they reach the model. It parses more file types accurately than Anthropic's own handling, so check it out.
Drop in a PDF, a Word document, a spreadsheet, a scanned image, or a legacy format, and Canonizr converts it to clean markdown. Not a model's best guess at the content, but the actual structure: tables intact, charts extracted, headings preserved.
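To make that concrete, here's a minimal sketch of the normalise-then-prompt pattern. The `canonizr.to_markdown` call is a hypothetical API written for illustration only; check the repo for the real interface:

```python
# Minimal sketch of the normalise-then-prompt pattern.
# NOTE: `canonizr.to_markdown` is a hypothetical API used for
# illustration only; see the repo for the actual interface.
import canonizr  # hypothetical import

def build_prompt(path: str, question: str) -> str:
    # Normalise the raw file (PDF, DOCX, XLSX, scan, legacy format)
    # into clean markdown before the model ever sees it.
    markdown = canonizr.to_markdown(path)  # hypothetical function
    # Hand the model structured text instead of raw bytes, so tables,
    # charts, and headings survive intact.
    return f"{question}\n\n---\n\n{markdown}"
```

The point is the order of operations: normalisation happens before the model call, so the model reasons over structure instead of noise.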
If you're a developer whose agent quality degraded last week and you don't know how to fix it, start with the inputs. If you want to help us build this, the repo is open and contributions are welcome. Please check out our launch page today!

