We rushed our open source solution for reliable document processing onto Product Hunt today, a few minutes before the scheduled time, accepting that we would sacrifice getting featured. It felt essential to share it as soon as possible, so builders can benefit from it, free and locally, while it hurts the most.
Anthropic changed its pricing structure on April 4th. Overnight, the cost of running Claude on carefully built agent pipelines became untenable. The practical response, for most, was to downgrade to cheaper models. Output quality dropped noticeably, partly because LLMs weren't built for parsing documents: they simply try to read whatever strings they find in the file.
Garbage in, garbage out.
(Claude handled PDF processing differently, with full multimodal handling: each page is rasterised to PNG.)
We'd already solved reliable complex data processing for Health Data Avatar, where a parsing error can be fatal. Our pipeline processes health records across 60+ language pairs, 30+ formats, handwritten notes, portal exports, and photos of paper.
So we knew we could build a smaller, local solution for those who need it now. Canonizr is your missing data processing and normalisation layer: it cleans, structures, and prepares inputs before they reach the model. It parses more file types accurately than Anthropic's own handling, so check it out.
Drop in a PDF, a Word document, a spreadsheet, a scanned image, or a legacy format, and Canonizr converts it to clean markdown. Not a model's best guess at the content. The actual structure: tables intact, charts extracted, headings preserved.
If you're a developer whose agent quality degraded last week and you don't know how to fix it, start with the inputs. If you want to help us build this, the repo is open. Contributions welcome. Please check out our launch page today!
Hi, I’m Maria! We built Canonizr and made it open source because document pipelines shouldn’t depend on one provider’s pricing decisions.
We already had complex data extraction reliably solved for our Health Data Avatar (multi-language, messy, high-stakes), and were planning to make it available for everyone one day. But last week accelerated things.
A lot of you lost workflows you’d built carefully. $200/month became $1,000–5,000/month overnight, with 24 hours’ notice — for the exact same usage. Then you migrated. And the document quality tanked.
What our team has been highlighting all this time: the real bottleneck is almost always unreliable, suboptimal data extraction — especially for complex formats — because language models weren’t built for layout parsing or precise data extraction. And many don’t even notice it, because randomly missing 5% of a PDF can still be acceptable for some use cases.
You often don’t even see what your agent actually received.
Scanned PDFs with mixed columns: traditional OCR transposes numbers.
Multilingual documents with Arabic: RTL text silently reverses.
Tables in financial reports: cells flatten into linear text, rows merge, meaning inverts.
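As a toy illustration of that last failure mode (not Canonizr's actual code), here is how flattening a table into linear text loses row boundaries, while a structure-preserving markdown rendering keeps each number attached to its metric:

```python
# Toy illustration of the "tables flatten" failure mode described above.
# The table contents are invented example data.
table = [
    ["Metric",   "Q1",   "Q2"],
    ["Revenue",  "4.1M", "3.8M"],
    ["Net loss", "0.2M", "0.9M"],
]

# Naive extraction: cells joined into one line. Row boundaries vanish,
# so "3.8M" and "0.2M" sit side by side with nothing tying them to
# "Revenue" vs "Net loss".
flat = " ".join(cell for row in table for cell in row)

# Structure-preserving extraction: a markdown table the model can parse.
header, *body = table
markdown = "| " + " | ".join(header) + " |\n"
markdown += "|" + "---|" * len(header) + "\n"
for row in body:
    markdown += "| " + " | ".join(row) + " |\n"
```

In the flattened string, a model reading left to right can easily pair a revenue figure with the loss row; the markdown version keeps the cell-to-row mapping explicit.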
We don’t think anyone should accept that compromise. If your documents aren’t properly structured, models miss information, outputs degrade, and costs explode.
Canonizr is a model-agnostic file parsing layer. Drop any file — 30+ formats including the ones LLMs struggle with. Get structured, clean output. Works with Claude, GPT-4o, Gemini, Llama, whatever you run next year when the landscape shifts again. Runs locally — your documents never leave your environment. Built-in PII detection so you can redact before you ever hit an LLM call.
Two ways to run Canonizr:
Local (free, open-source): One command installs everything — Docling, LibreOffice, Gemma 4, zero external calls. GDPR-compliant by architecture. Your documents never leave your hardware. MIT licence, fork it, own it.
Hosted API: We handle the infrastructure. You send documents and get back structured context. Zero retention — documents are deleted after parsing. Encrypted in transit and at rest.
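To make the "redact before you ever hit an LLM call" step concrete, here is a minimal sketch of pre-call redaction. This is an illustration only: the pattern names and `redact` function are invented for this example, and Canonizr's built-in PII detection is more sophisticated than two regexes.

```python
import re

# Hypothetical, minimal redaction pass. Canonizr's real PII detection
# is more capable; this only sketches the idea of scrubbing text
# before it reaches a hosted model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `redact("mail jane@example.com or call +1 415 555 0100")` yields `"mail [EMAIL] or call [PHONE]"`, so the raw identifiers never leave your environment.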
Would love to hear:
what broke in your workflows this week?
@maria_sergeeva1 I’ve been procrastinating on my old accounting documents forever, and it sounds like I can finally trust they can be parsed reliably.
Please ping me when you have the API running 😊
@maria_sergeeva1 @maria_anosova Thank you - we absolutely will (and there's a signup form on the webpage for anyone else reading this). If you have lots of the same kind of data then we may be interested in having you as a tester, so we'll be in touch about that too!
@maria_sergeeva1 It pulls docs for agents with zero retention - does it really delete everything right after?
For parsing, can I give a prompt to describe the format I want and what I do or don't need from the content?
And do you also parse video?
@bengeekly For image transcription there are currently two prompts loaded in as templates (one for transcription, one for image captioning) -- making it configurable is a great suggestion, thank you!
We don't support structured output directly, because the expected user has access to an LLM that is better at turning unstructured text into structured data. Many users don't realise that sending a document isn't the same as sending unstructured text (not helped by some handlers doing low-quality text extraction). Canonizr focuses on document -> text.
Video is planned: essentially it will be treated as an audio layer (transcribed) plus a sequence of images (extracted frames; one per second seems standard, but this may also need to be configurable based on user hardware).
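For anyone curious what that treatment implies in practice, here is a hedged sketch of the two extraction steps using standard ffmpeg options. The function names, defaults, and output formats are assumptions for illustration, not Canonizr's actual API:

```python
# Illustrative only: sketches the planned video treatment described
# above (audio track transcribed separately, frames sampled at ~1 fps).
# Function names and defaults are assumptions, not Canonizr's API.

def frame_extract_cmd(video: str, out_pattern: str, fps: int = 1) -> list[str]:
    """ffmpeg argv that samples `fps` frames per second into numbered images."""
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", out_pattern]

def audio_extract_cmd(video: str, out_wav: str) -> list[str]:
    """ffmpeg argv that drops the video stream (-vn) and keeps 16 kHz mono audio."""
    return ["ffmpeg", "-i", video, "-vn", "-ar", "16000", "-ac", "1", out_wav]
```

Each command list would be handed to `subprocess.run`; the extracted frames could then flow through the same image pipeline as scanned pages, and the WAV through a transcription model.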
The goal is to make it firmly opinionated (very few endpoints), while also being configurable.
Is there an API? We need to recognize scanned documents in one of our projects.
@mykyta_semenov_ Coming soon! Thank you for your interest, and you can also sign up for the release email through canonizr.com
@maria_sergeeva1 Super timely.
Claude changes broke a lot of doc-based workflows - this feels like a clean way to restore that without extra infra.
Keen to try it on a few messy PDFs 👀
I also like the fact I don't have to do the setup myself! Such a time saving for me...!
@maria_sergeeva1 @florian_piquemal Perfect use case, then! Please give it a go, and if you have any feedback you can reach us here, through GitHub, or through the Contact Us links.
Hi, I'm Hex! We're bringing you the solution to your broken OpenClaw pipelines ASAP. You can run Canonizr locally right now - it's free and open source. The API will be live later today for those of you running on restricted hardware and locked-down environments.
Check us out on GitHub to use Canonizr locally for free!