We rushed our open source solution for reliable document processing onto Product Hunt today, a few minutes before the scheduled time, accepting that we would sacrifice getting featured. It felt essential to share it as soon as possible, so builders can benefit from it, free and locally, while it hurts the most.
Anthropic changed its pricing structure on April 4th. Overnight, the cost of running Claude on carefully built agent pipelines became untenable. The practical response, for most, was to downgrade to cheaper models. Output quality dropped noticeably, partly because LLMs weren't built for parsing documents: they simply try to read whatever strings they find in the file.
Garbage in, garbage out.
(Claude handled PDF processing differently, with full multimodal handling: each page is rasterised to PNG.)
We'd already solved reliable complex data processing for Health Data Avatar, where a parsing error can be fatal. Our pipeline processes health records across 60+ language pairs, 30+ formats, handwritten notes, portal exports, and photos of paper.
So we knew we could build a smaller, local solution for those who need it now. Canonizr is your missing data processing and normalisation layer: it cleans, structures, and prepares inputs before they reach the model. It parses more file types accurately than Anthropic's own handling, so check it out.
Drop in a PDF, a Word document, a spreadsheet, a scanned image, or a legacy format, and Canonizr converts it to clean markdown. Not a model's best guess at the content. The actual structure: tables intact, charts extracted, headings preserved.
If you're a developer whose agent quality degraded last week and you don't know how to fix it, start with the inputs. If you want to help us build this, the repo is open. Contributions welcome. Please check out our launch page today!
Hi, I’m Maria! We built Canonizr and made it open source because document pipelines shouldn’t depend on one provider’s pricing decisions.
We already had complex data extraction reliably solved for our Health Data Avatar (multi-language, messy, high-stakes), and were planning to make it available for everyone one day. But last week accelerated things.
A lot of you lost workflows you’d built carefully. $200/month became $1,000–5,000/month overnight, with 24 hours’ notice — for the exact same usage. Then you migrated. And the document quality tanked.
What our team has been highlighting all this time: the real bottleneck is almost always unreliable, suboptimal data extraction — especially for complex formats — because language models weren’t built for layout parsing or precise data extraction. And many don’t even notice it, because randomly missing 5% of a PDF can still be acceptable for some use cases.
You often don’t even see what your agent actually received.
Scanned PDFs with mixed columns: traditional OCR transposes numbers.
Multilingual documents with Arabic: RTL text silently reverses.
Tables in financial reports: cells flatten into linear text, rows merge, meaning inverts.
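As a toy illustration of that last failure mode (not Canonizr's actual code), here is how flattening a table into linear text loses row boundaries, while a structure-preserving markdown rendering keeps each number attached to its metric:

```python
# Toy illustration of the "tables flatten" failure mode described above.
# The table contents are invented example data.
table = [
    ["Metric",   "Q1",   "Q2"],
    ["Revenue",  "4.1M", "3.8M"],
    ["Net loss", "0.2M", "0.9M"],
]

# Naive extraction: cells joined into one line. Row boundaries vanish,
# so "3.8M" and "0.2M" sit side by side with nothing tying them to
# "Revenue" vs "Net loss".
flat = " ".join(cell for row in table for cell in row)

# Structure-preserving extraction: a markdown table the model can parse.
header, *body = table
markdown = "| " + " | ".join(header) + " |\n"
markdown += "|" + "---|" * len(header) + "\n"
for row in body:
    markdown += "| " + " | ".join(row) + " |\n"
```

In the flattened string, a model reading left to right can easily pair a revenue figure with the loss row; the markdown version keeps the cell-to-row mapping explicit.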
We don’t think anyone should accept that compromise. If your documents aren’t properly structured, models miss information, outputs degrade, and costs explode.
Canonizr is a model-agnostic file parsing layer. Drop any file — 30+ formats including the ones LLMs struggle with. Get structured, clean output. Works with Claude, GPT-4o, Gemini, Llama, whatever you run next year when the landscape shifts again. Runs locally — your documents never leave your environment. Built-in PII detection so you can redact before you ever hit an LLM call.
Two ways to run Canonizr:
Local (free, open-source): One command installs everything — Docling, LibreOffice, Gemma 4, zero external calls. GDPR-compliant by architecture. Your documents never leave your hardware. MIT licence, fork it, own it.
Hosted API: We handle the infrastructure. You send documents and get back structured context. Zero retention — documents are deleted after parsing. Encrypted in transit and at rest.
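To make the "redact before you ever hit an LLM call" step concrete, here is a minimal sketch of pre-call redaction. This is an illustration only: the pattern names and `redact` function are invented for this example, and Canonizr's built-in PII detection is more sophisticated than two regexes.

```python
import re

# Hypothetical, minimal redaction pass. Canonizr's real PII detection
# is more capable; this only sketches the idea of scrubbing text
# before it reaches a hosted model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

For example, `redact("mail jane@example.com or call +1 415 555 0100")` yields `"mail [EMAIL] or call [PHONE]"`, so the raw identifiers never leave your environment.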
Would love to hear:
what broke in your workflows this week?
@maria_sergeeva1 I’ve been procrastinating on my old accounting documents forever, and it sounds like I can finally trust they can be parsed reliably.
Please ping me when you have the API running 😊
@maria_sergeeva1 @maria_anosova Thank you - we absolutely will (and there's a signup form on the webpage for anyone else reading this). If you have lots of the same kind of data then we may be interested in having you as a tester, so we'll be in touch about that too!
@maria_sergeeva1 It pulls docs for agents with zero retention - does it really delete everything right after?
For parsing, can I give a prompt to describe the format I want and what I do or don't need from the content?
And do you also parse video?
@bengeekly For image transcription there are currently two prompts loaded in as templates (one for transcription, one for image captioning) -- making it configurable is a great suggestion, thank you!
We don't support structured output directly, because the expected user has access to an LLM that is better at turning unstructured text into structured data. Many users don't realise that sending a document isn't the same as sending unstructured text (not helped by some handlers doing low-quality text extraction). Canonizr focuses on document -> text.
Video is planned: essentially it will be treated as an audio layer (transcribed) plus a sequence of images (extracted frames; one per second seems standard, but this may also need to be configurable based on user hardware).
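For anyone curious what that treatment implies in practice, here is a hedged sketch of the two extraction steps using standard ffmpeg options. The function names, defaults, and output formats are assumptions for illustration, not Canonizr's actual API:

```python
# Illustrative only: sketches the planned video treatment described
# above (audio track transcribed separately, frames sampled at ~1 fps).
# Function names and defaults are assumptions, not Canonizr's API.

def frame_extract_cmd(video: str, out_pattern: str, fps: int = 1) -> list[str]:
    """ffmpeg argv that samples `fps` frames per second into numbered images."""
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", out_pattern]

def audio_extract_cmd(video: str, out_wav: str) -> list[str]:
    """ffmpeg argv that drops the video stream (-vn) and keeps 16 kHz mono audio."""
    return ["ffmpeg", "-i", video, "-vn", "-ar", "16000", "-ac", "1", out_wav]
```

Each command list would be handed to `subprocess.run`; the extracted frames could then flow through the same image pipeline as scanned pages, and the WAV through a transcription model.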
The goal is to make it firmly opinionated (very few endpoints), while also being configurable.
Is there an API? We need to recognize scanned documents in one of our projects.
@mykyta_semenov_ Coming soon! Thank you for your interest, and you can also sign up for the release email through canonizr.com
@maria_sergeeva1 Super timely.
Claude changes broke a lot of doc-based workflows - this feels like a clean way to restore that without extra infra.
Keen to try it on a few messy PDFs 👀
I also like the fact I don't have to do the setup myself! Such a time saving for me...!
@maria_sergeeva1 @florian_piquemal Perfect use case, then! Please give it a go, and if you have any feedback you can reach us here, through GitHub, or through the Contact Us links.
Hi, I'm Hex! We're bringing you the solution to your broken OpenClaw pipelines ASAP. You can run Canonizr locally right now - it's free and open source. The API will be live later today for those of you running on restricted hardware and locked-down environments.
Check us out on GitHub to use Canonizr locally for free!