DataSieve helps you turn unstructured text into clean, usable data in seconds.
Drop in text, files, folders, or even archives, and extract what you need in one pass: emails, phone numbers, URLs, dates, financial data, and more. Everything runs locally on your device, with no cloud and no tracking.
What you can do
- Extract multiple data types at once
- Process text, PDFs, EPUBs, CSV, JSON, Word files, and more
- Export results to JSON, XLSX, DOCX, and more
- Define your own custom extractors
Hey everyone,
I’m the developer behind DataSieve (previously TextMine). This update has been a big step forward compared to the first version.
The main focus for 2.x was flexibility and scale. Being able to scan folders and archives, and define custom extractors, makes it much more useful for real workflows instead of just one-off text inputs.
I also spent time improving extraction accuracy for more complex data types like financial info and international formats.
Happy to answer any questions, and I’d really appreciate any feedback, especially around usability and edge cases.
@albemala What about PDFs where the data is an image instead of actual text?
For example, a financial report scanned as an image PDF instead of a text document.
@jonathan_alonso That's not supported yet. I'm planning to support extracting data from images in the next major release, and that will also cover images inside PDFs.
Hi Alberto, I like your idea of running everything locally. Is the list of attributes to extract static, or can I define custom ones?
Does it work on external websites? For example, I need to extract the names of all geographical objects mentioned across several websites. Can your system do that?
@natalia_iankovych Not yet. Extracting data from websites is something I'm planning to add in a future major release. Stay tuned!
Nice — structured data extraction is one of those problems that sounds simple until you actually try it. How does it handle ambiguous fields? For example, does it distinguish between a phone number and a fax number in unstructured text? Asking because I work on a similar challenge with voice-to-form mapping.
@alberto_polini You can define custom ones using regexes!
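To give a rough idea of what a regex-based custom extractor amounts to, here is a minimal Python sketch. It is a generic illustration, not DataSieve's actual configuration format or code: the extractor names (`order_id`, `hashtag`), the patterns, and the `extract_all` helper are all made up for the example.

```python
import re

# Hypothetical custom extractors: each name maps to a regular expression.
# These names and patterns are illustrative assumptions only.
CUSTOM_EXTRACTORS = {
    "order_id": re.compile(r"\bORD-\d{6}\b"),
    "hashtag": re.compile(r"#\w+"),
}


def extract_all(text, extractors=CUSTOM_EXTRACTORS):
    """Run every extractor over the text; keep unique matches in order of appearance."""
    results = {}
    for name, pattern in extractors.items():
        seen = []
        for match in pattern.finditer(text):
            value = match.group(0)
            if value not in seen:
                seen.append(value)
        results[name] = seen
    return results


sample = "Refund for ORD-004213 and ORD-004214 approved. #billing #refund ORD-004213"
print(extract_all(sample))
```

Each extractor is just a name plus a pattern, so several data types can be pulled out of the same text in one pass, with repeated matches collapsed.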
@webappski As of now, the app doesn't distinguish between phone numbers and fax numbers. Structured data extraction is indeed not easy!