The problem with multi-model databases: Treating everything as a Document.

by Farhan Syah

Ten years ago, the "database-per-microservice" / polyglot persistence model was the right way to build. But today, the bottleneck isn't the database engine — it's the glue.

What is the "ETL Tax"?

The ETL (Extract, Transform, Load) Tax is the hidden, compounding cost of moving data between specialized databases in a microservice/polyglot architecture. When your data is fractured across Postgres, Redis, Pinecone, and Neo4j, you don't just pay for the databases — you pay a massive "tax" to keep them synchronized.

It impacts engineering teams in three fatal ways:

  1. Engineering Time: Developers stop building product features and instead become "plumbers," writing brittle sync scripts, Debezium connectors, and Kafka pipelines.

  2. Data Consistency & Staleness: Teams end up fighting dual-write bugs, race conditions, and out-of-sync data (e.g., a user deletes their account in Postgres, but their embeddings still live in Pinecone); a sketch of this failure mode follows the list.

  3. Network Latency: You cannot do complex, real-time AI queries (like traversing a Graph and doing a Vector search) if the engines have to communicate over a network boundary.
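
To make the dual-write failure from point 2 concrete, here is a minimal, self-contained Rust sketch. The stores are toy in-memory stand-ins for Postgres and Pinecone (no real client APIs), but the shape of the bug is the same: two independent commits with no shared transaction, and a failure between them leaves orphaned data behind.

```rust
use std::collections::HashMap;

/// Toy stand-ins for two separate stores: a relational DB and a vector DB.
struct RelationalDb { users: HashMap<u64, String> }
struct VectorDb { embeddings: HashMap<u64, Vec<f32>> }

/// A dual-write "delete user" with no shared transaction: if the second
/// write fails (network blip, crash, deploy), the stores silently diverge.
fn delete_user(
    rdb: &mut RelationalDb,
    vdb: &mut VectorDb,
    user_id: u64,
    vector_store_reachable: bool,
) -> Result<(), String> {
    rdb.users.remove(&user_id);                        // write #1 commits
    if !vector_store_reachable {
        return Err("vector store unreachable".into()); // write #2 never happens
    }
    vdb.embeddings.remove(&user_id);
    Ok(())
}

fn main() {
    let mut rdb = RelationalDb { users: HashMap::from([(42, "alice".to_string())]) };
    let mut vdb = VectorDb { embeddings: HashMap::from([(42, vec![0.1, 0.2])]) };

    // Simulate the vector store being down during the second write.
    let _ = delete_user(&mut rdb, &mut vdb, 42, false);

    // The account is gone from the source of truth, but its embeddings remain:
    // exactly the orphaned-data bug the sync pipelines exist to paper over.
    println!("user in relational DB: {}", rdb.users.contains_key(&42));      // false
    println!("user in vector DB:     {}", vdb.embeddings.contains_key(&42)); // true
}
```

The usual "fixes" for this (retry queues, CDC, reconciliation jobs) are precisely the plumbing work described in point 1.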

Trying to Escape the Duct-Tape Architecture

I wanted out of the duct-tape architecture, so I looked into the current wave of "multi-model" databases.

I have massive respect for SurrealDB. They proved that you can consolidate workloads, and their Developer Experience (DX) is arguably the best in the industry right now. But as a developer looking to deeply optimize my stack, I hit an architectural wall when I looked under the hood.

The "Wrapper" Penalty

Today's multi-model databases are essentially smart query/compute layers wrapped around generic KV storage backends (like RocksDB, TiKV, or FoundationDB).

This means that underneath the slick API, almost every type of data is ultimately serialized and stored as a Document/KV pair.

  • Have Time-Series/Analytics data? It's stored as a document.

  • Have Graph data? The nodes and edges are stored as documents.

The problem is that you cannot properly optimize database workloads if you don't control the underlying memory and disk layouts. If you want to do fast aggregations, you need a Columnar layout. If you want to do high-performance, deep graph traversals, you need a CSR (Compressed Sparse Row) layout. You simply cannot fake a true columnar engine or a native graph engine on top of a generic KV store without paying a severe performance penalty and relying heavily on network hops to the storage backend.
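
For illustration, here is a minimal sketch of the CSR idea in Rust. This is the generic textbook layout, not NodeDB's (or any other engine's) actual implementation; the point is that a traversal step is a contiguous slice read instead of a per-edge key lookup against a remote KV store.

```rust
/// A minimal CSR (Compressed Sparse Row) adjacency layout.
/// Node IDs are dense integers; `offsets[v]..offsets[v + 1]` indexes into
/// `neighbors`, so visiting a node's edges is one contiguous slice read.
struct CsrGraph {
    offsets: Vec<usize>,   // len = node_count + 1
    neighbors: Vec<u32>,   // all adjacency lists, stored back to back
}

impl CsrGraph {
    fn neighbors_of(&self, v: usize) -> &[u32] {
        &self.neighbors[self.offsets[v]..self.offsets[v + 1]]
    }

    /// Two-hop expansion: every step is pointer arithmetic over
    /// cache-friendly, contiguous memory.
    fn two_hop(&self, v: usize) -> Vec<u32> {
        let mut out = Vec::new();
        for &n in self.neighbors_of(v) {
            out.extend_from_slice(self.neighbors_of(n as usize));
        }
        out.sort_unstable();
        out.dedup();
        out
    }
}

fn main() {
    // 4 nodes: 0 -> {1, 2}, 1 -> {2}, 2 -> {3}, 3 -> {}
    let g = CsrGraph {
        offsets: vec![0, 2, 3, 4, 4],
        neighbors: vec![1, 2, 2, 3],
    };
    println!("1-hop from 0: {:?}", g.neighbors_of(0)); // [1, 2]
    println!("2-hop from 0: {:?}", g.two_hop(0));      // [2, 3]
}
```

On a generic KV backend, the same two-hop expansion typically becomes one key lookup per edge (or a serialized document per node), each of which may cross a network boundary to the storage layer.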

Building a Native Engine

I realized that if I wanted a consolidated database without sacrificing the raw performance of specialized engines, I had to build the storage layers from scratch.

I spent the last year building NodeDB. It’s a distributed, multi-model database written in Rust.

Instead of being a wrapper, NodeDB implements native, specialized storage engines that all live within the same memory space:

  • A true Columnar engine for Time-Series and analytics.

  • A native Graph engine using CSR layouts.

  • Native Vector / AI search indexing.

  • Relational / Document engines.

Because these engines share the same memory space, there are no internal network hops. You can execute a single query that does a semantic vector search, traverses a graph to find related entities, and filters by a relational tenant ID, all at native speeds.
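
As an illustration of what "same memory space" buys you, here is a hedged toy sketch of that exact query shape. The structures and names below are invented for the example (brute-force vector scan, hash maps instead of real indexes); it is not NodeDB's API. The point is that the vector, graph, and relational steps compose as plain function calls over shared memory, with no serialization or network boundary between them.

```rust
use std::collections::HashMap;

// Toy stand-ins for three engines living in one process.
struct VectorIndex { docs: Vec<(u64, Vec<f32>)> }   // (doc_id, embedding)
struct Graph { edges: HashMap<u64, Vec<u64>> }       // doc_id -> related doc_ids
struct Relational { tenant_of: HashMap<u64, u32> }   // doc_id -> tenant_id

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Semantic search -> graph expansion -> tenant filter, all in-process.
fn related_docs_for_tenant(
    vectors: &VectorIndex, graph: &Graph, rel: &Relational,
    query: &[f32], k: usize, tenant: u32,
) -> Vec<u64> {
    // 1. Vector step: top-k nearest documents (brute force for the sketch).
    let mut scored: Vec<(u64, f32)> = vectors.docs.iter()
        .map(|(id, emb)| (*id, cosine(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    let seeds = scored.into_iter().take(k).map(|(id, _)| id);

    // 2. Graph step: expand each hit to its related entities.
    // 3. Relational step: keep only documents owned by this tenant.
    let mut out: Vec<u64> = seeds
        .flat_map(|id| graph.edges.get(&id).cloned().unwrap_or_default())
        .filter(|id| rel.tenant_of.get(id) == Some(&tenant))
        .collect();
    out.sort_unstable();
    out.dedup();
    out
}

fn main() {
    let vectors = VectorIndex { docs: vec![(1, vec![1.0, 0.0]), (2, vec![0.0, 1.0])] };
    let graph = Graph { edges: HashMap::from([(1, vec![10, 11]), (2, vec![12])]) };
    let rel = Relational { tenant_of: HashMap::from([(10, 7), (11, 9), (12, 7)]) };

    // Vector hit -> graph neighbours -> only tenant 7's documents.
    println!("{:?}", related_docs_for_tenant(&vectors, &graph, &rel, &[0.9, 0.1], 1, 7)); // [10]
}
```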

For those of you who have moved to multi-model databases like SurrealDB or ArangoDB, how has the performance held up at scale when doing heavy analytics or deep graph traversals? Does the convenience of a unified API outweigh the underlying KV-storage penalty?

Replies

AbdulHafeez Sadon

My typical workflow as a BI & Analytics Engineer working on AI-analytics projects is to deploy 3 DBs: OLTP for the app itself (single source of truth: flat files) -> CDC to OLAP for business-ready data transformation (Parquet) -> VectorDB (vectorized format) for AI/RAG. In my case, if I use NodeDB, how would my new architecture look? ACID compliance is very important for me in OLTP. For OLAP I use Apache Iceberg tables for their ACID compliance.

Farhan Syah

@hafeezcae 


Great question. Here's how your architecture would look with NodeDB:

One binary replaces all three layers:

  • OLTP → NodeDB's Document (strict) or KV engine. Fully ACID — WAL with O_DIRECT + fsync before ack, Raft quorum if replicated, full BEGIN/COMMIT/ROLLBACK/SAVEPOINT. This is your source of truth.

  • OLAP → NodeDB's Columnar engine. Per-column codecs (ALP, FastLanes, FSST, Gorilla), block stats, predicate pushdown.

  • Vector/RAG → NodeDB's Vector engine. HNSW + SQ8/PQ, with built-in hybrid BM25 + vector fusion.

No CDC pipeline. After OLTP commits to WAL, an in-process Event Plane fans WriteEvents to the Columnar and Vector engines. Eventually consistent like CDC, but zero network, no Debezium/Kafka/Airbyte. Cross-engine queries use a shared snapshot watermark, so one SQL statement joining OLTP + Columnar + Vector sees a consistent point-in-time view.
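
A rough sketch of that commit-then-fan-out pattern with an explicit watermark is below. It is illustrative only: the struct and field names are invented for the example and are not NodeDB's actual API; in the real system the commit path involves the WAL and fsync, and the fan-out is asynchronous.

```rust
use std::collections::HashMap;

// The OLTP engine commits first; an in-process "event plane" then fans the
// write out to the other engines. Readers pin a commit watermark so a
// cross-engine query sees one point-in-time view.

struct WriteEvent { commit_seq: u64, key: String, value: i64 }

#[derive(Default)]
struct OltpEngine { rows: HashMap<String, i64>, last_commit: u64 }

#[derive(Default)]
struct ColumnarEngine { applied: Vec<WriteEvent>, applied_up_to: u64 }

impl OltpEngine {
    /// Durable commit (WAL + fsync in a real system); returns the event
    /// the event plane will fan out.
    fn commit(&mut self, key: &str, value: i64) -> WriteEvent {
        self.last_commit += 1;
        self.rows.insert(key.to_string(), value);
        WriteEvent { commit_seq: self.last_commit, key: key.to_string(), value }
    }
}

impl ColumnarEngine {
    /// Applied asynchronously in-process -- no Kafka, no network hop.
    fn apply(&mut self, ev: WriteEvent) {
        self.applied_up_to = ev.commit_seq;
        self.applied.push(ev);
    }

    /// Readers only see events at or below the shared watermark.
    fn scan_at(&self, watermark: u64) -> Vec<&WriteEvent> {
        self.applied.iter().filter(|e| e.commit_seq <= watermark).collect()
    }
}

fn main() {
    let mut oltp = OltpEngine::default();
    let mut columnar = ColumnarEngine::default();

    let ev1 = oltp.commit("orders:1", 100);
    columnar.apply(ev1);                    // event plane has caught up to seq 1
    let watermark = columnar.applied_up_to;

    // A second commit that the event plane has NOT applied yet.
    let _not_yet_applied = oltp.commit("orders:2", 250);

    // A cross-engine query pinned at `watermark` sees seq 1 everywhere,
    // never a half-applied seq 2.
    println!("rows visible at watermark {}: {}", watermark, columnar.scan_at(watermark).len());
}
```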

On Iceberg specifically: NodeDB doesn't currently export to Iceberg — our columnar engine is internal storage, not an open table format. We're planning an Iceberg sink so you can publish curated tables downstream for Spark/Trino consumers, but it's not shipped yet.

So for your stack:

  • If you chose Iceberg purely for ACID on analytics → NodeDB's columnar engine gives you that via the snapshot watermark. You don't need Iceberg.

  • If you chose Iceberg because other teams/engines need to read the same files → keep Iceberg downstream of NodeDB until our sink ships. You still remove Pinecone, the CDC pipeline, and your OLTP database today.

Bottom line: NodeDB collapses OLTP + Vector + the CDC pipeline into one binary right now, with full ACID on the OLTP side. Iceberg interop is on the roadmap as an export sink, not as primary storage — because the whole point of NodeDB is that you don't need a separate OLAP database in the first place.

AbdulHafeez Sadon

@farhan_syah this is amazing, i can't imagine how much savings that would accumulate from cloud architecture. How do you plan for the cloud enablement i.e. terraform, Tencent, Alibaba Cloud, Cloudflare, AWS, GCP integration?

Farhan Syah

@hafeezcae 

On cloud enablement, we have implemented some of it already, but the general plan is:

  • Single static binary — one Rust binary, no Zookeeper, no sidecars, runs on any VM on AWS/GCP/Azure/Tencent/Alibaba/bare metal today.

  • Terraform module as the first official IaC target.

  • Kubernetes operator for teams already on K8s.

  • Object storage integration for the L2 cold tier across S3/GCS/R2/OSS/COS, with per-cloud auth (IAM, Workload Identity, RAM, CAM).

And NodeDB has Multi-Raft built in, so if you don't want the Kubernetes route, you don't need it. Each vShard is its own Raft group, so a plain Terraform-provisioned VM cluster gives you HA, auto-failover, and horizontal scaling without K8s in the picture at all. K8s is just one deployment option, not a requirement.

AbdulHafeez Sadon

@farhan_syah bismillah, lessgoo. Super pumped to implement NodeDB on my next projects. As for containerization, I know Docker is pretty solid, but would you consider adding support for Podman as well? It integrates natively with K8s, is fully OCI-compliant, and is daemonless, so better security.