With everyone else vibe coding their favorite database from scratch these days, I wanted in on the fun. So I took a few nights off from the projects I really should be working on at Nyrkiö, and wrote a small database engine from scratch.
Sunny Bains (InnoDB veteran, now at PingCAP) has been building database components in Rust, including a WAL implementation. Laurynas Biveinis (VilniusDB, but really works as a contractor on Meta’s MyRocks team) did a sweeping refactor of code and data structures in the MySQL code base that date back to the 80s. Baotiao implemented the InnoDB doublewrite buffer concept in PostgreSQL — with impressive results: 2x write throughput compared to PostgreSQL’s full page writes. And a friend who shall remain anonymous has reimplemented Cassandra in Rust.
Rather than re-implementing MongoDB in Rust… (isn’t that the obvious projection from those aforementioned projects?) …I started brainstorming with Gemini on a database design that is only possible because of AI — and whose primary users going forward will also be AIs, not humans.
And this is what Claude and yours truly came up with in the past week.
The Question
What if the storage engine could watch what queries are slow, and adapt to that? Kind of like “a DBA runs EXPLAIN, notices a missing index, and types CREATE INDEX.” But instead the engine observes that a particular filter query scans 100,000 documents to return 100, and automatically builds an index on the filter field. Without anyone asking.
That’s IngoDB. Or at least, that’s the starting point.
What Is It?
IngoDB is an adaptive, self-morphing document storage engine written in Rust. It’s an LSM-tree engine (like RocksDB, Cassandra, or LevelDB) with a twist: it tracks its own query performance and reactively restructures its physical data layout based on what it observes.
Key features as of this week:
- Liquid AST query interface — No SQL parser. Queries are Rust data structures (enums). The intended consumer is an AI agent or application code, not a person typing at a terminal. (My friend Kaj Arnö might provide a SQL frontend later. I leave it as an exercise for the reader to suggest a name for that product.)
- Graph traversal as join-by-value — There are no foreign keys, no special pointer types. Any field can become a graph edge at query time.
Traverse { from_field: "user_id", to_field: "_id", depth: 2 } follows relationships that are discovered, not declared.
- MVCC snapshot isolation — Multi-version concurrency control with UUIDv7 versions. Take a snapshot, do reads, and see a consistent point-in-time view while writers continue in parallel.
- UCS-inspired compaction — Unified Compaction Strategy adapted from Cassandra’s UCS (designed by my former colleague Branimir Lambov at DataStax). A single scaling parameter W controls the read/write amplification tradeoff, with background multi-threaded compaction.
- Reactive secondary indexes — This is the core differentiator. Sort queries automatically spill to disk as secondary index SSTables. Filter queries with low selectivity trigger index creation after 2 executions. Indexes use the same SSTable format as primary storage — just sorted by different fields. Partial index ranges evolve through compaction.
- 170 tests — Every feature has tests. The engine is young but not fragile.
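To make the join-by-value idea concrete, here is a minimal sketch of how a Traverse query could be evaluated over plain documents. The types and the `traverse` function are hypothetical simplifications for illustration, not IngoDB’s actual implementation:

```rust
use std::collections::HashMap;

// Toy document: field name -> string value. The real engine stores richer documents.
type Doc = HashMap<String, String>;

// Join-by-value traversal: follow values of `from_field` in the current
// frontier to documents whose `to_field` matches, up to `depth` hops.
// No foreign keys are declared anywhere; the edge exists only at query time.
fn traverse(docs: &[Doc], seeds: &[Doc], from_field: &str, to_field: &str, depth: u32) -> Vec<Doc> {
    let mut frontier: Vec<Doc> = seeds.to_vec();
    let mut result: Vec<Doc> = Vec::new();
    for _ in 0..depth {
        let keys: Vec<&String> = frontier.iter().filter_map(|d| d.get(from_field)).collect();
        let next: Vec<Doc> = docs
            .iter()
            .filter(|d| d.get(to_field).map_or(false, |v| keys.contains(&v)))
            .cloned()
            .collect();
        if next.is_empty() {
            break;
        }
        result.extend(next.iter().cloned());
        frontier = next;
    }
    result
}

fn main() {
    let mut user = Doc::new();
    user.insert("_id".to_string(), "u1".to_string());
    user.insert("name".to_string(), "Ada".to_string());
    let mut order = Doc::new();
    order.insert("_id".to_string(), "o1".to_string());
    order.insert("user_id".to_string(), "u1".to_string());

    let docs = vec![user.clone(), order.clone()];
    // A Traverse { from_field: "user_id", to_field: "_id", depth: 2 } hop
    // reaches the user; the second hop finds nothing and stops early.
    let hits = traverse(&docs, &[order], "user_id", "_id", 2);
    assert_eq!(hits.len(), 1);
    assert_eq!(hits[0].get("name").map(String::as_str), Some("Ada"));
}
```

The point is that any field pair can act as an edge: the same call with different field names follows a completely different relationship, with nothing declared up front.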
How It Learns
The feedback loop:
Query executes
→ Stats recorded (fields, latency, docs scanned vs returned)
→ Pattern detected (repeated filter with low selectivity)
→ Index created reactively (SSTable sorted by the filter field)
→ Next query uses the index (binary search instead of full scan)
→ Compaction maintains and evolves the index over time
The engine decides when to create an index. Right now everything is hard-coded, but the original idea was to let an AI (a neural network, yes; a large language model, no) actively add and remove indexes, LSM levels, and other optimizations at the physical storage level. In practice, the simple rules I put in place while developing already take you quite far:
If you think about it: if you don’t have an index on field1, and the user wants records sorted by field1, then at the end of that operation you have one. You can write the sorted result set to disk essentially for free. Only later, during the next compaction, do you need to decide whether that index is worth keeping at all.
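The filter-side rule can be sketched in a few lines. The struct names and thresholds below are hypothetical stand-ins; the source text only states that low-selectivity filters trigger index creation after 2 executions:

```rust
use std::collections::HashMap;

// Hypothetical thresholds; the real engine's rules are hard-coded elsewhere.
const EXECUTIONS_BEFORE_INDEX: u32 = 2;
const LOW_SELECTIVITY: f64 = 0.01; // returned/scanned below 1% marks an index candidate

#[derive(Default)]
struct FieldStats {
    executions: u32,
    scanned: u64,
    returned: u64,
}

#[derive(Default)]
struct QueryStats {
    per_field: HashMap<String, FieldStats>,
}

impl QueryStats {
    // Record one filter execution on `field`; returns true when the pattern
    // (repeated query + low selectivity) says a secondary index should be built.
    fn record_filter(&mut self, field: &str, scanned: u64, returned: u64) -> bool {
        let s = self.per_field.entry(field.to_string()).or_default();
        s.executions += 1;
        s.scanned += scanned;
        s.returned += returned;
        let selectivity = s.returned as f64 / s.scanned.max(1) as f64;
        s.executions >= EXECUTIONS_BEFORE_INDEX && selectivity < LOW_SELECTIVITY
    }
}

fn main() {
    let mut stats = QueryStats::default();
    // First scan: 100K docs scanned, 100 returned. Observed, not yet acted on.
    assert!(!stats.record_filter("category", 100_000, 100));
    // Second identical scan: threshold reached, build the index SSTable.
    assert!(stats.record_filter("category", 100_000, 100));
}
```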
No human DBA involved in schema design anymore. The engine just watches and adapts.
Performance (So Far)
Notes: There is a WAL, but it doesn’t call fsync yet. Group commit is hard even for a superhuman AI… There is MVCC snapshot isolation, but Claude didn’t use it by default. The results below are read-committed MVCC.
1M Products, batch writes + double-buffer memtable + adaptive W
Batch writes (1000 docs/batch), double-buffered memtable, adaptive W (unlimited step).
| Phase | Metric | Value |
|---|---|---|
| Ingest | 1M docs (batch=1000) | 210-235K docs/sec sustained |
| Ingest total | | 4.0s (was 42s before O(N) fix + double-buffer) |
| Updates | 1M random | starts 230K, settles ~120K during compaction |
| Compaction settle | after updates | 7.7s → 2 SSTables |
| Point lookups | 20K gets (2 SSTables) | 56K ops/sec, p50=14µs |
| First scan (no index) | category filter, 100K results | 3.9s |
| Second scan (index, but lookup to primary index) | | 1.8s (2.2x speedup) |
| 8-thread concurrent | | 413K ops/sec |
| Mixed read/write | | 72K ops/sec |
| Pure reads | 2M gets | 51K ops/sec |
Adaptive W journey: 0 → 8 (writes) → -8 (scans) → -3 (mixed) → -8 (reads). 3 compaction runs, 586 MB read, 423 MB written, WA=0.72x.
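The adaptive W journey above can be pictured as a tiny feedback controller. This sketch is my own hypothetical illustration, not the engine’s actual policy: it maps the observed write ratio linearly onto W in [-8, +8], which happens to reproduce the values in the journey:

```rust
// Hypothetical controller for UCS's W parameter: positive W tolerates more
// SSTable tiers (write-optimized), negative W compacts eagerly toward fewer
// runs (read-optimized). This maps the observed write ratio linearly onto
// W in [-8, +8]; the real engine's "unlimited step" policy may differ.
fn next_w(reads: u64, writes: u64) -> i32 {
    let total = (reads + writes).max(1) as f64;
    let write_ratio = writes as f64 / total;
    (write_ratio * 16.0 - 8.0).round() as i32
}

fn main() {
    assert_eq!(next_w(0, 1_000), 8);   // pure writes: tier aggressively
    assert_eq!(next_w(1_000, 0), -8);  // pure reads/scans: compact hard
    assert_eq!(next_w(700, 300), -3);  // mixed workload lands in between
}
```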
Not bad for a week of work. Not production-ready either. But the architecture is sound.
Why AI-Native?
1. The query interface is for machines. SQL is a human language. IngoDB’s Liquid AST is a data structure. An AI agent constructing Query::Scan { filter: Some(Filter::Gt { field: "age".into(), value: Value::U64(30) }), ... } is the intended use case. No parsing ambiguity, no SQL dialects, no impedance mismatch. The query IS the program.
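To show what "the query IS the program" means in practice, here is the snippet above expanded into a compiling sketch. The enum shapes are simplified stand-ins for the Liquid AST (the real types surely have more variants and fields):

```rust
use std::collections::HashMap;

// Simplified stand-ins for the Liquid AST; the real enums are richer.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    U64(u64),
    Str(String),
}

enum Filter {
    Gt { field: String, value: Value },
    Eq { field: String, value: Value },
}

enum Query {
    Scan { filter: Option<Filter>, limit: Option<usize> },
}

type Doc = HashMap<String, Value>;

// Evaluate a filter directly against a document: no parsing, no dialects.
fn filter_matches(filter: &Filter, doc: &Doc) -> bool {
    match filter {
        Filter::Gt { field, value } => match (doc.get(field), value) {
            (Some(Value::U64(a)), Value::U64(b)) => a > b,
            _ => false,
        },
        Filter::Eq { field, value } => doc.get(field) == Some(value),
    }
}

fn main() {
    // A plain Rust value, built by agent or application code.
    let query = Query::Scan {
        filter: Some(Filter::Gt { field: "age".to_string(), value: Value::U64(30) }),
        limit: None,
    };
    let mut doc = Doc::new();
    doc.insert("age".to_string(), Value::U64(42));

    let Query::Scan { filter: Some(f), .. } = &query else { return };
    assert!(filter_matches(f, &doc));
    doc.insert("age".to_string(), Value::U64(20));
    assert!(!filter_matches(f, &doc));
}
```

There is nothing to tokenize or mis-quote: an agent constructs the enum directly, and the engine matches on it.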
2. The optimizer will be AI. The plan is to use AI inside the database for query optimization, index creation, and other physical-layout decisions.
3. The engine is built by AI. IngoDB was implemented in a continuous dialogue between me and Claude (Anthropic’s coding agent). The entire codebase — 170 tests, 7 crates, ~10K lines — was produced in about a week of pair programming sessions. Claude wrote the code; I made the architectural decisions, caught the design flaws, and steered the direction. (For example: “Why does the content hash not include the _id? If those bytes get zeroed by corruption, you’d silently produce a valid-looking tombstone.” That caught a real bug.)
This isn’t a toy. It has MVCC, background compaction, crash-safe index metadata, and a real UCS implementation. It’s also not done. But the velocity of AI-assisted development makes it possible to build a serious storage engine in days instead of years.
The Origin Story
The full architectural vision is documented in a dialogue between me and Gemini that predates any code. That conversation produced the ideas: self-morphing LSM trees, reactive optimization during compaction, graph traversal as join. Claude turned those ideas into working code.
My background — MySQL/MariaDB architecture, MongoDB performance engineering, managing the core database teams at DataStax (Cassandra) and CrateDB — all fed into this design. IngoDB is what happens when you take lessons from five different database engines and ask “what if the engine could learn?”
What’s Next
- Use AI to make optimization decisions, like when to add or drop an index
- Semantic shredding: Extract hot fields into columnar structures during compaction
- Co-location: Group related documents together based on observed traversal patterns
- Custom comparators: WASM-based sort functions for user-defined ordering
- gRPC server: Network-accessible Liquid AST protocol
The code is at github.com/nyrkio/ingodb. Documentation is in docs/. Benchmarks are reproducible with cargo run --release --example benchmark.


