
Architecture

RESP-compatible clients talk to a Kronotop member over TCP. The member hosts the data models: Bucket for documents, ZMap for ordered key-value data. Both share the same sessions, namespaces, and transactions. Underneath, FoundationDB stores metadata, indexes, ZMap data, and cluster state, while Volume, Kronotop’s storage engine, keeps document bodies on the member’s local disk.

The rest of this page explains what Kronotop delegates to FoundationDB, how data is split into shards, and how shards are replicated.

[Diagram: RESP-compatible clients connect to a Kronotop member, which provides strictly serializable transactions and namespaces and hosts Bucket (documents, BQL, indexes), ZMap (ordered key-value), and Volume segments. Volume segments are replicated to a standby member. FoundationDB holds metadata, indexes, ZMap data, and cluster state.]

Kronotop speaks RESP2 and RESP3 and works with existing RESP-compatible clients. There is no separate query endpoint or admin protocol; everything, including cluster administration, is a RESP command. Each member listens on two ports: one for clients, one for cluster administration.
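To illustrate what "everything is a RESP command" means on the wire, here is a minimal sketch of how a client frames a command as a RESP array of bulk strings. The framing follows the RESP specification; the command names are placeholders, since this page does not list Kronotop's command set.

```python
def encode_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    # Array header: '*' followed by the number of elements.
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        # Bulk string: '$' + byte length, then the payload, each CRLF-terminated.
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


# Placeholder command, used only to show the framing.
frame = encode_command("PING")
```

Existing RESP client libraries produce this framing themselves, so in practice a client only needs the member's host and client port.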

Every client connection is bound to a session. The session holds its attributes, its current namespace, the active transaction if one is open, and its cursors.

Every Kronotop transaction is a FoundationDB transaction; strict serializability, conflict detection, and the ordered keyspace come from FoundationDB. Kronotop adds the RESP front end, the document layer with its query engine and indexes, and the Volume storage engine on top.

FoundationDB holds everything that must be transactional and small: bucket metadata and index entries, ZMap data, namespace directories, volume metadata, and cluster state. Document bodies are the exception. FoundationDB is optimized for small key-value pairs and enforces a 100 KB value-size limit, which document bodies routinely exceed. Volume offloads them to append-only segment files on the member’s local disk and keeps only pointers in FoundationDB.

The write path preserves transactional guarantees across the split: a document body is appended to a segment file and flushed to disk first; only after the flush succeeds is the metadata committed to FoundationDB. Metadata never references content that has not been persisted.
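The ordering described above can be sketched in a few lines. This is an illustration under stated assumptions, not Kronotop's implementation: a plain file stands in for a Volume segment, a dict stands in for the FoundationDB commit, and the names `write_document` and `read_document` are hypothetical.

```python
import os


def write_document(segment_path, metadata, doc_id, body):
    """Append the body to the segment, flush it, and only then record the pointer."""
    with open(segment_path, "ab") as segment:
        segment.seek(0, os.SEEK_END)
        offset = segment.tell()
        segment.write(body)
        segment.flush()
        os.fsync(segment.fileno())  # raises if the data cannot be persisted
    # Reached only after the flush succeeded. The dict assignment is a
    # stand-in for committing the pointer in a FoundationDB transaction.
    metadata[doc_id] = (segment_path, offset, len(body))
    return metadata[doc_id]


def read_document(metadata, doc_id):
    """Follow the stored pointer back to the document body."""
    path, offset, length = metadata[doc_id]
    with open(path, "rb") as segment:
        segment.seek(offset)
        return segment.read(length)
```

If the flush fails, no pointer is written, so the worst case is unreferenced bytes at the end of a segment, never a pointer to missing content.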

Bucket data is partitioned into shards. Each shard owns exactly one volume, named after the shard (bucket-shard-0, bucket-shard-1, and so on). A bucket spans one or more shards, so its documents are distributed across one or more volumes. Within a volume, each bucket’s data is isolated by a prefix.

Shard ownership is assigned through cluster routing: one member is the primary and serves writes, standby members replicate from it and can be promoted. Each shard also carries a status that controls traffic: READWRITE, READONLY, or INOPERABLE. Routing and status are managed with the admin command interface.
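As a rough model of how shard naming and status might gate traffic, consider the sketch below. The volume naming pattern comes from the text above. The status semantics are an assumption inferred from the status names, and `volume_name` and `allows` are hypothetical helpers, not Kronotop APIs.

```python
from enum import Enum


class ShardStatus(Enum):
    READWRITE = "READWRITE"
    READONLY = "READONLY"
    INOPERABLE = "INOPERABLE"


def volume_name(shard_id: int) -> str:
    """Each shard owns one volume, named after the shard."""
    return f"bucket-shard-{shard_id}"


def allows(status: ShardStatus, is_write: bool) -> bool:
    """Assumed semantics: INOPERABLE rejects everything, READONLY rejects writes."""
    if status is ShardStatus.INOPERABLE:
        return False
    if is_write:
        return status is ShardStatus.READWRITE
    return True
```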

ZMap data is not sharded by Kronotop. It lives directly in FoundationDB, which partitions its own keyspace automatically.

Volume replication is asynchronous and primary-to-standby. Each shard’s volume is replicated independently: standbys pull from the primary, first copying existing segment data in chunks until they reach the primary’s current write position, then streaming incremental changes from a changelog maintained in FoundationDB. Replication progress is also persisted in FoundationDB, so a standby can restart and resume exactly where it left off.
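The two stages can be modelled with an in-memory sketch. Everything here is a simplification for illustration: byte strings stand in for segment files, a list of `(offset, data)` tuples stands in for the changelog, a dict stands in for the progress persisted in FoundationDB, and the function names are hypothetical.

```python
def sync_segment(primary, target_position, standby, progress, chunk_size=4):
    """Stage 1: copy existing segment data in chunks up to target_position."""
    while progress["copied"] < target_position:
        start = progress["copied"]
        end = min(start + chunk_size, target_position)
        # Writing at an explicit offset makes a repeated chunk harmless.
        standby[start:end] = primary[start:end]
        progress["copied"] = end  # recorded after each chunk so a restart resumes here


def stream_changes(changelog, standby, progress):
    """Stage 2: apply changelog entries that have not been applied yet."""
    for offset, data in changelog[progress["applied"]:]:
        standby[offset:offset + len(data)] = data
        progress["applied"] += 1
```

Because progress is recorded after every step, calling either function again after a restart continues from the recorded position instead of starting over.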

Replication starts automatically when a standby is assigned through cluster routing. Promoting a standby and reassigning shards are explicit operator actions. See Volume replication for the mechanics and the operations guide for the commands.

All coordination state goes through FoundationDB; there is no separate consensus layer or gossip protocol. Members do not talk to each other to agree on cluster state; they read and write it in FoundationDB.

Failure detection works the same way. Each member periodically increments a heartbeat counter in FoundationDB, and other members expect that counter to keep advancing. A member whose counter stalls beyond a configured silent period is suspected dead. This is a local judgement made by each observer, not a cluster-wide consensus, and a suspected member drops off the observer's list of suspects as soon as its heartbeats resume. See health monitoring for the details.
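One observer's view of this scheme might look like the following sketch. It is not Kronotop's code: the class name, the method names, and the use of explicit timestamps are assumptions made for illustration, and the counters would in practice be read from FoundationDB.

```python
class HeartbeatObserver:
    """Tracks, per member, the last counter seen and when it last advanced."""

    def __init__(self, silent_period: float):
        self.silent_period = silent_period
        self._last = {}  # member -> (counter, time the counter last advanced)

    def observe(self, member: str, counter: int, now: float) -> None:
        last = self._last.get(member)
        # Only an advancing counter counts as a sign of life.
        if last is None or counter > last[0]:
            self._last[member] = (counter, now)

    def suspected(self, now: float) -> set:
        """Members whose counter has stalled longer than the silent period."""
        return {
            member
            for member, (_, advanced_at) in self._last.items()
            if now - advanced_at > self.silent_period
        }
```

Each observer keeps its own instance, which is why suspicion is a local judgement: two observers polling at different times can briefly disagree about the same member.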