The .join() That Should Be a Bug
Serving thousands of connections on a backend where every call blocks
There are two well-known ways to implement connection management in database systems. Each way has a cost. Kronotop takes a third approach. Let’s start with the problem we have.
Kronotop stores its metadata in FoundationDB. It also stores document bodies on the local file system. So almost every operation a client asks for turns into a network call to FDB and a read from the disk. A network call takes milliseconds, not nanoseconds. Reading a document, committing a transaction, looking up an index: all of it waits on I/O. Our command path is not in-memory work. It is dominated by waiting.
Two classic models
Redis (before 6.0) runs a single thread. That thread listens on every connection and processes commands one at a time. This scales connections well, but it forbids blocking. Nothing in the command path is allowed to wait. If one command waited, every other client would wait with it. That is fine when all your data is in memory. It does not work when a command has to do more. Our commands call a transactional store over the network, and they append data to a file on disk and call fsync. Every one of those steps blocks.
Postgres goes the other way. It gives each connection its own process. Now blocking is not a problem. Each connection runs on its own, so you can write plain, sequential code. But there is a price! A connection is now an operating system process, and that is heavy. A few thousand connections become too much. This is why Postgres deployments almost always sit behind a separate connection pooler.
One model scales connections but cannot wait. The other can wait but cannot hold a large number of open, mostly idle connections cheaply. Kronotop needs both.
Splitting the connection from the work
The trick is to stop treating “the connection” and “the work” as the same thing.
The connection side works the way Redis does. We build it on Netty. A small number of event loop threads listen on many sockets. They react to whatever socket is ready. They parse the incoming command and write the reply back. This part never blocks and never waits. So a handful of threads can keep thousands of connections alive.
The work side is the part that does disk and network I/O. It runs somewhere else, on a virtual thread. A virtual thread lets us write the I/O call as plain, blocking, top-to-bottom code. It is the Postgres style, but without the Postgres cost. The virtual thread blocks while it waits on FoundationDB or the disk. When it blocks, the Java runtime unmounts it and frees the carrier thread underneath for other work. Thousands of virtual threads can wait on I/O at the same time, while only a few real threads do anything. When the I/O completes, the virtual thread picks up where it left off.
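To make the "waiting is cheap" claim concrete, here is a minimal sketch, assuming Java 21 or newer. It is not Kronotop code. `slowLookup` is a hypothetical stand-in for a blocking call to FoundationDB or the disk, simulated with a sleep, and each lookup gets its own virtual thread.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingOnVirtualThreads {

    // Hypothetical stand-in for a blocking I/O call; the sleep simulates the wait.
    static int slowLookup(int id) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(100)); // the virtual thread unmounts while it waits
        return id * 2;
    }

    // Runs `tasks` blocking lookups, one virtual thread each, and sums the results.
    static long runAll(int tasks) throws Exception {
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                futures.add(executor.submit(() -> slowLookup(id)));
            }
            long sum = 0;
            for (Future<Integer> future : futures) {
                sum += future.get(); // plain blocking wait for each result
            }
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        long sum = runAll(10_000);
        long elapsedMs = Duration.ofNanos(System.nanoTime() - start).toMillis();
        System.out.println("sum=" + sum + " elapsedMs=" + elapsedMs);
    }
}
```

All ten thousand lookups wait at the same time, so the total run time should stay far below the 1,000 seconds that running them one after another would take. The exact figure depends on the machine.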
So the network threads stay free. They can serve every other connection. The slow part happens off to the side, where waiting is cheap. The result is then handed back to the connection thread, which writes the reply. We keep the write there by design. That hand-off is the one rule we keep strict.
In code, the whole offload comes down to two moving parts:
    CompletableFuture
        .supplyAsync(supplier, context.getVirtualThreadPerTaskExecutor()) // run on a virtual thread
        .thenAcceptAsync(action, response.getCtx().executor());           // resume on the Netty thread

Two executors, two phases. The supplier is the slow part. It runs on the virtual thread executor, where blocking on FoundationDB is allowed. The action is the reply. It runs on response.getCtx().executor(), the Netty event loop that owns this connection. So the work that waits and the write that must not move stay on separate threads. The second only starts once the first is done.
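The same two-executor chain can be tried in isolation. The sketch below is an illustration under assumptions, not Kronotop's code. It needs Java 21 or newer, and a single-threaded executor named `event-loop-0` stands in for the Netty event loop, since Netty itself is left out to keep the example self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TwoPhaseOffload {

    // Simulates a blocking I/O call inside a Supplier, which cannot throw checked exceptions.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns {kind of thread that ran the supplier, name of the thread that ran the action}.
    static String[] demo() {
        String[] seen = new String[2];
        try (ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor();
             // Hypothetical stand-in for the event loop that owns the connection.
             ExecutorService eventLoop = Executors.newSingleThreadExecutor(
                     r -> new Thread(r, "event-loop-0"))) {
            CompletableFuture
                .supplyAsync(() -> {
                    // Phase 1: the slow part, free to block.
                    seen[0] = Thread.currentThread().isVirtual() ? "virtual" : "platform";
                    sleepQuietly(50);
                    return "value";
                }, workers)
                .thenAcceptAsync(value -> {
                    // Phase 2: the reply, always on the "event loop" thread.
                    seen[1] = Thread.currentThread().getName();
                }, eventLoop)
                .join(); // wait here only so the demo can report the result
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("supplier ran on: " + seen[0]);
        System.out.println("action ran on:   " + seen[1]);
    }
}
```

The `join()` at the end of the chain exists only for the demo. A real command handler would return right after building the chain and let the event loop thread move on to other connections.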
A read command is written against exactly that shape. The first block fetches the value, the second sends it:
    public void execute(Request request, Response response) {
        AsyncCommandExecutor.supplyAsync(context, response, () -> {
            // virtual thread: open the transaction and wait on FDB
            Session session = request.getSession();
            Transaction tr = TransactionUtil.getOrCreateTransaction(context, session);
            DirectorySubspace subspace = openZMapSubspace(tr, session);
            return tr.get(subspace.pack(message.getKey())).join();
        }, (value) -> {
            // Netty thread: write the reply
            response.writeFullBulkString(toMessage(value));
        });
    }

The .join() in the middle is the interesting line. On a normal thread it would be a waste. The thread would sit there, blocked and useless. On a virtual thread it is the whole point.
The runtime parks the call and frees the real thread. It resumes the call when FDB answers. The handler reads top to bottom, like plain blocking code, and the cost of waiting is gone.
Once the supplier returns, the executor moves to the reply phase. If the command was an auto-commit, it cleans up the transaction. If something threw, it writes the error back to the client.
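One way to picture that reply phase is a single completion handler that sees either the value or the exception. The sketch below is a guess at the shape, not the actual AsyncCommandExecutor. The `execute` signature, the `+`/`-ERR` reply strings, the `autoCommit` flag, and running the cleanup on the event loop thread are all assumptions made for illustration. It needs Java 21 or newer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class ReplyPhaseSketch {

    // supplyAsync wraps a thrown exception in CompletionException; recover the cause.
    static Throwable unwrap(Throwable error) {
        return (error instanceof CompletionException && error.getCause() != null)
                ? error.getCause() : error;
    }

    // Hypothetical executor: blocking work on `workers`, reply and cleanup on `eventLoop`.
    static <T> CompletableFuture<Void> execute(
            Supplier<T> supplier, boolean autoCommit, Runnable cleanup,
            Consumer<String> writer, Executor workers, Executor eventLoop) {
        return CompletableFuture
            .supplyAsync(supplier, workers)
            .<Void>handleAsync((value, error) -> {
                try {
                    if (error == null) {
                        writer.accept("+" + value);
                    } else {
                        writer.accept("-ERR " + unwrap(error).getMessage());
                    }
                } finally {
                    if (autoCommit) {
                        cleanup.run(); // assumed: release the one-shot transaction
                    }
                }
                return null;
            }, eventLoop);
    }

    // Runs one successful and one failing command and returns what was "written".
    static List<String> demo() {
        List<String> replies = Collections.synchronizedList(new ArrayList<>());
        AtomicInteger cleanups = new AtomicInteger();
        try (ExecutorService workers = Executors.newVirtualThreadPerTaskExecutor();
             ExecutorService eventLoop = Executors.newSingleThreadExecutor()) {
            execute(() -> "42", true, cleanups::incrementAndGet,
                    replies::add, workers, eventLoop).join();
            execute(() -> { throw new IllegalStateException("transaction too old"); },
                    true, cleanups::incrementAndGet,
                    replies::add, workers, eventLoop).join();
        }
        replies.add("cleanups=" + cleanups.get());
        return replies;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Using one handler for both outcomes keeps the rule from earlier intact: whether the command succeeded or failed, the write to the client happens on the thread that owns the connection.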
You can see the relevant implementation parts from the following links:
What lives on a connection
A connection carries a small amount of state while it is open. It knows which client it is, which namespaces it has open, and whether it has authenticated. It also holds the details of any transaction in flight.
That last part matters. In Kronotop, a transaction belongs to the connection. You begin it, run several commands inside it, and then commit or roll it back. The whole time, that transaction is tied to your session. If your connection drops in the middle, the open transaction is cancelled and cleaned up. Nothing is left dangling. You can also reset your session state without dropping the connection. This clears cursors, watched keys, and any half-finished transaction in one step. That reset is also what makes a connection easy to reuse in a client-side pool. You hand it back, reset it, and hand it out again.
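The session lifecycle described here can be sketched as a small class. Everything in it is hypothetical: the `Session` and `Tx` types, their field names, and the choice to leave the client identity, authentication flag, and open namespaces untouched on reset are assumptions for illustration, not Kronotop's actual session object.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SessionSketch {

    // Hypothetical stand-in for an open transaction.
    static final class Tx {
        boolean cancelled;

        void cancel() {
            cancelled = true;
        }
    }

    // Hypothetical per-connection state.
    static final class Session {
        final long clientId;
        boolean authenticated;
        final Set<String> openNamespaces = new HashSet<>();
        final Map<Integer, String> cursors = new HashMap<>();
        final Set<String> watchedKeys = new HashSet<>();
        Tx transaction; // null when no transaction is in flight

        Session(long clientId) {
            this.clientId = clientId;
        }

        // Clears cursors, watched keys, and any half-finished transaction in one step.
        void reset() {
            cursors.clear();
            watchedKeys.clear();
            cancelTransaction();
        }

        // Called when the connection drops: nothing is left dangling.
        void onDisconnect() {
            cancelTransaction();
        }

        private void cancelTransaction() {
            if (transaction != null) {
                transaction.cancel();
                transaction = null;
            }
        }
    }

    public static void main(String[] args) {
        Session session = new Session(1L);
        session.authenticated = true;
        session.cursors.put(1, "cursor-state");
        session.watchedKeys.add("user:42");
        session.transaction = new Tx();

        session.reset(); // e.g. when a client-side pool takes the connection back
        System.out.println("cursors=" + session.cursors.size()
                + " watched=" + session.watchedKeys.size()
                + " inTransaction=" + (session.transaction != null));
    }
}
```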
Try Kronotop
If you have read this far, why not try Kronotop?
Kronotop is a distributed, transactional document database built on FoundationDB. See the Quickstart guide to start a cluster in a few minutes.