
Dislog Implementation Details


Why Dislog?

Folks who develop storage engines for filesystems and databases use logs to improve the data integrity of their systems. The ext filesystems, for example, log changes to a journal instead of directly changing the disk’s data file. Once the filesystem has safely written the changes to the journal, it then applies those changes to the data files. Logging to the journal is simple and fast, so there’s little chance of losing data. Even if your computer crashes before ext has finished updating the disk files, the filesystem will process the data in the journal on the next boot to complete its updates. Developers of databases like PostgreSQL use the same technique to make their systems durable: they record changes to a log, called a write-ahead log (WAL), and later process the WAL to apply the changes to their database’s data files.

Database developers use the WAL for replication, too. Instead of writing the logs to a disk, they write the logs over the network to the database’s replicas. The replicas apply the changes to their own copies of the data, and eventually they all end up in the same state. Raft, a consensus algorithm, uses the same idea to get distributed services to agree on a cluster-wide state. Each node in a Raft cluster runs a state machine with a log as its input. The leader of the Raft cluster appends changes to its followers’ logs. Since the state machines use the logs as input and the logs have the same records in the same order, all the services end up with the same state.

Web front-end developers use logs to help manage state in their applications. In Redux, a popular JavaScript library commonly used with React, you log changes as plain objects and handle those changes with pure functions that apply the updates to your application’s state. All these examples use logs to store, share, and process ordered data. This is really cool because the same tool helps replicate databases, coordinate distributed services, and manage state in front-end applications. You can solve a lot of problems, especially in distributed services, by breaking down the changes in your system until they’re single, atomic operations that you can store, share, and process with a log.

Databases often provide a way to restore their state to some time in the past, often referred to as point-in-time recovery. You take a snapshot of your database from the past and then replay the logs from the write-ahead log until the database is at the point in time you want. You don’t need the snapshot if you have every single log since the beginning to replay, but for databases with long histories and a lot of changes, keeping every log isn’t feasible. Redux uses the same idea to undo/redo actions: it logs the application’s state after each action, and undoing an action just requires Redux to move the state shown in the UI back to the previously logged state. Distributed version control systems like Git work similarly; your commit history is a literal commit log.

As you can see, a complete log holds not only the latest state but every state that has existed, which lets you build features that you’d find complicated to build otherwise. Logs are simple, and that’s why they’re good. That's why Dislog exists.

Dislog's Inner Structure

A log is an append-only sequence of records. You append records to the end of the log, and you typically read top to bottom, oldest to newest, similar to running tail -f on a file. You can log any data. People have historically used the term logs to refer to lines of text meant for humans to read, but that’s changed as more people use log systems whose “logs” are binary-encoded messages meant for other programs to read. When I talk about logs and records here, I’m not talking about any particular type of data. When you append a record to a log, the log assigns the record a unique and sequential offset number that acts like the ID for that record. A log is like a table that always orders the records by time and indexes each record by its offset and time created.
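A minimal in-memory sketch of that idea (the types here are illustrative, not Dislog's actual API): appending a record assigns it the next sequential offset, and reads address records by that offset.

```go
package log

import "fmt"

// Record pairs a value with the offset the log assigned to it.
type Record struct {
	Value  []byte
	Offset uint64
}

// Log is an append-only sequence of records.
type Log struct {
	records []Record
}

// Append adds a record to the end of the log and returns its offset.
func (l *Log) Append(value []byte) uint64 {
	off := uint64(len(l.records))
	l.records = append(l.records, Record{Value: value, Offset: off})
	return off
}

// Read returns the record stored at the given offset.
func (l *Log) Read(off uint64) (Record, error) {
	if off >= uint64(len(l.records)) {
		return Record{}, fmt.Errorf("offset %d out of range", off)
	}
	return l.records[off], nil
}
```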

Concrete implementations of logs have to deal with us not having disks with infinite space, which means we can’t append to the same file forever. So we split the log into a list of segments. When the log grows too big, we free up disk space by deleting old segments whose data we’ve already processed or archived. This cleaning up of old segments can run in a background process while our service can still produce to the active (newest) segment and consume from other segments with no, or at least fewer, conflicts where goroutines access the same data. There’s always one special segment among the list of segments, and that’s the active segment. We call it the active segment because it’s the only segment we actively write to. When we’ve filled the active segment, we create a new segment and make it the active segment.
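A separate sketch of the rotation logic, with the same simplification (in-memory segments with a record limit instead of real files; all names here are illustrative): appends always go to the active segment, and filling it triggers the creation of a new active segment.

```go
package log

// segment is simplified to an in-memory slice with a record limit;
// Dislog's real segments live on disk.
type segment struct {
	baseOffset uint64
	records    [][]byte
	maxRecords int
}

func (s *segment) isMaxed() bool { return len(s.records) >= s.maxRecords }

// segmentedLog appends to the active (newest) segment and rotates to a
// new segment once the active one fills up.
type segmentedLog struct {
	segments      []*segment
	activeSegment *segment
}

func (l *segmentedLog) Append(record []byte) uint64 {
	s := l.activeSegment
	off := s.baseOffset + uint64(len(s.records))
	s.records = append(s.records, record)
	if s.isMaxed() {
		next := &segment{baseOffset: off + 1, maxRecords: s.maxRecords}
		l.segments = append(l.segments, next)
		l.activeSegment = next
	}
	return off
}
```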

Each segment comprises a store file and an index file. The segment’s store file is where we store the record data; we continually append records to this file. The segment’s index file is where we index each record in the store file. The index file speeds up reads because it maps record offsets to their position in the store file. Reading a record given its offset is a two-step process: first, you get the entry from the index file for the record, which tells you the position of the record in the store file, and then you read the record at that position in the store file. Since the index file requires only two small fields—the offset and stored position of the record—the index file is much smaller than the store file that stores all your record data. Index files are small enough that we can memory-map them and make operations on the file as fast as operating on in-memory data.
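A sketch of that two-step read, under an assumed on-disk layout (each record prefixed by its length as a big-endian uint64) and assumed field names, neither of which is necessarily Dislog's exact format:

```go
package log

import (
	"encoding/binary"
	"os"
)

// segment pairs a store file (the record data) with an index that maps
// each record's relative offset to its byte position in the store.
type segment struct {
	store      *os.File // record data, appended continuously
	index      []indexEntry
	baseOffset uint64
}

type indexEntry struct {
	offset   uint32 // offset relative to the segment's base offset
	position uint64 // byte position of the record in the store file
}

// Read returns the record stored at the given absolute offset.
func (s *segment) Read(off uint64) ([]byte, error) {
	// Step 1: look up the record's position in the store via the index.
	entry := s.index[off-s.baseOffset]

	// Step 2: read the record from the store file at that position,
	// assuming a big-endian uint64 length prefix before the record data.
	lenBuf := make([]byte, 8)
	if _, err := s.store.ReadAt(lenBuf, int64(entry.position)); err != nil {
		return nil, err
	}
	record := make([]byte, binary.BigEndian.Uint64(lenBuf))
	if _, err := s.store.ReadAt(record, int64(entry.position)+8); err != nil {
		return nil, err
	}
	return record, nil
}
```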

Now you know how Dislog works.

How Dislog Serves Requests with gRPC

Most libraries can only be used on a single computer by a single person at a time. That person also has to learn the library’s API, run the code, and use the library's implementation to solve a problem, none of which most people will do, which limits the user base. We solved these problems by turning Dislog into a web service. Compared to a program that runs on a single computer, networked services provide three major advantages:

  • You can run it across multiple computers for availability and scalability.
  • It allows multiple people to interact with the same data.
  • It provides accessible interfaces that are easy for people to use.

We've made a service that allows multiple people to interact with the same data and runs across multiple computers. The best tool I’ve found for serving requests across distributed services is Google’s gRPC.

A gRPC service is essentially a group of related RPC endpoints; exactly how they’re related is up to you. A common example is a RESTful grouping where the relation is that the endpoints operate on the same resource, but the grouping could be looser than that. In general, it’s just the group of endpoints needed to solve some problem. In our case, the goal is to enable people to write to and read from their log. Creating Dislog's gRPC service involved defining it in protobuf and then compiling the protocol buffers into code comprising the client and server stubs, which we then implemented.
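A sketch of what that server implementation might look like, assuming a hypothetical generated package (the import path, the `api.*` types, and `RegisterLogServer` are placeholders standing in for whatever the protobuf compiler actually generates):

```go
package server

import (
	"context"
	"net"

	"google.golang.org/grpc"

	// Placeholder for the package produced by compiling the service's
	// protobuf definition; the generated names below are illustrative.
	api "example.com/dislog/api/v1"
)

// CommitLog is a small interface the server uses to talk to the log.
type CommitLog interface {
	Append(record *api.Record) (uint64, error)
	Read(offset uint64) (*api.Record, error)
}

// grpcServer implements the server-side stub generated from the
// protobuf service definition.
type grpcServer struct {
	api.UnimplementedLogServer
	commitLog CommitLog
}

// Produce appends the client's record to the log and returns its offset.
func (s *grpcServer) Produce(ctx context.Context, req *api.ProduceRequest) (*api.ProduceResponse, error) {
	offset, err := s.commitLog.Append(req.Record)
	if err != nil {
		return nil, err
	}
	return &api.ProduceResponse{Offset: offset}, nil
}

// Consume reads the record stored at the requested offset.
func (s *grpcServer) Consume(ctx context.Context, req *api.ConsumeRequest) (*api.ConsumeResponse, error) {
	record, err := s.commitLog.Read(req.Offset)
	if err != nil {
		return nil, err
	}
	return &api.ConsumeResponse{Record: record}, nil
}

// Serve wires the implementation into a gRPC server and starts serving.
func Serve(addr string, commitLog CommitLog) error {
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	srv := grpc.NewServer()
	api.RegisterLogServer(srv, &grpcServer{commitLog: commitLog})
	return srv.Serve(lis)
}
```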

Secured Services

Security in Dislog can be broken down into three sections:

  1. Encrypt data in-flight to protect against man-in-the-middle attacks;
  2. Authenticate to identify clients; and
  3. Authorize to determine the permissions of the identified clients.

The process by which a client and server communicate is kicked off by a TLS handshake. During this handshake, the client and server:

  1. Specify which version of TLS they’ll use;
  2. Decide which cipher suites (the set of encryption algorithms) they’ll use;
  3. Authenticate the identity of the server via the server’s private key and the certificate authority’s digital signature; and
  4. Generate session keys for symmetric encryption to use after the handshake.

Once the handshake is complete, the client and server can communicate securely.

Dislog uses TLS mutual authentication, also commonly referred to as two-way authentication, in which the server and the client each validate the other. Mutual authentication is most common in machine-to-machine communication, and in this setup both the server and the client use a certificate to prove their identity.
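A minimal sketch of a server-side TLS configuration for mutual authentication using Go's standard library, assuming PEM-encoded certificate, key, and CA files (the file paths are placeholders, and this is not necessarily how Dislog builds its config):

```go
package security

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// serverTLSConfig builds a TLS config that both proves the server's
// identity and requires clients to present certificates signed by the
// self-hosted CA.
func serverTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}

	// Trust the self-hosted CA so the server can verify client certificates.
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	return &tls.Config{
		Certificates: []tls.Certificate{cert}, // proves the server's identity
		ClientCAs:    pool,                    // CA used to verify client certificates
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}, nil
}
```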

Server and client authentication is implemented with a self-hosted CA built using CFSSL. For authorization, Dislog uses access control lists (ACLs). An ACL is a table of rules where each row says something like “subject A is permitted to do action B on object C.” Dislog uses Casbin, which supports enforcing authorization based on various access control models, including ACLs.
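A sketch of how such ACL enforcement might be wrapped with Casbin; the model and policy file names, and the `Authorizer` type, are placeholders rather than Dislog's actual identifiers:

```go
package security

import "github.com/casbin/casbin/v2"

// Authorizer wraps a Casbin enforcer configured with an ACL model and a
// policy file listing rows of the form "subject, object, action".
type Authorizer struct {
	enforcer *casbin.Enforcer
}

func NewAuthorizer(model, policy string) (*Authorizer, error) {
	enforcer, err := casbin.NewEnforcer(model, policy)
	if err != nil {
		return nil, err
	}
	return &Authorizer{enforcer: enforcer}, nil
}

// Authorize reports whether the subject may perform the action on the
// object, e.g. Authorize("client-a", "*", "produce").
func (a *Authorizer) Authorize(subject, object, action string) (bool, error) {
	return a.enforcer.Enforce(subject, object, action)
}
```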

Server-to-Server Service Discovery on Dislog

Dislog implements service discovery using a library called Serf. Serf maintains cluster membership by using an efficient, lightweight gossip protocol to communicate between the service’s nodes. Unlike service registry projects like ZooKeeper and Consul, Serf doesn’t have a central-registry architectural style. Instead, each instance of the service in the cluster runs as a Serf node. These nodes exchange messages with each other in the same way a zombie apocalypse might occur: one infected zombie soon spreads to infect everyone else. With Serf, instead of a spreading zombie virus, you’re spreading information about the nodes in your cluster. You listen to Serf for messages about changes in the cluster and then handle them accordingly.

The implementation of service discovery for Dislog using Serf is as follows (a sketch of these steps appears after the list):

  1. Run a Serf node on each server.
  2. Configure each Serf node with an address to listen on and accept connections from other Serf nodes.
  3. Configure each Serf node with the addresses of other Serf nodes so it can join their cluster.
  4. Handle Serf’s cluster discovery events, such as when a node joins or fails.
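A minimal sketch of those steps, assuming the github.com/hashicorp/serf package; the function and parameter names are illustrative rather than Dislog's exact ones:

```go
package discovery

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

// setupSerf creates a Serf node, tells it where to listen, joins the
// existing cluster, and handles membership events in the background.
func setupSerf(bindAddr string, bindPort int, nodeName string, startJoinAddrs []string) (*serf.Serf, error) {
	events := make(chan serf.Event)

	config := serf.DefaultConfig()
	config.NodeName = nodeName
	config.MemberlistConfig.BindAddr = bindAddr
	config.MemberlistConfig.BindPort = bindPort
	config.EventCh = events

	s, err := serf.Create(config)
	if err != nil {
		return nil, err
	}

	// Handle join/leave/failure events as they arrive.
	go func() {
		for e := range events {
			switch e.EventType() {
			case serf.EventMemberJoin, serf.EventMemberLeave, serf.EventMemberFailed:
				for _, member := range e.(serf.MemberEvent).Members {
					log.Printf("membership change: %s %s", e.EventType(), member.Name)
				}
			}
		}
	}()

	// Join the cluster via the addresses of any already-known nodes.
	if len(startJoinAddrs) > 0 {
		if _, err := s.Join(startJoinAddrs, true); err != nil {
			return nil, err
		}
	}
	return s, nil
}
```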

Serf is a lightweight tool with countless use cases, but its API can be verbose when you have one specific problem to solve. The specific job of Dislog's discovery layer is to tell us when a server joins or leaves the cluster, and what its ID and address are, with as small an API surface as possible.

Coordinated Services with Consensus on Raft

A Raft cluster has one leader and the rest of the servers are followers. The leader maintains power by sending heartbeat requests to its followers, effectively saying: “I’m still here and I’m still the boss.” If the follower times out waiting for a heartbeat request from the leader, then the follower becomes a candidate and begins an election to decide the next leader. The candidate votes for itself and then requests votes from the followers. “The boss is gone! I’m the new boss, right?” If the candidate receives a majority of the votes, it becomes the leader, and it sends heartbeat requests to the followers to establish authority: “Hey y’all, new boss here.” Followers can become candidates simultaneously if they time out at the same time waiting for the leader’s heartbeats. They hold their own elections and the elections might not result in a new leader because of vote splitting. So they hold another election. Candidates will hold elections until there’s a winner that becomes the new leader.

Every Raft server has a term: a monotonically increasing integer that tells other servers how authoritative and current this server is. The servers’ terms act as a logical clock: a way to capture chronological and causal relationships in distributed systems, where real-time clocks are untrustworthy and unimportant. Each time a candidate begins an election, it increments its term. If the candidate wins the election and becomes the leader, the followers update their terms to match, and the terms don’t change until the next election. Servers vote once per term for the first candidate that requests their vote, as long as the candidate’s term is greater than the voter’s. These conditions help prevent vote splits and ensure the voters elect an up-to-date leader.
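A toy sketch of the voting rule just described, not real Raft code (it omits Raft's log up-to-date check and all RPC machinery): a server rejects candidates from older terms, adopts newer terms, and votes at most once per term.

```go
package election

// server holds just the state needed to illustrate the voting rule.
type server struct {
	currentTerm uint64
	votedFor    string // empty if no vote has been cast in currentTerm
}

// requestVote reports whether this server grants its vote to the candidate.
func (s *server) requestVote(candidateID string, candidateTerm uint64) bool {
	// Reject candidates from an older term.
	if candidateTerm < s.currentTerm {
		return false
	}
	// Seeing a newer term updates this server's term and clears its vote.
	if candidateTerm > s.currentTerm {
		s.currentTerm = candidateTerm
		s.votedFor = ""
	}
	// Vote at most once per term, for the first candidate that asks.
	if s.votedFor == "" || s.votedFor == candidateID {
		s.votedFor = candidateID
		return true
	}
	return false
}
```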

Raft's Setup

A Raft instance comprises the following (a wiring sketch appears after the list):

  • A finite-state machine that applies the commands you give Raft;
  • A log store where Raft stores those commands;
  • A stable store where Raft stores the cluster’s configuration—the servers in the cluster, their addresses, and so on;
  • A snapshot store where Raft stores compact snapshots of its data; and
  • A transport that Raft uses to connect with the server’s peers.
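A sketch of wiring those five pieces together, assuming the github.com/hashicorp/raft package; Dislog's actual stores and transport setup may differ (for example, it could back the log store with its own log rather than an in-memory store).

```go
package consensus

import (
	"net"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

// newRaft assembles a Raft instance from its five components. The
// in-memory stores are for illustration only.
func newRaft(nodeID, bindAddr string, fsm raft.FSM) (*raft.Raft, error) {
	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID(nodeID)

	// Log store for Raft's commands and stable store for cluster configuration.
	logStore := raft.NewInmemStore()
	stableStore := raft.NewInmemStore()

	// Snapshot store for compact snapshots of the FSM's state.
	snapshotStore := raft.NewInmemSnapshotStore()

	// Transport that connects this server to its peers.
	addr, err := net.ResolveTCPAddr("tcp", bindAddr)
	if err != nil {
		return nil, err
	}
	transport, err := raft.NewTCPTransport(bindAddr, addr, 3, 10*time.Second, os.Stderr)
	if err != nil {
		return nil, err
	}

	return raft.NewRaft(config, fsm, logStore, stableStore, snapshotStore, transport)
}
```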