There are a few common software techniques used to improve the fault tolerance of a distributed database. These techniques ensure that the database can continue to serve read/write requests during a partial failure and can recover from the failure without data loss. Note: the techniques listed below mainly apply to disk-based databases and are less common in in-memory databases.
Replication
Replicate the data across different nodes. If some nodes fail, the database can still use the remaining nodes to continue serving read and write requests. MongoDB is an example: write operations are applied on a primary and replicated to the other nodes asynchronously by default. This means the data is eventually consistent (there is a delay of a few milliseconds before an update reaches a majority of nodes) unless the user explicitly asks for strongly consistent reads, which comes at the cost of latency.
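To make the trade-off concrete, here is a minimal sketch using MongoDB's Python driver (pymongo). It assumes a local replica set named rs0 is already running; the database and collection names are made up for illustration:

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client.inventory

# Write with the deployment's default write concern; replication to the
# remaining secondaries still happens asynchronously in the background.
db.items.insert_one({"sku": "a-1", "qty": 10})

# Stronger guarantee: wait until a majority of nodes have the write
# (and read back majority-committed data), trading latency for consistency.
strong = db.get_collection(
    "items",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)
strong.insert_one({"sku": "a-2", "qty": 5})
print(strong.find_one({"sku": "a-2"}))
```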
Erasure Coding
Instead of naively replicating the entire data set across different nodes, erasure coding lets us split the data into chunks and distribute those across the nodes, reconstructing the data from the chunks when a request is received. This reduces the storage overhead compared to keeping a full copy of the data on every node.
Erasure coding has one more major advantage: depending on the encoding scheme used to generate the chunks, we can tolerate a higher number of node failures. For example, if we split data D into 4 data chunks (D1, D2, D3, D4) and generate 2 parity chunks (P1, P2) using the encoding scheme (Reed-Solomon is a popular scheme for generating parity chunks), we can distribute the 6 chunks (D1, D2, D3, D4, P1, P2) across 6 nodes. With this setup we can tolerate up to 2 node failures and still fully reconstruct the data from any 4 of the remaining chunks. In general, with K data chunks and M parity chunks, the system can tolerate up to M node failures.
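Implementing Reed-Solomon is beyond the scope of this post, but a toy single-parity (XOR) scheme, i.e. K data chunks plus M = 1 parity chunk, is enough to show the split-and-reconstruct idea. This is only a sketch with made-up function names, not how production systems do it:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-sized chunks plus one XOR parity chunk."""
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(k * size, b"\0")      # real systems also store the length
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(xor, chunks)]     # D1..Dk + P1

def reconstruct(chunks: list) -> list:
    """Recover at most one lost chunk (None) by XOR-ing the survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    assert len(missing) <= 1, "one parity chunk tolerates only one failure"
    if missing:
        chunks[missing[0]] = reduce(xor, [c for c in chunks if c is not None])
    return chunks[:-1]                        # drop the parity chunk

chunks = encode(b"hello erasure coding", k=4)
chunks[2] = None                              # simulate losing node 3
data = b"".join(reconstruct(chunks)).rstrip(b"\0")
assert data == b"hello erasure coding"
```

With Reed-Solomon the same idea generalizes to M parity chunks, which is what gives the M-failure tolerance described above.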
Side note: check out this blog post from Marc Brooker, where he details how erasure coding can also be used to optimize tail latency in applications.
Write Ahead Logging
As the name suggests, before a transaction is actually applied to the data, the instruction is written to an append-only log file. This form of logging is called write-ahead logging (WAL). How does it help? If a failure occurs while committing a transaction, the WAL can simply be replayed to bring the data back to the expected state, making failure recovery easy.
A couple of advantages of this approach: we don't need to persist the full state of the data, only the transaction record, which is much smaller. Also, because WAL appends are small, they can be quickly flushed to disk before a transaction is considered "committed".
Note that when databases read data from disk, they usually have to read an entire block/page. Once the block has been read, i.e., loaded into memory, it can be modified or read at a given offset ("address"). If our database is expected to sustain high throughput, i.e., many concurrent reads and writes, the I/O cost of constantly moving blocks in and out of memory becomes very expensive and slows down query performance. Therefore, instead of flushing to disk for every transaction (which would introduce a lot of overhead), the database flushes a block once every X transactions or before it is evicted from memory. What if we crash before we can flush the updated block to disk? This is where the WAL really saves us: we can simply replay the transactions from the WAL to redo the lost work. Fun fact: the WAL is also called a redo log for this reason.
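Here is a minimal sketch of the log-then-apply pattern for a toy key-value store. It uses JSON lines for readability; a real WAL would use binary records with checksums and log sequence numbers:

```python
import json
import os

class TinyKV:
    def __init__(self, wal_path: str = "wal.log"):
        self.wal_path = wal_path
        self.data: dict[str, str] = {}
        self._replay()                       # crash recovery on startup

    def set(self, key: str, value: str) -> None:
        record = json.dumps({"op": "set", "key": key, "value": value})
        with open(self.wal_path, "a") as wal:
            wal.write(record + "\n")
            wal.flush()
            os.fsync(wal.fileno())           # durable before we "commit"
        self.data[key] = value               # apply only after the log write

    def _replay(self) -> None:
        """Redo every logged operation to rebuild in-memory state."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                rec = json.loads(line)
                if rec["op"] == "set":
                    self.data[rec["key"]] = rec["value"]
```

Kill the process at any point after set() returns and a fresh TinyKV() will replay the log and come back with the same data.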
Checkpointing
There is no point in keeping every transaction in the WAL once we know it has been applied to the data and flushed to disk. Therefore, the database periodically reconciles the data with the transactions in the WAL and purges the log. This is called checkpointing: a periodic marker indicating that all transactions up to that point have been flushed to disk. Now, if the database crashes before flushing to disk, we only have to replay the transactions after the last checkpoint in the WAL.
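Continuing the toy TinyKV sketch from above, a simplistic checkpoint could look roughly like this: snapshot the in-memory state durably, then truncate the WAL, so recovery only needs the snapshot plus whatever was logged after the checkpoint (the snapshot path and helper are made up for illustration):

```python
import json
import os

def checkpoint(kv: TinyKV, snapshot_path: str = "snapshot.json") -> None:
    """Flush the current state to disk, then purge the WAL."""
    tmp = snapshot_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(kv.data, f)
        f.flush()
        os.fsync(f.fileno())              # snapshot is durable...
    os.replace(tmp, snapshot_path)        # ...and swapped in atomically
    open(kv.wal_path, "w").close()        # now the old log entries can go
```

On restart, recovery would first load snapshot.json and then replay only the (now much shorter) WAL, which is exactly what keeps crash recovery fast in real databases.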