#Day7 - Storage Engines (LSM-Tree)

A series of posts to understand databases better

Feb 08, 2023

LSM (Log-Structured Merge) tree is a data storage algorithm used for storing and managing large amounts of data in a log-structured way. It is a variant of the B-tree, which is a type of self-balancing tree data structure.

The LSM tree is designed to handle high write loads and high levels of data fragmentation. It uses a log-structured approach for storing data, which provides efficient write performance, scalability, and data durability. It is a popular choice for databases and distributed systems which handle high write loads and high levels of data fragmentation.

How do LSM trees work?

No alt text provided for this image — creativcoder.dev

An LSM tree is composed of two types of data structures: memtables and sstables (sorted string tables). Memtables are small, frequently written-to data structures that hold recently added data. Sstables are larger, less frequently written-to data structures that hold older data that has been merged from memtables.

When new data is added to the LSM tree, it is first written to the active memtable. Once the memtable reaches a certain size, it is merged with one or more existing sstables to form a new sstable. This process, known as compaction, helps to reduce fragmentation and improve overall performance.

The LSM tree also uses a technique called write-ahead logging (WAL) to ensure data consistency and durability in the event of system crashes or power failures. This involves writing changes to a log file before they are committed to the LSM tree, so that any incomplete changes can be rolled back in the event of an interruption.

When a query is made, LSM tree first checks the memtable for the required data, if it is not found, it will check the sstables in order of recentness. This design gives a trade-off between read performance and write performance. It is optimized for write-heavy workloads, but with a cost of slower read performance.

What are the advantages of LSM-Tree?

High write performance: LSM trees are optimized for write-heavy workloads, as they allow new data to be added to the tree quickly and efficiently. The log-structured approach used by LSM trees reduces the need for random writes, which can slow down write performance in traditional B-trees.
Scalability: LSM trees can handle large amounts of data and high levels of data fragmentation, making them well-suited for use in databases and distributed systems. They can scale to handle very large datasets and handle high write-loads.
Data durability: LSM trees use write-ahead logging (WAL) to ensure data consistency and durability in the event of system crashes or power failures. This helps to protect data from loss or corruption.
Compaction: LSM trees use a compaction process to merge small, frequently written-to data structures called memtables into larger, less frequently written-to data structures called sstables. This helps to reduce fragmentation and improve overall performance.
Disk Space efficiency: LSM tree's compaction process helps in reducing the disk space usage by removing the duplicate keys.

What are the disadvantages of LSM-Tree?

Slower read performance: Because LSM trees are optimized for write-heavy workloads, read performance can be slower than in traditional B-trees. This is because read requests may need to check multiple sstables in order to find the requested data.
Increased disk space usage: LSM trees can use more disk space than traditional B-trees because of the compaction process, which creates multiple sstables. Each sstable contains a copy of the same data, which can lead to increased disk space usage.
Increased write amplification: LSM trees can lead to increased write amplification, which is the number of writes that must be performed to the storage device in order to update a single logical value. This is because the compaction process generates many small writes that are spread across multiple sstables.
Not good for update-heavy workloads: LSM tree is not well suited for update-heavy workloads, where data is frequently updated or deleted. As the LSM tree stores the data in sorted order, it can cause a lot of disk I/O when updating the data.
Large memory requirements: LSM tree requires a large amount of memory to hold the memtable, which can be a problem for systems with limited memory.
Not good for random access workloads: LSM tree is not well suited for random access workloads, where data is accessed randomly. As the LSM tree stores the data in sorted order, it can cause a lot of disk I/O when accessing the data randomly.

How databases that use LSM-Tree guarantee durability from server crashes?

https://www.researchgate.net/publication/324640550_A_survey_on_NoSQL_stores

LSM tree uses WAL to write data changes to a log file before committing them to the LSM tree. This way, if there is any interruption, the LSM tree can restore the data to its previous state by replaying the log file, ensuring data consistency and durability.

When a new data is added to the LSM tree, it is first written to the active memtable. Before committing the changes to the memtable, the changes are also written to a log file. This log file is called write-ahead log (WAL). The WAL acts as a buffer between the memtable and the hard disk.

In the event of a system crash or power failure, the LSM tree can use the WAL to restore the memtable to its previous state. The log records are replayed from the WAL to restore the in-memory data structures. This process helps to ensure that any changes that were in progress at the time of the interruption are rolled back, so that the data remains consistent.

Once the data is written to the WAL, it can then be committed to the memtable. After that, the memtable can be compacted with one or more existing sstables to form a new sstable. The WAL file is then truncated, and the process repeats.

What types of databases are LSM trees useful for?

LSM (Log-Structured Merge) trees are well-suited for databases and distributed systems that handle high write loads and high levels of data fragmentation. They are particularly well-suited for the following types of databases:

Time-series databases: LSM trees are well-suited for time-series data, which is naturally sorted by time. They can handle large amounts of data and high write loads, making them a good choice for storing time-series data such as sensor data, financial data, and log data.
Big Data databases: LSM trees are well-suited for storing large amounts of data and can handle high write loads, making them a good choice for big data databases. They are used in distributed systems such as Apache Cassandra and Google's Bigtable.
Append-only workloads: LSM trees are well-suited for append-only workloads, where new data is added but no data is deleted or modified. This is because the compaction process can handle high levels of data fragmentation, and the WAL can ensure data durability.
Event sourcing databases: LSM trees are well-suited for event sourcing databases, where the data is stored as a series of events. They can handle large amounts of data and high write loads, making them a good choice for storing event data.
IoT databases: LSM trees are well-suited for IoT databases, where large amounts of data are generated by IoT devices. They can handle high write loads and high levels of data fragmentation, making them a good choice for storing IoT data.

What popular databases use LSM-Trees?

Several popular databases use LSM (Log-Structured Merge) Trees as their underlying data storage algorithm, including:

Apache Cassandra: Cassandra is a highly scalable, distributed NoSQL database that uses LSM trees for data storage. It is designed to handle large amounts of data across many commodity servers, and is used by many large organizations, including Netflix, eBay, and Twitter.
Google Bigtable: Bigtable is a distributed NoSQL database that is used internally by Google for many of its services, including Google Earth, Google Maps, and Gmail. It is based on LSM trees and is designed to handle very large amounts of data.
HBase: HBase is an open-source, distributed NoSQL database that is built on top of the Hadoop Distributed File System (HDFS). It is modeled after Google's Bigtable and uses LSM trees for data storage.
RocksDB: RocksDB is an open-source embedded key-value store that is based on LSM trees. It is a library that can be used as a storage engine for various databases such as MySQL, Redis and MongoDB.
LevelDB: LevelDB is an open-source key-value store that is also based on LSM trees. It is designed to be a fast, low-level storage engine for key-value data.
Badger: Badger is an open-source key-value store that uses LSM (Log-Structured Merge) trees as its underlying data storage algorithm. It is built on top of the RocksDB library, which is also based on LSM trees.

Notes:

1) Issues with compaction in LSM-Tree

Compaction is an important process in LSM (Log-Structured Merge) trees, as it helps to maintain the performance of the database by merging multiple smaller sstables into larger ones. However, compaction can also cause some issues for LSM trees, including:

Increased write amplification: As compaction generates many small writes that are spread across multiple sstables, it can lead to increased write amplification, which is the number of writes that must be performed to the storage device in order to update a single logical value.
Performance degradation during compaction: Compaction can cause performance degradation, as it can require a lot of disk I/O and CPU resources. This can cause the database to slow down or become unresponsive during compaction.
Difficulty in handling large amounts of data: LSM trees can have difficulty handling very large amounts of data, as the compaction process can generate a large number of sstables.

To address these issues, a specific compaction strategy called "Levelled Compaction" is used in some LSM tree-based databases. In this approach, the data is divided into levels, and each level contains a set of sstables (Sorted String Table). Each level is roughly 10 times larger than the previous level. When a new sstable is added to the highest level, if the number of sstables in that level exceeds a certain threshold, the compaction process begins. The compaction process merges some of the sstables in that level into larger sstables, and the resulting sstables are moved to the next lower level.

The advantage of this approach is that it reduces the number of sstables that need to be compacted at any given time, which reduces the write amplification and improves the performance of the database. It also allows for more efficient use of disk space, as the smaller sstables in the higher levels are merged into larger sstables in the lower levels. This approach is used in some LSM tree-based databases such as RocksDB and LevelDB.

2) LSM-Tree + B-Tree

A combined approach of B-tree and LSM (Log-Structured Merge) tree is sometimes used in databases to take advantage of the strengths of both data structures. This approach is sometimes referred to as a "hybrid" approach.

In this approach, a B-tree is used for read-heavy workloads, as it provides efficient lookups and range queries. The B-tree is also used for small updates and deletions, as it can handle these operations efficiently.

On the other hand, an LSM tree is used for write-heavy workloads, as it can handle high write loads and high levels of data fragmentation. The LSM tree is also used for large updates and deletions, as it can handle these operations efficiently.

The use of a B-tree and an LSM tree together allows the database to take advantage of the strengths of both data structures, and can provide improved performance and scalability. This approach is used in some databases like RocksDB and LevelDB.

3) Other approaches

There has been a fair bit of further work building on the LSM approach. Yahoo developed a system called Pnuts which combines LSM with B trees and demonstrates better performance. There are also related approaches which have similar properties but retain an overarching structure. These include Fractal Trees and Stratified Trees.

We will try discussing fractal trees next.

#database #sql #algorithms #systemdesign #arriqaaq #datastructures #immudb #mysql #postgres #mongodb #berkeleydb #tile38

Arriqaaq Newsletter

Discussion about this post