The Fragmented Log-Structured Merge Tree (FLSM) is a database storage data structure that draws on skip lists to organize data. It was first introduced in the PebblesDB paper published by VMware Research.
In this algorithm, data is stored in fragments (chunks) in a log-structured manner. The data is kept sorted, and lookups are served through a skip-list-style index. Skip lists are a data structure that provides fast search over large sorted datasets; here, skip-list-style levels index the fragments, which in turn allows fast access to the data. The merge operation in FLSM combines smaller fragments into larger ones to reduce the number of I/O operations required, improving performance.
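To make the skip-list idea concrete, here is a minimal, deterministic multi-level index in Python. This is a sketch for illustration only: real skip lists randomize their levels, and FLSM's guards are merely inspired by this structure rather than an implementation of it.

```python
# Sketch of a skip-list-style index: sparse upper levels let a search
# skip over most of the dense, sorted bottom level.
def build_index(sorted_keys, levels=3):
    # Level 0 holds every key; each higher level keeps every 2nd key below it.
    index = [sorted_keys]
    for _ in range(levels - 1):
        index.append(index[-1][::2])
    return index[::-1]  # sparsest level first, densest last

def search(index, key):
    pos = 0
    for depth, level in enumerate(index):
        # Walk right while the next key is still <= the target.
        while pos + 1 < len(level) and level[pos + 1] <= key:
            pos += 1
        if depth < len(index) - 1:
            pos *= 2  # the same key sits at twice the position one level down
    return index[-1][pos] == key

index = build_index(list(range(0, 100, 5)))
print(search(index, 35))  # True
print(search(index, 37))  # False (37 was never inserted)
```

Each descent halves the search range, which is what gives skip lists their fast lookups over large sorted data.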
Understanding the Problem
One fundamental problem that remains is the high write amplification of key-value stores under write-intensive workloads. Write amplification is the ratio of total write IO performed by the store to the total user data written. High write amplification is harmful because it:
increases the load on storage devices such as SSDs, which tolerate only a limited number of write cycles before the bit error rate becomes unacceptable
causes premature device wear-out and higher storage cost
reduces write throughput: in the RocksDB key-value store, write amplification reduces write throughput to 10% of read throughput
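The definition above is a simple ratio; a tiny Python sketch makes it explicit (the function name and the example numbers are mine, for illustration only):

```python
# Write amplification = total bytes the store writes to the device
# divided by the bytes of user data actually ingested.
def write_amplification(user_bytes, device_bytes):
    """Ratio of total store write IO to logical user data."""
    return device_bytes / user_bytes

# Illustrative example: 1 GB of user data that compaction rewrites
# repeatedly, so 42 GB actually hit the device.
print(write_amplification(user_bytes=1, device_bytes=42))  # 42.0
```

A write amplification of 42 means the device absorbs 42x the IO the user logically issued, which is exactly the wear and throughput cost described above.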
Key-value stores such as LevelDB and RocksDB are built on top of the log-structured merge tree (LSM) data structure, and their high write amplification can be traced back to the data structure itself (§2 of the paper). LSM stores maintain data in sorted order on storage, enabling efficient querying. However, when new data is inserted into an LSM store, existing data is rewritten to maintain the sorted order, resulting in large amounts of write IO.
Key Idea to Reduce Write Amplification
The Fragmented Log-Structured Merge Tree (FLSM) combines ideas from the skip list and log-structured merge tree data structures with a novel compaction algorithm. FLSM strikes at the root of write amplification by drastically reducing (and in many cases eliminating) data rewrites; instead, it fragments data into smaller chunks that are organized on storage using guards. Guards allow FLSM to find keys efficiently. Write operations on LSM stores are often stalled or blocked while data is compacted (rewritten for better read performance); by drastically reducing write IO, FLSM makes compaction significantly faster, thereby increasing write throughput.
Root Cause of Write Amplification
Figure 2 of the paper illustrates compaction in an LSM key-value store. The key-value store initially contains two sstables in Level 1. Assume Level 0 is configured to hold only one sstable at a time; when this limit is reached, compaction is triggered. At time t1, one sstable is added, and a compaction is triggered at t2. Similarly, sstables are added at t3 and t5, and compactions are triggered at t4 and t6. When compacting an sstable, all sstables in the next level whose key ranges intersect with the sstable being compacted are rewritten. In this example, since the key ranges of all Level 0 sstables intersect with those of all Level 1 sstables, the Level 1 sstables are rewritten every time a Level 0 sstable is compacted. In this worst case, the Level 1 sstables are rewritten three times while compacting a single upper level. Thus, the high write amplification of LSM key-value stores can be traced to repeated rewrites of sstables during compaction.
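A toy worst-case model (my own illustration, not LevelDB's code) shows how the rewrites compound. It assumes every Level-0 flush overlaps all of Level 1, so all existing Level-1 sstables are rewritten on each compaction:

```python
# Worst-case LSM compaction model: each compaction rewrites every
# existing Level-1 sstable and grows Level 1 by one sstable.
def level1_sstables_rewritten(num_compactions, level1_size):
    total = 0
    size = level1_size
    for _ in range(num_compactions):
        total += size  # all current Level-1 sstables are rewritten
        size += 1      # Level 1 grows by the newly compacted sstable
    return total

# The scenario above: three compactions (t2, t4, t6) against a Level 1
# that starts with two sstables.
print(level1_sstables_rewritten(3, 2))  # 9 sstable rewrites in total
```

Even this tiny example shows rewrite work growing with every compaction, which is the behavior FLSM sets out to eliminate.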
How It Works
FLSM can be seen as a blend of the LSM data structure and a skip list, together with a novel compaction algorithm that reduces write amplification and increases write throughput. The fundamental problem with log-structured merge trees is that sstables are typically rewritten multiple times as new data is compacted into them. FLSM counters this by fragmenting sstables into smaller units. Instead of rewriting an sstable, FLSM's compaction simply appends a new sstable fragment to the next level. Doing so ensures that data is written exactly once in most levels; a different compaction algorithm is used for the last few (highest-numbered) levels. FLSM achieves this using a novel storage layout that organizes data using guards.
In the classical LSM, each level contains sstables with disjoint key ranges (i.e., each key will be present in exactly one sstable per level). The chief insight in this work is that maintaining this invariant is the root cause of write amplification, as it forces data to be rewritten within the same level. The FLSM data structure discards this invariant: each level can contain multiple sstables with overlapping key ranges, so a key may be present in multiple sstables. To quickly find keys in each level, FLSM organizes the sstables using guards (inspired by the skip list data structure).
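The append-instead-of-rewrite behavior can be sketched in a few lines of Python. Class and method names here are my own for illustration; this is not PebblesDB's actual API:

```python
import bisect

class FlsmLevel:
    """Sketch of one FLSM level: guard keys partition the key space, and
    each guard holds a list of fragments that may overlap in key range."""

    def __init__(self, guard_keys):
        self.guard_keys = sorted(guard_keys)  # e.g. [10, 100]
        # One fragment list per guard range (including keys below the first guard).
        self.fragments = [[] for _ in range(len(guard_keys) + 1)]

    def guard_index(self, key):
        # Binary search for the guard whose range contains `key`.
        return bisect.bisect_right(self.guard_keys, key)

    def append_fragment(self, pairs):
        # FLSM compaction: split incoming data by guard and APPEND.
        # Existing fragments are never rewritten.
        for key, value in pairs:
            self.fragments[self.guard_index(key)].append((key, value))

    def get(self, key):
        # A key may live in several fragments under one guard, so scan them.
        for k, v in self.fragments[self.guard_index(key)]:
            if k == key:
                return v
        return None

level = FlsmLevel([10, 100])
level.append_fragment([(5, "a"), (50, "b")])
level.append_fragment([(7, "c")])  # appended; nothing is rewritten
print(level.get(50))  # b
```

Note the trade-off this sketch exposes: writes become cheap appends, but a lookup may have to scan multiple fragments within a guard, which is why reads can be slower (see the disadvantages below).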
Each level contains multiple guards. Guards divide the key space (for that level) into disjoint units. Each guard Gi has an associated key Ki, chosen from among the keys inserted into the FLSM. Each level contains more guards than the level above it; the guards get progressively more fine-grained as data is pushed deeper into the FLSM. For more details, I've attached the paper for reference below.
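One simple way to picture the guard hierarchy is to sample guard keys from the inserted keys, with the guard count growing per level. The doubling scheme below is an assumption of mine for illustration; the paper defines its own guard-selection policy:

```python
import random

def pick_guards(inserted_keys, num_levels, base=2):
    """Illustrative guard selection: deeper levels get more guards,
    sampled from keys actually inserted into the store."""
    random.seed(0)  # deterministic for the example
    levels = []
    for depth in range(num_levels):
        count = base ** (depth + 1)  # deeper level -> more guards
        levels.append(sorted(random.sample(inserted_keys, count)))
    return levels

guards = pick_guards(list(range(1000)), num_levels=3)
print([len(g) for g in guards])  # [2, 4, 8]
```

The growing guard counts are what make lower levels progressively more fine-grained, mirroring how a skip list gets denser toward its bottom level.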
Disadvantages
Complexity: The algorithm is more complex than a classical LSM, which makes implementation and maintenance challenging.
Overhead: The guard metadata and supporting structures require additional memory, which can result in performance degradation.
Slower reads: Because a key may reside in multiple sstables within a level, lookups may need to examine several fragments, which can slow read performance.
Reference
PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees (SOSP 2017): https://research.vmware.com/files/attachments/0/0/0/0/0/2/7/sosp17-pebblesdb.pdf