This article gives a glimpse of internals of some of the popular storage structures used in databases and distributed systems.
The following storage structures will be covered in this article: Bloom Filter, LSM Trees, B+ Trees, Inverted Index, Merkel Trees, Consistent Hashing, Skip lists, HyperLogLog, Count Min Sketch.
Bloom Filter is a probabilistic data structure that is used to determine the membership of an element in a set of elements.
It is probabilistic because it can produce false-positive matches (The data structure can return the result saying an element is possibly present in the set which may not be 100% accurate), but it cannot produce false-negatives (If the data structure returns the result saying an element is not present, it is 100% accurate).
It is mainly used as an auxiliary data structure alongside many databases like Hbase, Cassandra, Postgresql which is used to skip the underlying disk access for false-negative scenarios.
The primary data structure used in Bloom Filter is a Bit Vector. The initial state of the Bit Vector will be all empty (0s). Each Bloom Filter is associated with k hash functions whose job is to accept input and the input is passed through each of the k hash function and generated bit indices are set to 1 in the Bit Vector.
During the existence check, the input is passed through the same k hash functions and the generated bit indices are bitwise ANDed to check if there are any bits whose value is 0.
If any of the value is 0 (false-negative), it signifies element is not present in the set which otherwise would have been set by the hash function during the insert operation. If all the bit values are 1, then the element might be present (false-positive). The reason for this probability is because the hash function would have set the bits to 1 for a different input.
Bloom Filter cannot account for the deletion of entries since it does not store the actual data in the data structure and unsetting the bits might lead to unsetting the values for other inputs generated by the hash function.
Bloom Filter is highly space-efficient since the resulting Bit Vector can be stored as a number in the disk and loaded back into memory. The time complexity of insert and existence operation is O(k) and it is largely dependent on the underlying hash function.
Some considerations for Bloom Filter data structure
- Larger the Bit vector, lesser the probability of false-positives since the set bits will be uniformly distributed over the data structure
- Choosing an optimal number of hash functions is important and hash functions should be computationally cheaper. The optimal number of hash functions can be determined by three parameters n (number of items in the data structure), p (probability of false-positives acceptable) and m (number of bits in the bit vector). https://hur.st/bloomfilter/?n=4000&p=1.0E-7&m=10000&k= gives a calculator to find the optimal number of hash functions for the use-case.
Some of the use-cases
- It is mainly used as an auxiliary data structure alongside many databases like Hbase, Cassandra, Postgresql which is used to skip the underlying disk access for non-existent elements
- Used by Medium website to skip showing articles to the user who has already read it.
- The Google Chrome web browser used to use a Bloom filter to identify malicious URLs.
LSM Tree (Log-Structured Merge tree)
LSM (Log-Structured Merge) Tree is a data structure that stores key-value pairs that are optimized for write-heavy workloads and is used in many popular databases such as Hbase, Cassandra, etc…
The fundamental idea behind LSM is — writes are appended to a segmented log file that is sorted which makes writes extremely fast and efficient and reads will go through multiple segment files to find the latest entry for a given key. Periodically, segment files are compacted in the background to reconcile and remove duplicate entries and form larger compacted segment files.
Following are major components which form an LSM Tree
- An in-memory balanced binary search tree structure such as AVL Tree or Red-Black Tree often called memtables.
- A disk-based data structure that sorts data by keys called SSTables (Sorted String Tables). There could be multiple SSTables on disk at any given point.
- A WAL (Write-Ahead Log) captures every write operation made to the system in an append-only fashion.
- Bloom Filter to reduce disk access on false-negatives.
Memtables are self-balanced binary search trees whose time complexities of inserts, updates, and reads are O(logN). It also prevents the tree from getting skewed in one direction by periodically balancing the left and right subtrees.
SSTable (Sorted String Table) is an on-disk data structure that packs key-value pairs sorted by key. It is accompanied by an index file (often implemented as a B+ tree) to speed up lookups on SSTable by key.
SSTable stores neighboring key-value together in a compressed fashion and the index file only stores keys which point to the start of the compressed block. This way the index file is sparse and space-efficient without compromising on range-scans.
Every write of key-value pair goes through the active memtable and stays there till the size of memtable grows beyond a certain threshold.
This write operation is also appended to WAL(Write-Ahead Log) data structure before considering the write successful so that writes are persistent on crashes and restarts since memtables are memory-resident.
When the size of the memtable goes beyond a certain threshold, the contents are flushed to disk as an SSTable. During the flushing process, a new memtable is created for receiving writes and reads, and the previous memtable is available for serving reads until the flushing to SSTable is complete.
Once the flushing is complete, the previous memtable in memory is discarded and new memtable is available for reads and writes and the flushed SSTable is available for reads.
Every read operation first goes through a Bloom Filter to reduce disk access on false-negatives.
If the result of Bloom Filter turns to be a false-positive, it queries the current in-memory memtable which stores the latest snapshot of the key. Reads on memtables will be in the order of O(logN). If the value is not present in the memtable, then it queries SSTables on disk.
At a given point in time, there could be multiple SSTable data structures on disk. The search is performed on SSTables from the latest SSTable to the oldest till the value is found. It uses an index file on each SSTable to speed up lookups.
Reads are generally slower than writes in LSM because the search has to propagate all the way from memtables to one or more SSTables till the value is found. The number of SSTables is kept to minimum by periodically compacting and merging older segments into newer segments.
Deletes are a special case that is performed by appending a special deletion record called tombstone which signals the deletion of a key. Read queries understand tombstone while going through multiple SSTables and return the response to the client appropriately. Compaction considers tombstone records and deletes actual data pair during the merging process.
Since memtables get flushed at regular intervals to SSTables, the number of SSTables on disk increases over a period of time and the keys will be duplicated across SSTables which affects disk space and also read queries. This introduces the need for compaction and merging of segment files.
Since the content of individual SSTable is sorted, a multiway merge-sort algorithm is used for merging. A multi-way merge sort using a priority queue such as min-heap which can hold up-to N elements where N is the number of uncompacted/unmerged segment files.
Each SSTable eligible for compaction has an iterator and the first-element of each SSTable is inserted into the priority queue and min element is popped out of the heap and added to the final segment. The iterator associated with the popped element inserts the next-in-line element from SSTable into the heap and this process continues till all the iterators are exhausted.
There are different strategies for compaction
- Size-tiered compaction where compaction is triggered when there are enough smaller segments of similar sizes and these smaller segment files are merged into larger segment files.
- Level tiered compaction which uses small, fixed sized SSTables divided into different levels which guarantee keys to be non-overlapping within the same level.
LSM trees are used in many popular databases like RocksDB, WiredTiger (storage engine of MongoDB), LevelDB, Hbase, Cassandra, ScyllaDB.
B+ tree is a self-balanced binary search tree more suitable for on-disk storage and is used in many popular databases such as MySQL, Postgresql, MongoDB, etc…
It is a mutable data structure which updates data-records in-place directly in the target location.
In-memory Balanced Search Trees such as AVL Trees, Red-Black Trees cannot be used for on-disk implementation because of the low fanout and larger height.
Fanout is the maximum number of children allowed per node. Since balanced search trees have a fanout of 2, we need log2(N) searches to search an element. For an on-disk implementation, it means log2(N) disk seeks which is very expensive.
Larger height affects the locality of the parent-child relationship which in-turn increases the number of random disk seeks.
So the primary requirement of balanced search trees to be used in on-disk implementation are high fanout and low height.
B+ trees work by breaking down the database into fixed-size pages or blocks and read/write a page at a time.
B+ trees consist of three different types of nodes. Each node can hold up-to N keys and N+1 pointers to the child nodes.
- Root node: This has no parent and can hold up-to N keys.
- Leaf nodes: This has no child node and it contains the actual data.
- Internal nodes: These are the connecting nodes from the root node to lead nodes. They store separator keys that guide the search of an element from root to the leaf node.
Keys stored in B+ trees are called separator keys (Ex: 1, 2) which split the tree into two sub-trees. These keys are sorted which allows us to perform binary search on it.
The pointer between two separator keys includes elements holding the corresponding range. The pointer to the left of the separator key includes elements that are lesser than the key and the pointer to the right of the separator key includes elements that are greater than the key.
Since all the values are stored on the leaf node in sorted order, range scans are simplified. Some of the B+ tree implementations form a doubly-linked list on keys to facilitate traversal in either direction.
Given a key, the search starts from root to leaf to find the node with a separator key which is greater than the search value which we call predecessor node. Once the target location is found, the key and value are appended to it linking it to predecessor node.
If the target node does not have enough space, we say that the node has overflowed and the node has to be split into two to fit the new data.
Split happens under two scenarios
- In the case of leaf nodes, if the node holds up to N key-value pairs and we want to insert one more key-value pair.
- In the case of non-leaf nodes, if the node holds up to N+1 pointers and we want to insert one more pointer to accommodate the newly inserted key-value pair.
Splits are done by allocating the new node, transferring half the elements from the splitting node to it, and promoting the first key and pointer to the parent node. The index at which the split is performed is called the split point. This operation can propagate recursively up the tree until the root node.
When the tree reaches its capacity we have to split the root node. When the root node is split, a new root, holding a split point key, is allocated increasing the tree height by one.
To find a key in a B+ tree, the algorithm starts from the root and performs a binary search, comparing the key with the keys in the node until we find the first separator key that is greater than the searched value (predecessor node).
In the case of point queries, we continue the search from the predecessor node to the leaf until we find an element or the element does not exist. In the case of range queries, we use the predecessor node to find the closest found key-value pair and continues by following sibling pointers until the range condition becomes false.
Updates use the same lookup algorithm as an insert to find the predecessor node to update. Once the target leaf node is found, it updates the corresponding key-value pair in-place.
Deletes use the same lookup algorithm as an insert to find the predecessor node to delete. Once the target leaf node is found, it deletes the corresponding key-value pair in-place.
If the occupancy after deletion of neighboring nodes falls below a certain threshold, the condition is called underflow and neighboring nodes are merged in this case.
Merge happens under two scenarios
- In the case of leaf nodes, if a node can hold up to N key-value pairs, and the combined number of key-value pairs in two neighboring nodes is less than or equal to N
- In the case of non-leaf nodes, if a node can hold up to N+1 pointers, and the combined number of pointers in two neighboring nodes is less than or equal to N+1
Merge operation can propagate recursively up to the root.
Deletes only remove cell offsets from the header. When the page is split, only offsets are trimmed. This leads to page fragmentation.
Most of the B+ tree implementations run compaction/defragmentation periodically and asynchronously to reclaim unused spaces and compacting.
An inverted index is a data structure that is majorly used in search engines such as Elasticsearch facilitating structured and full-text searches.
As the name suggests, it inverts the traditional index. Given a sentence, it breaks the sentence into individual words(referred to as tokens) based on a tokenizer and for each token, it stores the list of sentences that contain a particular word.
These words are sorted which internally uses SSTables.
Most of the search engines normalize the tokens before storing it in the inverted index.
The process of tokenization and normalization is called analysis.
The analysis goes through three important steps
- Character filters: The content is passed through the various character filters whose job is to tidy up the content before tokenization. Ex: Stripping of HTML, converting & to & etc…
- Tokenizer: The job of the tokenizer is to split the content into tokens based on a tokenizer which uses different semantics to split the content and emit tokens. There are a variety of tokenizers such as whitespace tokenizer, punctuation tokenizer, language-specific tokenizers etc…
- Token filters: Each token emitted from tokenizer is passed through various token filters which can add more tokens,(Ex: synonyms) remove tokens, (Ex: stop words) replace tokens (Ex: stemming)
The resulting set of token filters is persisted into the inverted index.
Each of the input search queries also goes through the analysis phase before handing over the query to the inverted index.
Each of the inverted index entry store a lot of metadata which might aid in improving search relevance.
Some of the commonly stored metadata include
- Term frequency: How many times, a given token appears in the document. More the number of occurrences will indicate more relevance.
- Inverse Document frequency: How many times, a given token appears in all other documents. More the number of occurrences will indicate less relevance.
- Length of the sentence: Smaller the sentence, more relevant it is.
The inverted index also maintains a Bit Vector the list of document ids for each token to speed up the search (similar to Bloom Filter)
Merkel tree is a tree data structure that is used to verify the consistency and data integrity of data over a large system.
It is predominantly used in blockchain technologies and also in dynamo style databases to fix data inconsistencies between two nodes/replicas. It is also used in version control systems such as Git.
Merkel trees work by taking all data nodes and creating a cryptographic hash out of the data. Each pair of data nodes is merged to create an internal node which is a combined cryptographic hash of the left and right child. This process continues recursively by combining every pair of internal node until the root node is created. The root node denotes the hash of the complete tree.
Whenever the content of any of the data node changes, it’s hash is recomputed and this recomputation propagates to all of its ancestors all the way up to the root node.
This prevents any malicious system corrupting data node since the hash of node and its ancestors do not match on modification.
Dynamo style database (the database which adopts leaderless replication) such as DynamoDB and Cassandra use Merkle tree to fix data inconsistencies between any two nodes during shard reassignment between the nodes during elastic scaling or during backfilling nodes when one of the nodes comes online after being offline for a definitive period.
Whenever we want to verify the data consistency of two systems, we create Merkel trees out of data nodes in both the systems. We then start comparing the hash of root nodes in both the trees. If the hashes match, then the systems are consistent with each other.
If the hash does not match, we traverse the left and right subtree and compare the hashes to find the subtree which has a mismatch. We continue this process recursively until we find data nodes causing the discrepancy.
Consistent Hashing is a distributed hashing technique that solves elastic scaling (when nodes are added or removed, there is a minimal data movement across nodes).
It is largely used in distributed systems such as Cassandra, Riak, Dynamo, Load balancers, caching servers, etc…
Problem with Hashing
Let’s assume we have N servers for storing data that needs to be load balanced for reads and writes. Any incoming request is passed through a hash function f(x) = v, the output of which is passed through a function which computes output modulo N. As long as the hash function is good, reads and writes will be uniformly distributed across servers.
Let’s say we add a new node, bumping up the number of servers to N+1. Now when the input is passed through the same hash function, the output of modulo operation by N+1 gives the wrong server id to read/write data.
Let’s say one of the nodes is taken offline, which brings down the number of servers to N-1. Now when the input is passed through the same hash function, the output of modulo operation N - 1 gives the wrong server id to read/write data.
To solve this problem we need to rehash and redistribute the data across servers when nodes are added/deleted before serving reads and write which is not a feasible solution.
Solution: Consistent Hashing
Consistent hashing technique operates independently of the number of servers or data points in the system.
Consistent hashing works by taking the larger range of numbers the hash function generates (say 0 to INT_MAX) and placing them in a ring such that the last value(INT_MAX) wraps the first value(0).
It then proceeds by placing N servers on the ring at a position using the same hash function which computes output modulo INT_MAX (last value in the ring).
Each read/write request is passed through the same hash function (output module INT_MAX).
- If the output of the function matches with the hash of the server where it is placed, read/write operations are computed on that particular node.
- If the output of the function does not match with the hash of the server where it is placed, we proceed clockwise to find the first server encountered and that becomes the target node for the read/write request.
When we add a new node between two servers, only the keys between the new server and the server to its right need to be remapped and redistributed (num_keys/num_slots need to be moved on average).
Similarly, when we remove a new node, the keys of the deleted node are remapped/redistributed to the server on the right(num_keys/num_slots need to be moved on average).
This way, Consistent hashing achieves minimal redistribution of data on elastic scaling.
Also some of the distributed databases using consistent hashing employ replication to avoid redistribution and having a master/replica strategy.
A skip list is a probabilistic data structure that is built using a linked list as its underlying structure.
The data structure includes lanes of linked lists with each lane containing only fewer elements than the lane below it.
It starts from a sorted linked list which includes all the elements. An express lane is built on top of the original linked list, connecting increasingly sparse subsequences of the items. This way it builds a hierarchy of linked lists.
Each node in the linked list includes four-pointers, a left pointer, a right pointer, a pointer to point to an express lane above it and a pointer to point to a lane below it.
The search always starts from the top lane (which is called express lane) and it continues till the element which is greater than or equal to target is found.
If the element on the express lane is equal to the target, the algorithm uses the down pointer to reach the lowest lane to get the complete data.
If the element on the express lane is greater than the target element, the algorithm goes to the previous node in the express lane and continues the search downwards in the lower lane. This process continues down the lanes until the element is found.
Skip lists are very useful when we want to concurrently access our data structure. The time complexities of Skip lists are similar to Balanced binary search trees like Red-black tree, but it has one advantage over Balanced BSTs.
If we insert a new node into the balanced BSTs, we might have to rebalance the entire tree, and we won’t be able to access our data concurrently while this is going on.
In a skip list, if we have to insert a new node, only the adjacent nodes will be affected, so the data structure can be accessed concurrently.
HyperLogLog is a probabilistic data structure that is used for cardinality estimation in a multiset.
It has sub logarithmic space complexity and O(1) time complexity.
We could achieve the same end result using a set and adding every element to the set and finding the length of the set. This results in a huge set and storing every single element in memory. This explodes space complexity linearly with data size. This solution cannot scale well for billions of elements.
The fundamental idea behind HyperLogLog — when input is to be added to a given multiset, it is passed through a hash function which generates a bitstring.
If the hash function generates a uniform distribution of bitstring, we will have a 50% probability of bitstring starting with 0 and the other 50% bitstring starting with 1. Similarly, 1/4th of the bitstrings start with 00, 1/8th of the bitstrings start with 000.
The problem of finding cardinality estimation now boils down to finding the longest sequence of leading 0s.
If the longest sequence is 000, then we could approximately estimate there are 8 unique elements in the set.
Internally the bitstring is split into multiple segments, and the cardinality is computed as the harmonic mean of 2^z for each of those segments. Bias from hash collisions is handled by a multiplicative constant, and linear counting is used to give more precise counts for smaller sets.
It is used in a large number of databases such as Redis, Elasticsearch, Google BigQuery, etc…
Count Min Sketch
Count Min Sketch is a probabilistic data structure that is used to find the frequencies of elements in a stream of data.
We could achieve the same end result using a hash table by storing element value as a hash key and counter as the hash value which gets incremented every time a particular element is encountered.
This results in a huge hash table and storing every single element in memory. This explodes space complexity linearly with data size. This solution cannot scale well for billions of elements.
Count min sketch uses sub-linear space complexity and uses various hash functions to map elements to their frequencies. The only drawback of the algorithm is it is probabilistic and it can overcount (it cannot undercount) some of the elements due to collisions.
Count min sketch typically employs a matrix with each row denoting a hash function and each column denoting the range of values produced by the hash function.
The matrix is initialized to all 0s. For a given input, the algorithm applies all the hash functions to get a list of outputs. These outputs are incremented in their respective index positions in the matrix with the hash function as the row index(i) and output of hash function as the column index(j).
When we want to count the number of occurrences for a given input, we apply the same hash functions to the input and we get the corresponding value from the index position in the matrix with the hash function as the row index(i) and output of hash function as the column index(j). We take the minimum of all these values and that is the frequency of a particular input. This decision of minimum is needed to handle cases of collision.
Since the matrix uses hash functions as rows and range of numbers generated by hash functions as columns, the space complexity of the algorithm is sublinear irrespective of the size of the stream of data.