Internals of Redis

sudan
20 min readJan 19, 2025

--

In this article, we will dive deep into the internals of the Redis database.

We will start with Redis's single-threaded, non-blocking I/O architecture and explain how the Event loop is implemented. We will then gradually cover the implementation details of various data structures in Redis, followed by Redis persistence strategies.

Redis

Introduction

Redis is an open-source in-memory database written in C programming language by Salvatore Sanfilippo in 2009.

It is essentially an in-memory key-value datastore where the key is a string and the value can be one of different data structures, such as a HashMap, Set, Sorted Set, String, or List. Hence, it is also referred to as a data structure server.

The primary use case of Redis is mainly around caching with optional persistence.

Process and Filesystem

Before we get into Redis internals, we need to understand the core concepts: Process and Filesystem.

Process

A Program is an algorithm expressed as a sequence of instructions and a Process is an instance of a program being executed.

Process memory space is divided into the following four sections:

  • Text: Contains the program’s compiled code loaded into memory from the disk.
  • Data: Contains the static and global variables initialized during the start of the process.
  • Heap: Contains dynamically allocated variables for the program and remains in the heap until explicitly destroyed.
  • Stack: Contains local variables and return values initialized during the function call and destroyed when the function returns. The Stack and Heap sections grow in opposite directions.

Each process maintains a Process Control Block(PCB) to store information about a process.

PCB

Each PCB contains the following pieces of information:

  • PID and Parent PID: PID(Process Identifier) is the unique identifier for the process. Parent PID is the PID of the parent process.
  • Process state: Indicates the process's state: Created, Ready (waiting for CPU slice), Running (running on CPU slice), Waiting (waiting on I/O event), and Terminated.
  • Program Counters and Registers: The program counter indicates the address of the next instruction to be executed for the process. Registers serve as temporary context storage when the process is swapped out of CPU on interrupt.
  • CPU scheduling information: Contains information about the pointers to the queues holding PCBs.
  • Memory Management Information: Contains information about page tables and segment tables.
  • List of open files: Contains the list of files opened by the process.

Filesystem

The Filesystem is a critical operating system component, responsible for organizing, storing, and retrieving data on a storage medium.

It facilitates the efficient organization of data into files and directories. Additionally, it handles efficient space management by tracking free space and provides data integrity and crash recovery. Examples of filesystems include: ext2, ext3, XFS, ZFS etc…

Data Block

A data block is the smallest unit of data storage in a filesystem. Files are broken down into multiple blocks and stored on a storage medium. Typical block sizes are 512 bytes, 1 KB, 2 KB, 4 KB, and 8 KB.

Inode

An Inode stores metadata about a file or directory. It includes:

  • Inode number
  • Type of file such as regular, directory, symbolic links, etc…
  • Permissions (Read, Write, Execute permission on User, Group, and Others)
  • Owner and size
  • Timestamp ( Access time, modification time, created time)
  • Reference counter on active references
  • Pointer to data blocks

File descriptor (FD)

Table hierarchy of file system mapping

Each process has a Process Control Block (PCB) which stores information about the process. It also has a File Descriptor Table which stores all the file descriptors associated with that process.

A file descriptor is a process-unique identifier for an open connection to a file. Every Unix process should have three file descriptors corresponding to three standard streams: Standard Input (file descriptor 0), Standard Output (file descriptor 1), and Standard Error (file descriptor 2).

Other open files correspond to file descriptors with descriptor IDs higher than standard streams.

Each file descriptor table maps to a single System Wide Open File table(SWOF) that stores open connections for the files. Each entry in this table includes a PID, File descriptor ID, mode of file access (read, write, or append), and offset of the file at which the file is being operated.

Different processes may open the same file with different access and offsets. Similarly, the same process can have multiple connections to a file with different offsets and access. Hence the SWOF table is kept as a separate table holding all the open connections to the files.

SWOF Table maps into an Inode table which stores the Inode number and metadata about the file along with pointers to data blocks.

Socket

A Socket is a virtual endpoint that acts as a communication channel between two applications running on the same or different network device facilitating data exchange.

A Socket is identified by a combination of IP Address and port number specifying the host server and the application inside it for data exchange.

select(), poll() and epoll()

Before discussing event loop architecture, we need to understand how select(), poll(), and epoll() work in Unix/Linux to provide I/O multiplexing.

These are predominantly used in network programming and applications with multiple concurrent connections to provide non-blocking I/O.

I/O multiplexing

I/O stands for Input/Output which refers to the data being transferred from one device to another.

For example, when a user inputs data through the keyboard, it is transferred to the main memory and then stored on the disk. Similarly, when data is read from the disk, the output is transferred to the main memory and then displayed to the user.

Reading and writing data to disk or the network is considered an I/O operation. The CPU is the compute processing unit while I/O operations involve waiting for disk or network to provide input and output.

I/O intensive applications are those applications that have predominantly I/O operations whereas CPU intensive applications are those applications that are predominantly compute based.

  • Blocking I/O: Process makes an I/O request to the operating system and waits till it returns with the data.
  • Non-Blocking I/O: Process performs other operations while waiting for data, rather than blocking until data arrives.

I/O multiplexing is a mechanism the operating system uses to notify a process when it is ready for I/O. It works by monitoring multiple file descriptors simultaneously and identifying the ones ready for I/O.

It improves the application’s throughput for I/O intensive applications providing a non-blocking I/O model.

select()

The select() system call in Unix enables I/O multiplexing by accepting a set of file descriptors and determining if any of them are ready for I/O.

int select(
int nfds,
fd_set * readfds,
fd_set * writefds,
fd_set * exceptfds,
struct timeval * timeout);

The command takes the following parameters:

  • readfds: File descriptors to be monitored to check for read readiness.
  • writefds: File descriptors to be monitored to check for write readiness.
  • exceptfds: File descriptors to monitor for exceptional conditions.
  • nfds: Set to the highest-numbered file descriptor in all of the three sets readfds, writefds, exceptfds plus 1.
  • timeout: Indicates the timeout the command should be blocked when no file descriptor in the set is ready.

The application starts by initializing the set of file descriptors that require monitoring for reading, writing, and exceptional conditions. The data structure fd_set operates on bitmap where every bit set corresponds to a file descriptor ID being monitored.

When a select() system call is made, the kernel iterates through all the file descriptors and checks if they are ready for I/O.

If any file descriptors are ready, the kernel sets the appropriate bit in the respective bitmap of read, write, or exception struct in response. If none of the file descriptors are ready, the kernel blocks the process until the specified timeout.

Limitations of the select() system call:

  • Every call to select() requires the kernel to iterate through all the file descriptors and compute if they are ready for I/O operation.
  • Every call to select() requires the kernel to examine all the bits in the respective bitmaps even if many of the file descriptor IDs are inactive. This also leads to larger bitmasks.
  • The FD_SETSIZE limits the number of file descriptors.
  • Since the same variable of fd_set is reused between the process and the kernel, the kernel overwrites the variable and the application has to keep a copy of the bitmask and re-build every time on overwrite.

Once select() returns, the application iterates through all the bits including the inactive file descriptors to determine the file descriptors ready for I/O.

poll()

poll() system call in Linux, also facilitates I/O multiplexing similar to select() system call with a slightly performant implementation.

int poll(struct pollfd *fds, nfds_t nfds, int timeout);

struct pollfd {
int fd; /* file descriptor */
short events; /* requested events */
short revents; /* returned events */
};

The command takes the following parameters:

  • fds: A set of file descriptors to monitor for I/O operations.
  • nfds: Specifies the number of file descriptors in fds.
  • timeout: Indicates the timeout poll() should be blocked when no file descriptor set is ready with the event.

Each entry in fdsis a structure that includes:

  • fd: A file descriptor to be monitored.
  • events: events is an input parameter, a bitmask specifying the events the application is interested in for the given file descriptor.
  • revents: revents is an output parameter which is also a bitmask filled by the operating system with the events that occurred for the given file descriptor.

Some of the events include:

  • POLLIN: Indicates data is available to be read on the file descriptor.
  • POLLOUT: The file descriptor is ready to accept data for writing.
  • POLLERR: There is an error on the file descriptor.

Unlike select(), poll() system calls work efficiently, as they store only the active file descriptors that need to be monitored.

Additionally, it also separates the variables for the events that are required to be monitored from the events that occurred, avoiding overwrites.

epoll()

epoll (Event poll) is a Linux-specific construct serving the same purpose as select() and poll() facilitating multiplexing of I/O operations but in an optimized way.

epoll is a data structure, not a system call that allows an application to multiplex I/O operations. It works on three function calls:

Unlike select() / poll(), epoll() monitors the underlying file descriptors and whenever they are ready for I/O, the kernel adds them to the ready list directly without waiting for the application to call epoll_wait().

Hence, whenever the application calls epoll_wait(), instead of iterating the file descriptors and determining if any of them are ready for I/O, the kernel hands over the ready list directly to the application.

This is in contrast to the select() / poll() system call which iterates through all the file descriptors passed to determine I/O readiness.

Event Loop

Having understood select(), poll(), and epoll() commands, we are better equipped to understand the Event loop in Redis.

Redis is built on a single-threaded, event-driven, non-blocking I/O model leveraging event loop architecture underneath.

  • Single Threaded: Redis runs on a single-threaded model for processing inputs and outputs from the client. By using a single thread, Redis avoids multi-threaded complexity, such as synchronization and locks.
  • I/O bound: Redis is an I/O-bound system because most operations involve reading and writing to memory rather than heavy computations on the CPU.
  • Non-blocking I/O: Redis is built on a non-blocking I/O model using Event loop architecture. It allows operations (e.g., reading from or writing to a socket) to execute without waiting for the operation to complete.
Event loop

Client connection

Redis maintains a client connection data structure representing every connected client. Each client connection is a file descriptor associated with a socket. It includes:

  • Metadata and state of every connected client.
  • Client commands to execute (input) and outgoing responses to be sent back to the client(output).

Every connected client has its instance stored in a global dictionary in Redis with a key as the file descriptor associated with the socket and client connection as the value.

Following are the important pieces of information associated with a client connection:

Poller

Poller is an operating system construct that handles I/O multiplexing efficiently on multiple file descriptors. It internally uses system calls that power I/O multiplexing.

In Linux, epoll() command is used; whereas in MacOS, Kqueue() command is used. It falls back to select() and poll() commands in other operating systems.

The poller component is responsible for:

  • Monitors file descriptors with an incoming command and adds it to the event queue (discussed in the next section) for input processing.
  • Monitors the command execution queue(discussed in the next section) and adds it to the event queue indicating the readiness of response to be written to the client.
  • Checks for timer events such as expired keys, heartbeat pings to the replica server, and periodic snapshots and adds them to the event queue for processing.

Event queue

An Event queue is a conceptual(not physical) data structure in Redis that holds events detected by the Poller, waiting to be processed by the event loop. These events can be File Events or Time Events.

  • File Events handle client-related inputs and outputs. A file represents a file descriptor network socket. They are primarily responsible for reading commands from the clients and writing responses to clients.
  • Time Events are tasks scheduled to run at specific intervals or after a delay.

Data Structure of File Events and Time Events

Event loop

The Event loop is the heart of the Redis, responsible for monitoring incoming commands from the client, executing them efficiently, and sending the response back to the client.

The event loop is single-threaded and runs in an infinite cycle:

  • Monitors new incoming events from the Event queue and processes them using callback handlers.
  • Monitors new outgoing events from the Event queue and places them on the client’s socket as output to be returned to the client.
  • Monitors time events such as expired keys, and periodic snapshotting, and triggers appropriate callbacks.

The event loop is primarily responsible for processing File Events and Time Events continuously and efficiently.

Execution

Now that we understand the different building blocks of an Event loop architecture, we can move forward to stitch these pieces together.

Connection Initialization

When a new client connects to the Redis server, a socket connection is created, and the corresponding file descriptor is associated with the socket.

The client is then subsequently added to the client structure dictionary in Redis. The file descriptor is also registered with the Poller component to monitor for read and write operations.

Incoming requests

When the client sends a command for example SET key value , the command is sent over the socket connection to the Redis server. The file descriptor of the client’s socket becomes readable.

Poller monitors the file descriptor and when it becomes available for reads, it adds it to the list of file descriptors available for reads.

When the event loop invokes Poller for the file descriptors ready for I/O operations, the corresponding file descriptor is returned.

The event loop then parses the command SET key value from the socket and places it in the querybuf input buffer.

Once the query is parsed, it is passed to the command execution module.

Command execution module

The Command execution module is handled by the main thread in the event loop.

Each of the commands has a corresponding execution handler that knows how to execute certain types of commands. For example, SET command’s execution handler identifies the key and stores a value of the type string.

Once the command is executed, the output of the command has to be passed back to the client; in this case, the command’s output OK.

Outgoing response

The event loop formats the response and places it in one of two buffers:

  • buf: If the response is small, Redis places the response in bufdata structure of Redis.
  • reply: If the response is large, Redis places the response in reply data structure which is a linked list construct.

These buffers are per-client buffers stored against the file descriptor in the client structure.

The event loop then checks Poller for the file descriptor sockets ready for writing. The socket becomes available for writes if the corresponding TCP buffer has enough space.

Once the socket becomes writable, the event loop copies the response from the buffer to the socket which in turn is returned to the client and flushes the buffer.

Time Events Processing

Once the file events are processed, the event loop periodically checks for time events from the corresponding data structure ready for execution.

Time events are stored in a priority queue sorted by execution time. The corresponding time event handlers are invoked when the timer expires.

Repeat

The event loop repeats the cycle continuously in an infinite tight loop. By leveraging single-threaded architecture, Redis can process thousands of concurrent requests at low latency.

Limitations of the Event Loop

  • Operations performed by the event loop should be fast. Reads and writes on memory are generally faster, but a single long-running task such as a large Lua script can block the entire thread causing cascading delays.
  • Since Redis uses a single core for processing, its performance may not scale linearly with multi-core CPUs

Data Structures in Redis

This section covers the details of various data structures in Redis.

String

Redis strings are binary-safe and can hold arbitrary sequences of bytes up to 512 MB.

GET key // Returns the value for a given key
SET key value // Set the value for a given key
INCR key // Increments the integer value of key

Redis uses SDS(Simple Dynamic String), a dynamic string library representing strings. It is a custom implementation extending the traditional C implementation of String. It includes:

Length of the string, allocated memory of the string, flags for memory management, actual null-terminated string

SDS

Redis wraps the SDS representation of String with a wrapper object that holds the encoding and type.

Type

It can be a String or Integer.

Encoding

Redis supports three types of encoding depending on the size and the type.

  • embstr is used for shorter strings ( < 39 bytes ) which combines SDS string representation and a wrapper object into a single memory allocation for saving space.
  • raw is used for larger strings ( > 39 bytes ) which allocates SDS string representation and wrapper object in separate memory locations.
  • int is used when the value is numeric and fits into a 64-bit integer.

The combination of SDS structure and wrapper object is considered an entity for representing String in Redis that includes: encoding, data type, and pointer to the actual String representation using SDS.

List

Redis List represents a linked list of String values. It is a doubly linked list which can be traversed in both directions.

LPUSH: Adds a new element to the head of the list.
RPUSH: Adds a new element to the tail of the list.
LPOP: Removes and returns an element from the head of the list.
RPOP: Removes and returns an element from the tail of the list.
LLEN: Length of the list.

Redis Lists are implemented using a data structure called QuickList.

QuickList

Quick List is a doubly linked data structure with the following information:

Head pointer to the head node, Tail pointer to the tail node, Len indicates the number of nodes in the list and Count indicates the total number of elements in the list.

Each node in the Quicklist is called ListPack Nodewhich has a head pointer to the previous node in the list and a tail pointer to the next node.

Each ListPack includes Bytes indicating the size of the Listpack, Length indicating the number of data elements, and a sequence of elements called Entries.

The data node is the ListPack. A ListPack is a compact serialized representation of a sequence of elements designed to reduce memory footprint and provide comparable read and write performance. In the above example, each ListPack is a sequence of 3 elements.

Elements are encoded compactly based on the size and type(integer or string). The encoding scheme is optimized to use as few bytes as possible:

  • Integer: 1 byte for smaller integers, up to 8 bytes for larger integers.
  • String: Stored as length prefixed string. The length of the string can be 1, 2, or 5 bytes depending on the length of the string.

Thus, the Listpack algorithm provides a better memory representation by smart encoding and allocation strategies while the Listpack nodes connected by a QuickList data structure provide bi-directional traversal.

Hash

Redis Hash represents a map of key-value pairs.

HSET mainkey key1 value1 key2 value2 // Set key-value pairs to the mainkey
HGET mainkey key1 // Returns the key associated with mainkey and one of the key inside
HGETALL mainkey // Returns all the key-value pairs associated with the main key

Redis hashes are implemented using two mechanisms:

  • ListPack : Redis uses Listpack (described above) for hash when the number of key-value pairs is below a threshold (dictated by hash-max-listpack-entries) and the size of each key-value pair is below a threshold (dictated by hash-max-listpack-value). Each entry is a key-value pair.
  • Hashtable : Redis implements hash using a HashTable if the ListPack conditions described above are violated.
Hash Table

Redis maintains two dictionary structures namely Dict 1 and Dict 2 of which one of them is active.

Redis computes the hash value using the MurmurHash algorithm for a given key. Then this hash value is mapped into one of the buckets from 0 to N. Each bucket then stores a linked list of key and value pairs.

Set

Redis Set is an unordered collection of unique elements.

SADD key member1 // Adds a new member to the set.
SREM key member1 // Removes a member from the set.
SISMEMBER key member1 // Checks if a string is part of the set.
SINTER key1 key2 // Returns the common element in the two sets.
SUNION key1 key2 // Union of two sets
SCARD key // Returns the size of the set.

Redis implements a set using two implementations depending on the size and the type.

  • Intset: If the elements in the set are integers and the number of elements is small(dictated by set-max-intset-entries configuration), intset is used which stores elements in a sorted array. This ensures continuous memory allocation and provides efficient lookups using binary search.
  • Hashtable : If the elements in the set are non-integers or larger than set-max-intset-entries , then the elements are represented in a Hashtable where the key is the element and the value is NULL.

Sorted Set

Redis Sorted Set is a collection of unique elements ordered by an associated score.

ZADD key score value // Add a value with score for a given key
ZRANGE key start stop [WITHSCORES] // Return members in ascending order of scores between start and stop indexes
ZRANGEBYSCORE key min max [WITHSCORES] [LIMIT offset count] // Retrieve members with scores within the range of min, max
ZRANK key member // Returns the rank of member in a given key in ascending order
ZSCORE key member // Returns the score of member in a given key
  • ListPack: Redis uses ListPack (described above) when the number of elements is below a threshold (dictated by zset-max-listpack-entries) and the size of each member is below a threshold (dictated by zset-max-listpack-value).
  • Skiplist and HashTable: Redis uses a combination of Skiplist and HashTable if the threshold of ListPack is breached.

Redis Hash table

Redis has a global dictionary (or database keyspace) where all keys and their associated values are stored. This dictionary is implemented as a hash table within each Redis database.

The Global dictionary implementation is similar to the Redis hash table data structure.

Channels

Redis channels are part of Redis’s publish/subscribe messaging system. It allows clients to subscribe to channels and receive messages published in real-time.

  • Channels: Named communication endpoints to which clients can subscribe or publish messages.
  • Subscriptions: A client subscribes to one or more channels or patterns
  • Messages: Any client publishing a message to a channel sends the message to all clients subscribed to that channel.

There are two types of channel subscriptions:

  • Channel Name: Redis uses a HashTable to store channels and their respective subscribers where the key is the channel name and the value is a list of client structures.
  • Pattern: Pattern subscriptions are stored in another HashTable where the key is the pattern and the value is a list of client structures of subscribers.

Redis Persistence

Redis provides two persistence options: RDB(Redis database file) and AOF(Append only file) to make the volatile data in memory durable.

RDB

Redis snapshots the entire dataset and saves it to a binary file RDB at regular intervals or specific triggers. This process can be a blocking call using the SAVE command or a background process using the BGSAVE command. During graceful shutdown, Redis automatically creates an RDB file.

SAVE <seconds> <keys_changed> // invoke RDB if atleast <keys_changed> number of keys changed in the last <seconds> time frame.
BGSAVE <seconds> <keys_changed> // Same as SAVE command but triggered asynchronous

When a snapshot is initiated using the background process, Redis forks a child process which writes the dataset to disk, while the main process continues to serve client requests.

Redis uses the Copy-on-Write technique to take snapshots of the dataset. It generates an RDB file which is the binary representation of the dataset. Upon restart, Redis reads the RDB file and reconstructs the dataset in memory.

RDB is a binary file format designed to encode the in-memory state of Redis data structures in a highly compact representation. The structure of the format is described below:

[Header] -> [Database N] -> [Key-Value Pairs encoded] -> EOF -> [Checksum]

The header includes the metadata such as the Redis version. Database N represents the database for which a snapshot is taken.

This is followed by key-value pairs serialized using efficient encoding to minimize size.

Each key in Redis is serialized along with the Key Type indicating the data type, the Key Name serialized as a length-prefixed string and TTL indicating expiration time.

Each value in Redis is serialized using different encoding techniques to ensure compactness.

  • String: Redis uses length-prefixed encoding for string and strings representing integers are encoded with binary integers.
  • List: Redis encodes lists using ListPack for a smaller list for better efficiency and Quicklist for larger lists.
  • Set: Redis encodes Set using intset for smaller integer sets and HashTable otherwise.
  • Hashes: Redis encodes hashes using ListPack for smaller hashes and HashTable otherwise.
  • Sorted Set: Redis encodes Sorted sets using ListPack for smaller sets, while it leverages SkipList and HashTable otherwise.

RDB format adds a checksum at the end to maintain the integrity of the data.

Once the data is encoded, Redis compresses certain parts of the RDB file to reduce disk usage. Large values (e.g., strings, lists) are compressed using the LZF algorithm. Compression occurs on a per-value basis.

AOF

AOF logs every write operation in a human-readable format as a sequence of commands. Operations are appended to a log file in the order they are executed.

Over time, AOF files can grow large. Redis periodically rewrites the file to remove redundant operations. (Ex: collapsing multiple INCR commands into a single SET)

There are three types of synchronization policies to sync the file to disk:

  • always : After every write.
  • everysec : Every second.
  • no : The operating system handles the synchronization

On restart, Redis replays the commands in the AOF file to reconstruct the dataset during startup. This can take longer than loading an RDB file, for a larger dataset.

Hence, Redis allows the use of both RDB and AOF for persistence. RDB provides faster restarts and periodic backups. AOF ensures higher durability by logging every operation. This hybrid approach appends commands to the AOF file while periodically synchronizing with RDB snapshots to simplify recovery and reduce AOF size.

This completes the article on the internals of the Redis database.

Appendix

  1. https://wangziqi2013.github.io/article/2023/02/12/redis-notes.html
  2. https://copyconstruct.medium.com/the-method-to-epolls-madness-d9d2d6378642
  3. https://en.wikipedia.org/wiki/File_descriptor
  4. https://medium.com/devops-dev/redis-underlying-data-structure-everything-you-need-to-know-aafd99840c2b

--

--

sudan
sudan

No responses yet